OpenSHMEM 2015

AGENDA - Technical Presentations and Invited Talks

Wednesday August 5th - Technical Presentations

7:00 AM: Breakfast (Provided)

8:00 AM: Introduction
Neena Imam and Manjunath Gorentla Venkata

8:15 AM: Keynote: Eight Unquestioned Assumptions Blocking SHMEM Exascale Computing
Dr. Thomas Sterling, Indiana University

The extraordinary momentum of Moore's Law has advanced HPC to the petaops performance regime even as the exponential progress of the enabling technologies is asymptotically flat-lining near the nanometer threshold. However, HPC system architecture and programming models have struggled to keep up with increasingly complicated heterogeneous structures and the multi-layered programming methods imposed on the user community. While SHMEM models convey unifying principles for a return to efficient and scalable computing, a number of underlying assumptions that go unquestioned continue to force the likely first generation of exascale HPC into an increasingly limited form. In part these include restrictions of the commercial market, legacy codes, irrelevant benchmarking, and a culture of evolutionary incrementalism. Among the consequences of this lemming-like, community-wide approach are reduced generality, lack of performance portability, poorer efficiency, and degraded user productivity. This presentation will make explicit eight unquestioned assumptions permeating the conventional HPC trajectory and describe alternative advances, based on the ParalleX execution model, that will make possible the opportunities of future SHMEM exascale computing.


Session 1
Session Chair - Stephen Poole


9:05 AM: Dynamic Analysis to Support Program Development with the Textually Aligned Property for OpenSHMEM Collectives
Andreas Knüpfer, Tobias Hilbrich, Joachim Protze and Joseph Schuchart

The development of correct high performance computing applications is challenged by software defects that result from parallel programming. We present an automatic tool that provides novel correctness capabilities for application developers of OpenSHMEM applications. These applications follow a Single Program Multiple Data (SPMD) model of parallel programming. A strict form of SPMD programming requires that certain types of operations are textually aligned, i.e., they need to be called from the same source code line in every process. This paper proposes and demonstrates run-time checks that assert such behavior for OpenSHMEM collective communication calls. The resulting tool helps to check program consistency in an automatic and scalable fashion. We introduce the types of checks that we cover and include strict checks that help application developers to detect deviations from expected program behavior. Further, we discuss how we can utilize a parallel tools infrastructure to achieve a scalable and maintainable implementation for these checks. Finally, we discuss an extension of our checks towards further types of OpenSHMEM operations.
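
A minimal sketch of the textual-alignment idea (not the authors' tool, which builds on a scalable parallel tools infrastructure): every PE publishes the source line from which it is about to call a collective, and one PE flags any mismatch. The name check_aligned and the use of PE 0 as the checker are illustrative assumptions.

    #include <shmem.h>
    #include <stdio.h>

    static long *call_sites;   /* symmetric array, one slot per PE */

    void check_aligned(long line)
    {
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Every PE publishes the line number it is calling from to PE 0. */
        shmem_long_put(&call_sites[me], &line, 1, 0);
        shmem_barrier_all();

        if (me == 0) {
            for (int pe = 1; pe < npes; pe++) {
                if (call_sites[pe] != call_sites[0])
                    fprintf(stderr,
                            "collective not textually aligned: PE %d at line %ld, "
                            "PE 0 at line %ld\n", pe, call_sites[pe], call_sites[0]);
            }
        }
        shmem_barrier_all();
    }

    #define CHECK_ALIGNED() check_aligned((long)__LINE__)

    int main(void)
    {
        shmem_init();
        call_sites = shmem_malloc(shmem_n_pes() * sizeof(long));

        CHECK_ALIGNED();
        shmem_barrier_all();        /* the collective being checked */

        shmem_finalize();
        return 0;
    }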

9:30 AM: Check-pointing Approach for Fault Tolerance in OpenSHMEM
Pengfei Hao, Swaroop Pophale, Pavel Shamis, Anthony Curtis and Barbara Chapman

Fault tolerance for long-running applications is critical to guard against failure of either compute resources or the network. Accomplishing this in software is a non-trivial task, and there is an added level of complexity in implementing a working model for a one-sided communication library like OpenSHMEM, because there is no matching communication call at the target processing element (PE). In this paper we explore two fault tolerance methods based on the checkpoint-and-restart scheme that cater to the one-sided nature of the PGAS programming model while leveraging features unique to OpenSHMEM. The two methods address different fault situations, but through working implementations with the 1-D Jacobi code we show that both are scalable and provide considerable savings of computational resources.
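
As a point of reference only, the following is a hedged sketch of a coordinated checkpoint for a 1-D Jacobi kernel, mirroring each PE's symmetric working array on a partner PE; it is not the paper's protocol, and N, NITER, the checkpoint interval, and the partner choice are illustrative assumptions.

    #include <shmem.h>
    #include <string.h>

    #define N          1024
    #define NITER      1000
    #define CKPT_EVERY  100

    double u[N], u_new[N];     /* symmetric working arrays          */
    double ckpt[N];            /* symmetric mirror of a partner's u */

    int main(void)
    {
        shmem_init();
        int me      = shmem_my_pe();
        int npes    = shmem_n_pes();
        int partner = (me + 1) % npes;   /* PE that keeps my checkpoint copy */

        for (int it = 0; it < NITER; it++) {
            /* ... 1-D Jacobi sweep: update u_new from u, exchange halos ... */
            memcpy(u, u_new, sizeof(u));

            if (it % CKPT_EVERY == 0) {
                shmem_barrier_all();                    /* consistent cut      */
                shmem_double_put(ckpt, u, N, partner);  /* mirror my state     */
                shmem_barrier_all();                    /* checkpoint complete */
            }
        }
        shmem_finalize();
        return 0;
    }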

9:55 AM: From MPI to OpenSHMEM: Porting LAMMPS
Chunyan Tang, Aurelien Bouteiller, Thomas Herault, George Bosilca and Manjunath Gorentla Venkata

This work details the opportunities and challenges of porting a petascale, MPI-based application - LAMMPS - to OpenSHMEM. We investigate the major programming challenges stemming from the differences in communication semantics, address space organization, and synchronization operations between the two programming models. This work provides several approaches to solving those challenges for representative communication patterns in LAMMPS, e.g., by utilizing group synchronization, checking the peer's buffer status, and directly transferring unpacked scattered data. The performance of LAMMPS is evaluated on the Titan HPC system at ORNL. The OpenSHMEM implementations are compared with the MPI versions in terms of both strong and weak scaling. The results show that OpenSHMEM provides rich semantics for implementing scalable scientific applications. In addition, the experiments demonstrate that OpenSHMEM can compete with, and often improve on, the optimized MPI implementation.
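
One representative pattern from such a port, sketched under assumptions (this is not LAMMPS code): a two-sided MPI exchange replaced by a one-sided put into the peer's symmetric buffer plus a flag write that lets the peer check its buffer status. The buffer sizes and the fixed peer are illustrative.

    #include <shmem.h>
    #include <stddef.h>

    #define BUF_LEN 4096

    double send_buf[BUF_LEN];
    double recv_buf[BUF_LEN];    /* symmetric: the peer puts into this   */
    long   arrived = 0;          /* symmetric flag checked by the target */

    void exchange_with(int peer, size_t n)
    {
        /* Deliver the payload, then raise the peer's flag; the fence keeps
         * the two puts in order. */
        shmem_double_put(recv_buf, send_buf, n, peer);
        shmem_fence();
        shmem_long_p(&arrived, 1, peer);

        /* Wait until the symmetric operation from the peer has landed. */
        shmem_long_wait_until(&arrived, SHMEM_CMP_EQ, 1);
        arrived = 0;   /* a real code would guard reuse of the flag */
    }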


Session 2
Session Chair - Pavel Shamis


INVITED TALK
10:30 AM: Invited Talk: Intel's Multifaceted PGAS Activities and Community Engagements
Ulf Hanebutte, Intel

In this talk, we will discuss recent PGAS research work, new approaches to HPC networking, and proposed extensions to the OpenSHMEM specification.

11:20 AM: Extending the Strided Communication Interface in OpenSHMEM
Naveen Namashivayam, Dounia Khaldi, Deepak Eachempati and Barbara Chapman

OpenSHMEM is a library interface specification which has resulted from a unification effort among various vendors and users of SHMEM libraries. OpenSHMEM includes routines which aim to support a PGAS programming model, encompassing data management, one-sided communication, atomics, synchronization, collectives, and mutual exclusion. In the work described in this paper, we investigated the usage and performance of strided communication routines. Moreover, we propose and describe an implementation for new strided communication routines, shmem_iputmem and shmem_igetmem, which enable a more general means for expressing communications entailing data transfers for two-dimensional subarrays or for arrays of structures. We demonstrate the use of these routines on a halo exchange benchmark for which we achieved, on average, a 64.27% improvement compared to the baseline implementation using non-strided communication routines and a 63.37% improvement compared to the one using existing strided communication routines.
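
A hedged usage sketch: the prototype of shmem_iputmem below is assumed from the abstract (byte strides, block size, block count); the paper defines the actual interface. The example copies one column of a 2-D array of structures, a transfer the existing element-typed shmem_iput routines cannot express.

    #include <shmem.h>
    #include <stddef.h>

    typedef struct { double t; double p; } cell_t;

    #define NX 64
    #define NY 64

    cell_t grid[NX][NY];   /* symmetric 2-D array of structures */

    /* Assumed prototype of the proposed routine (strides in bytes). */
    void shmem_iputmem(void *dest, const void *source,
                       ptrdiff_t dst_stride, ptrdiff_t src_stride,
                       size_t blk_size, size_t nblocks, int pe);

    void send_column(int peer)
    {
        /* Copy column 1 of my grid into column NY-1 of the peer's grid:
         * NX blocks of sizeof(cell_t) bytes, one full row apart. */
        shmem_iputmem(&grid[0][NY - 1], &grid[0][1],
                      NY * sizeof(cell_t),   /* destination stride */
                      NY * sizeof(cell_t),   /* source stride      */
                      sizeof(cell_t),        /* block size         */
                      NX,                    /* number of blocks   */
                      peer);
    }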

11:45 AM: Graph 500 in OpenSHMEM
Eduardo D'Azevedo and Neena Imam

This paper describes the effort to implement the Graph 500 benchmark using OpenSHMEM, based on the MPI-2 one-sided version. The Graph 500 benchmark performs a breadth-first search in parallel on a large, randomly generated undirected graph and can be implemented using basic MPI-1 and MPI-2 one-sided communication. Graph 500 requires atomic bit-wise operations on unsigned long integers, but neither atomic bit-wise operations nor support for unsigned long are available in OpenSHMEM. The needed bit-wise atomic operations and unsigned long support are implemented using atomic conditional swap (CSWAP) on signed long integers. Preliminary results comparing the OpenSHMEM and MPI-2 one-sided implementations on a Silicon Graphics Incorporated (SGI) cluster and the Cray XK7 are presented.
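
A minimal sketch of the emulation the abstract describes: an atomic bit-wise OR on a remote 64-bit word built from the OpenSHMEM conditional swap on signed long, retrying until the swap succeeds. The helper name atomic_or_long is illustrative.

    #include <shmem.h>

    long visited_bits;   /* symmetric word of visit flags */

    /* Atomically OR `mask` into `target` on PE `pe`; returns the old value. */
    long atomic_or_long(long *target, long mask, int pe)
    {
        long old = shmem_long_g(target, pe);
        for (;;) {
            long seen = shmem_long_cswap(target, old, old | mask, pe);
            if (seen == old)     /* the swap won: the OR is in place      */
                return seen;
            old = seen;          /* lost the race; retry with fresh value */
        }
    }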


Session 3
Session Chair - Sameer Shende


INVITED TALK
1:00 PM: Invited Talk: OpenSHMEM - The InfiniBand Advantage
Rich Graham, Mellanox

The OpenSHMEM specification forms a good foundation for supporting low-overhead, distributed, and asynchronous computation. Mellanox has been developing hardware and software capabilities that provide the building blocks needed for the effective use of OpenSHMEM-based applications. This presentation will describe some of these hardware capabilities, with an emphasis on their instantiation in the EDR ConnectX-4 and Switch-IB hardware. In addition, the talk will describe Mellanox's work to develop a production-grade OpenSHMEM implementation.

1:50 PM: Scalable Out-of-core OpenSHMEM Library for HPC
Antonio Gómez-Iglesias, Jerome Vienne, Khaled Hamidouche, William Barth and Dhabaleswar Panda

Many HPC applications have memory requirements that exceed the typical memory available on the compute nodes. While many HPC installations have resources with very large memory installed, a more portable solution for those applications is to implement an out-of-core method. This out-of-core mechanism offloads part of the data, typically onto disk, when the data is not required. However, this presents a problem in parallel codes, since the scalability of the approach is clearly limited by disk latency and bandwidth. Moreover, with parallel file systems this design can lead to high loads on the file system and even to failures. We present a library that provides out-of-core functionality by making use of the main memory of dedicated compute nodes. This library provides good performance scalability and reduces the impact on the parallel file system by using only the local disk of each node. We have implemented an OpenSHMEM version of this library and compared the performance of this implementation with MPI. OpenSHMEM, together with other Partitioned Global Address Space models, represents one approach for improving the performance of parallel applications on the path to exascale. In this paper we show that OpenSHMEM is an excellent fit for this type of application.
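
A hedged sketch of the underlying idea (not the library's actual API): out-of-core blocks are staged in the symmetric heap of a dedicated memory-server PE instead of on disk. The function names, block size, and bookkeeping are illustrative assumptions.

    #include <shmem.h>
    #include <stddef.h>

    #define BLOCK_DOUBLES (1 << 20)   /* illustrative block size */

    static double *remote_pool;   /* symmetric; holds blocks on the server PE */

    /* Collective: every PE calls this once at start-up. */
    void ooc_init(size_t nblocks)
    {
        remote_pool = shmem_malloc(nblocks * BLOCK_DOUBLES * sizeof(double));
    }

    /* Evict a local block to slot `slot` on the memory-server PE. */
    void ooc_write(const double *block, size_t slot, int server_pe)
    {
        shmem_double_put(remote_pool + slot * BLOCK_DOUBLES,
                         block, BLOCK_DOUBLES, server_pe);
        shmem_quiet();   /* ensure delivery before the local buffer is reused */
    }

    /* Fetch a block back when the computation needs it again. */
    void ooc_read(double *block, size_t slot, int server_pe)
    {
        shmem_double_get(block, remote_pool + slot * BLOCK_DOUBLES,
                         BLOCK_DOUBLES, server_pe);
    }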

2:15 PM: Proposing OpenSHMEM Extensions Towards a Future for Hybrid Programming and Heterogeneous Computing
David Knaak and Naveen Namashivayam

SHMEM is an important and popular Partitioned Global Address Space (PGAS) programming model. In the past few years, work has been done to standardize SHMEM under the umbrella of the OpenSHMEM Project. In March 2015, OpenSHMEM approved the API Specification Version 1.2. But even before the new specification was approved, there was a recognition that additions to the API specification are desirable to improve ease of programming and to provide better performance opportunities. Some of these extensions are particularly important for improving performance on heterogeneous system architectures with, for example, multi-core processors, processor accelerators, distributed memory, and heterogeneous memories. Cray Inc. has worked closely with some of its customers to define extensions and has already implemented many of them in Cray SHMEM. In some cases, other organizations have proposals for similar functionality. Cray is working within OpenSHMEM to agree on standardization of these and other extensions. This paper summarizes some of the Cray extensions and describes their benefits. Cray will propose, or in some cases has already proposed, these extensions in detail to OpenSHMEM. Cray will work with other organizations with the goal of reaching consensus on these extensions. The SHMEM extensions summarized in this paper include: "Thread-Safety", "Alltoall Collectives", "Flexible PE Subsets", "Shared Memory Pointers", "Put With Signal", "Non-blocking Put", and "Non-blocking Get".
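
An illustrative usage sketch of two of these extensions; the names and signatures below are hypothetical stand-ins (the actual Cray and OpenSHMEM interfaces may differ), but the pattern is the point: start the transfer, overlap computation, then complete it, or deliver data and a completion flag in a single call.

    #include <shmem.h>
    #include <stddef.h>

    #define N 4096

    double src[N], dst[N];    /* symmetric buffers                        */
    long   flag;              /* symmetric completion flag on the target  */

    /* Hypothetical prototypes standing in for the proposed extensions. */
    void shmem_double_put_nbi(double *dest, const double *source,
                              size_t nelems, int pe);
    void shmem_putmem_signal(void *dest, const void *source, size_t nelems,
                             long *sig_addr, long sig_value, int pe);

    void overlap_example(int peer)
    {
        /* Non-blocking put: returns as soon as the transfer is started. */
        shmem_double_put_nbi(dst, src, N, peer);

        /* ... local computation overlapped with the transfer ... */

        shmem_quiet();    /* completes all outstanding non-blocking puts */

        /* Put-with-signal: payload plus a flag update the target can wait
         * on (e.g. with shmem_long_wait_until), in one call. */
        shmem_putmem_signal(dst, src, N * sizeof(double), &flag, 1, peer);
    }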

2:40 PM: A Case for Non-Blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X
Ammar Ahmad Awan, Khaled Hamidouche, Ching-Hsiang Chu and Dhabaleswar Panda

An ever-increasing push for performance in the HPC arena has led to a multitude of hybrid architectures in both software and hardware for HPC systems. The Partitioned Global Address Space (PGAS) programming model has gained a lot of attention over the last couple of years. The main advantage of the PGAS model is the ease of programming provided by the abstraction of a single memory across the nodes of a cluster. OpenSHMEM implementations currently implement the OpenSHMEM 1.2 specification, which provides interfaces for one-sided, atomic, and collective operations. However, the recent trend in the HPC arena in general, and in the Message Passing Interface (MPI) community in particular, is to use Non-Blocking Collective (NBC) communication to efficiently overlap computation with communication and save precious CPU cycles.

This work is inspired by encouraging performance numbers for the NBC implementations of various MPI libraries. As the OpenSHMEM community has been discussing the use of non-blocking communication, in this paper we propose an NBC interface for OpenSHMEM and present its design, implementation, and performance evaluation. The NBC interface is modeled along the lines of the MPI NBC interface and requires minimal changes to the function signatures. We have designed and implemented this interface using the Unified Communication Runtime in MVAPICH2-X. In addition, we propose OpenSHMEM NBC benchmarks as an extension to the OpenSHMEM benchmarks available in the widely used OMB suite. Our performance evaluation shows that the proposed NBC implementation provides up to 96% overlap for different collectives with little NBC overhead.
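
A hedged sketch of what such an interface could look like for a broadcast, modeled on the blocking shmem_broadcast64 signature plus a request handle; shmem_broadcast64_nb, shmem_req_t, and shmem_wait_req are illustrative names, not the interface proposed in the paper or later standardized.

    #include <shmem.h>
    #include <stddef.h>

    #define N 1024

    long src[N], dst[N];                 /* symmetric buffers */
    long pSync[SHMEM_BCAST_SYNC_SIZE];   /* symmetric         */

    /* Hypothetical handle and prototypes. */
    typedef void *shmem_req_t;
    void shmem_broadcast64_nb(void *dest, const void *source, size_t nelems,
                              int PE_root, int PE_start, int logPE_stride,
                              int PE_size, long *pSync, shmem_req_t *req);
    void shmem_wait_req(shmem_req_t req);

    void nbc_example(void)
    {
        for (int i = 0; i < SHMEM_BCAST_SYNC_SIZE; i++)
            pSync[i] = SHMEM_SYNC_VALUE;
        shmem_barrier_all();

        shmem_req_t req;
        shmem_broadcast64_nb(dst, src, N, 0, 0, 0, shmem_n_pes(), pSync, &req);

        /* ... computation overlapped with the collective ... */

        shmem_wait_req(req);   /* the broadcast has completed on this PE */
    }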


Session 4
Session Chair - Nicolas Park


INVITED TALK
3:30 PM: Invited Talk: Improving Application Scaling using OpenSHMEM for GPU-Initiated Communication
Sreeram Potluri, NVIDIA

State-of-the-art scientific applications running on GPU clusters typically offload computation phases onto the GPU using CUDA or a directives-based approach while relying on the CPU to manage cluster communication. This dependency on the CPU for communication has limited their strong scalability, owing to the overhead of repeated kernel launches, CPU-GPU synchronization, underutilization of the GPU during synchronization, and underutilization of the network during compute. Addressing this apparent Amdahl's fraction is imperative for strong scaling of applications on GPU clusters. GPUs are designed for extreme throughput and have enough parallelism and state to hide long latencies to global memory. The CUDA programming model and its practices guide application developers to take advantage of this throughput-oriented architecture. It is important to take advantage of these inherent capabilities of the GPU and the CUDA programming model when tackling communication on GPU clusters. NVSHMEM is a prototype implementation of OpenSHMEM that provides a Partitioned Global Address Space (PGAS) spanning memory across multiple GPUs. It provides an API for fine-grained GPU-GPU data movement and synchronization from within a CUDA kernel. This talk outlines the implementation of NVSHMEM on single-node and multi-node GPU architectures. Example applications from multiple domains are used to demonstrate the use of GPU-initiated communication and its impact on performance and scaling.

4:20 PM: An Evaluation of OpenSHMEM Interfaces for Variable-length Collective Operations
M. Graham Lopez, Pavel Shamis and Manjunath Gorentla Venkata

Alltoallv() is a collective operation which allows all processes to exchange variable amounts of data with all other processes in the communication group. This means that Alltoallv() requires not only O(N^2) communications, but typically also additional exchanges of the data lengths that will be transmitted in the eventual Alltoallv() call. This pre-exchange is usually necessary in order to calculate the proper offsets for the receiving buffers on the target processes. However, we propose two additional variants for Alltoallv() that would mitigate the need for the user to set up this extra exchange of information, at the possible cost of memory efficiency. We explain the interfaces for these new variants and show how a single call can be used in place of the Alltoall()/Alltoallv() pair. We then discuss the performance tradeoffs for overall communication and memory costs, and examine both software- and hardware-based optimizations and their applicability to the various proposed interfaces.
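
One way such a variant could look, sketched under assumptions: each PE reserves a worst-case slot per source PE, so neither the pre-exchange of lengths nor the offset calculation is needed, at the cost of memory. The routine name, its signature, and the NPES/MAX_PER_PE constants are hypothetical.

    #include <shmem.h>
    #include <stddef.h>

    #define NPES        64       /* illustrative job size                */
    #define MAX_PER_PE  1024     /* worst-case element count per PE pair */

    long   src[NPES * MAX_PER_PE];   /* symmetric: slot i holds data for PE i  */
    long   dst[NPES * MAX_PER_PE];   /* symmetric: slot i receives from PE i   */
    size_t send_counts[NPES];        /* how much this PE sends to each target  */
    size_t recv_counts[NPES];        /* symmetric: filled in by the collective */

    /* Hypothetical single-call variant replacing the Alltoall()/Alltoallv()
     * pair. */
    void shmem_alltoallv_max(long *dest, const long *source,
                             const size_t *send_counts, size_t *recv_counts,
                             size_t max_nelems_per_pe,
                             int PE_start, int logPE_stride, int PE_size);

    void exchange(void)
    {
        shmem_alltoallv_max(dst, src, send_counts, recv_counts,
                            MAX_PER_PE, 0, 0, shmem_n_pes());
    }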

4:45 PM: Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM
Jian Lin, Khaled Hamidouche, Jie Zhang, Xiaoyi Lu, Abhinav Vishnu and Dhabaleswar Panda

Machine learning algorithms are benefiting from the continuous improvement of programming models, including MPI, MapReduce, and PGAS. The k-Nearest Neighbors (k-NN) algorithm is a widely used machine learning algorithm, applied to supervised learning tasks such as classification. Several parallel implementations of k-NN have been proposed in the literature and in practice. However, on high-performance computing systems with high-speed interconnects, it is important to further accelerate existing designs of the k-NN algorithm by taking advantage of scalable programming models. To improve the performance of k-NN in large-scale environments with InfiniBand networks, this paper proposes several alternative hybrid MPI+OpenSHMEM designs and performs a systematic evaluation and analysis on typical workloads. The hybrid designs leverage one-sided memory access to overlap communication with computation better than the existing pure MPI design, and introduce improved schemes for efficient buffer management. The implementation, based on the k-NN program from MaTEx and MVAPICH2-X (Unified MPI+PGAS Communication Runtime over InfiniBand), shows up to a 9.0% time reduction for training the KDD Cup 2010 workload over 512 cores, and a 27.6% time reduction for a small workload with balanced communication and computation. Experiments with varying numbers of cores show that our design maintains good scalability.
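
A minimal hybrid MPI+OpenSHMEM sketch of the kind of program structure such designs rely on (running both models in one executable is what a unified runtime like MVAPICH2-X provides); the k-NN-specific design is in the paper, and the buffer and reduction here are illustrative.

    #include <mpi.h>
    #include <shmem.h>

    #define N 1024

    double distances[N];   /* symmetric: a peer writes partial results here */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        shmem_init();

        int me   = shmem_my_pe();
        int peer = (me + 1) % shmem_n_pes();

        /* One-sided phase: push partial distance results to a peer. */
        shmem_double_put(distances, distances, N, peer);
        shmem_barrier_all();

        /* Collective phase handled with MPI. */
        double local_min = distances[0], global_min;
        MPI_Allreduce(&local_min, &global_min, 1, MPI_DOUBLE, MPI_MIN,
                      MPI_COMM_WORLD);

        shmem_finalize();
        MPI_Finalize();
        return 0;
    }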

5:10 PM: Parallelizing the Smith-Waterman Algorithm using OpenSHMEM and MPI-3 One-Sided Interfaces
Matthew Baker, Aaron Welch and Manjunath Gorentla Venkata

The Smith-Waterman algorithm is used for determining the similarity between two very long data streams. A popular application of the Smith-Waterman algorithm is sequence alignment in DNA sequences. Like many computational algorithms, the Smith-Waterman algorithm is constrained by the resources of the system. As such, it can be accelerated by parallelizing the implementation and using HPC systems. A central part of the algorithm is computing the similarity matrix that ranks the quality of matching. The access pattern for computing the similarity matrix is non-uniform; as such, it is better suited to the Partitioned Global Address Space (PGAS) programming model. In this paper, we explore parallelizing the Smith-Waterman algorithm using the OpenSHMEM model and interfaces in OpenSHMEM 1.2 as well as the one-sided communication interfaces in MPI-3. Further, we also explore the advantages of using non-blocking communication interfaces, which are proposed as extensions for a future OpenSHMEM specification. We evaluate the parallel implementations on Titan, a Cray XK7 system at the Oak Ridge Leadership Computing Facility (OLCF). Our results demonstrate good weak and strong scaling characteristics for both the OpenSHMEM and MPI-3 implementations.
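
A hedged sketch of one wavefront step under assumptions (not the paper's implementation): each PE owns a block of columns of the similarity matrix, delivers its rightmost column to the right neighbor with a one-sided put, and signals a flag the neighbor waits on before computing.

    #include <shmem.h>

    #define BLOCK 256

    int  my_cols[BLOCK][BLOCK];  /* local block of the similarity matrix       */
    int  halo_in[BLOCK];         /* symmetric: left neighbor writes its column */
    long ready = 0;              /* symmetric: set when halo_in is valid       */

    void compute_block(void)
    {
        int me = shmem_my_pe();
        int col_out[BLOCK];

        if (me > 0) {
            /* Wait for the left neighbor to deliver its rightmost column. */
            shmem_long_wait_until(&ready, SHMEM_CMP_EQ, 1);
        }

        /* ... fill my_cols using halo_in as the left dependency (omitted) ... */

        /* Hand my rightmost column to the right neighbor and signal it. */
        for (int i = 0; i < BLOCK; i++)
            col_out[i] = my_cols[i][BLOCK - 1];
        if (me + 1 < shmem_n_pes()) {
            shmem_int_put(halo_in, col_out, BLOCK, me + 1);
            shmem_fence();                        /* data before flag */
            shmem_long_p(&ready, 1, me + 1);
        }
    }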