OpenSHMEM 2017: Fourth workshop on OpenSHMEM and Related Technologies
AGENDA
OpenSHMEM 2017 Agenda
Monday, August 7th
7:30 AM |
Working Breakfast |
8:30 AM |
Registration |
9:00 AM |
Welcome: Neena Imam, Oak Ridge National Laboratory |
9:15 AM |
Keynote: Shared Memory HPC Programming: Past, Present, and Future? |
10:00 AM |
Coffee Break |
|
Session 1: Invited Talks |
10:15 AM |
HPE Title: Gen-Z, The Machine, and OpenSHMEM The HPC environment is evolving to combine the performance of traditional programming models like MPI and OpenSHMEM with the flexibility and fault tolerance of big data programming models. The Gen-Z interconnect and HPE's memory-driven computing concepts will deliver the hardware environment. In this talk we will briefly discuss this hardware and then delve into the mental model it enables and the software that is needed. OpenSHMEM is poised to make the most of this new world, but it will require several enhancements. We will show how prior proposals by members of the OpenSHMEM community, together with additional changes, will provide the middleware that future applications require. |
11:15 AM |
Paratools |
|
Title: Performance Analysis of OpenSHMEM Applications with TAU Commander |
|
The TAU Performance System (TAU) is a powerful and highly versatile profiling and tracing tool ecosystem for performance engineering of parallel programs. Developed over the last twenty years, TAU has evolved with each new generation of HPC systems and scales efficiently to hundreds of thousands of cores. TAU's organic growth has resulted in a loosely coupled software toolbox, such that novice users first encountering TAU's complexity and vast array of features are often intimidated and easily frustrated. To lower the barrier to entry for novice TAU users, ParaTools and the US Department of Energy have developed "TAU Commander," a performance engineering workflow manager that facilitates a systematic approach to performance engineering, guides users through common profiling and tracing workflows, and offers constructive feedback in case of error. This work compares TAU and TAU Commander workflows for common performance engineering tasks in OpenSHMEM applications and demonstrates workflows targeting two different SHMEM implementations; Intel Xeon "Haswell" and "Knights Landing" processors; direct and indirect measurement methods; and callsite data, profiles, and traces. |
12:15 PM |
Working Lunch |
|
Session 2: OpenSHMEM Extensions |
1:30 PM |
Symmetric Memory Partitions in OpenSHMEM: A case study with Intel KNL |
|
To extract the best performance from emerging tiered memory systems, it is essential for applications to use the different kinds of memory available on the system. The OpenSHMEM memory model consists of data objects that are private to each Processing Element (PE) and data objects that are remotely accessible by all PEs. The remotely accessible data objects are called Symmetric Data Objects and are allocated on a memory region called the symmetric heap. The symmetric heap is created during program execution on a memory region determined by the OpenSHMEM implementation. This paper proposes a new feature called Symmetric Memory Partitions to enable users to determine the size, along with other memory traits, for creating the symmetric heap. Moreover, this paper uses Intel KNL processors as an example use case for emerging tiered memory systems. It also describes the implementation of symmetric memory partitions in Cray SHMEM and uses ParRes OpenSHMEM microbenchmark kernels to show the benefits of selecting the memory region for the symmetric heap. |
2:00 PM |
Implementation and Evaluation of OpenSHMEM Contexts Using OFI Libfabric |
|
HPC system and processor architectures are trending toward increasing numbers of cores and tall, narrow memory hierarchies. As a result, programmers have embraced hybrid parallel programming as a means of tuning for such architectures. While popular HPC communication middleware, such as MPI, allows the use of threads, most implementations fall short of fully integrating threads with the communication model. The OpenSHMEM contexts proposal promises thread isolation and direct mapping of threads to network resources; however, fully realizing this potential depends on support for efficient threaded communication through the underlying layers of the networking stack. In this paper, we explore the mapping of OpenSHMEM contexts to the new OpenFabrics Interfaces (OFI) libfabric communication layer and use the libfabric GNI provider to access the Aries interconnect. We describe the design of our multithreaded OpenSHMEM middleware and evaluate both the programmability and performance impacts of contexts on single- and multi-threaded OpenSHMEM programs. Results indicate that the mapping of contexts to the Aries interconnect through libfabric incurs low overhead and that contexts can provide significant performance improvements to multithreaded OpenSHMEM programs. |
2:30 PM |
Merged Requests for Better Performance and Productivity in Multithreaded OpenSHMEM A merged request is a handle representing a group of Remote Memory Access (RMA), atomic, or collective operations. The merged request can be created either by combining multiple outstanding merged request handles or by using the same merged request handle for additional operations. We show that introducing such simple yet powerful semantics in OpenSHMEM provides many productivity and performance advantages. In this paper, we first introduce the interfaces and semantics for creating and using merged request handles. Then, we demonstrate that with merged requests we can achieve better performance characteristics in multithreaded OpenSHMEM applications. In particular, we show that one can achieve a higher message rate, higher bandwidth for smaller messages, and better computation-communication overlap. Further, we use merged requests to realize multithreaded collectives, where multiple threads cooperate to complete the collective operation. Our experimental results show that in a multithreaded OpenSHMEM program, merged-request-based RMA operations achieve over 100 Million Messages Per Second (MMPS); in a single-threaded environment, they achieve over 10 MMPS compared to 4.5 MMPS with default RMA operations. We also achieve higher bandwidth for smaller message sizes, close to 100% overlap, and a 60% reduction in latency. |
3:00 PM |
Evaluating Contexts in OpenSHMEM-X Reference Implementation Many-core processors are now ubiquitous in supercomputing. This evolution pushes toward the adoption of mixed models in which cores are exploited with threading models (and related programming abstractions, such as OpenMP), while communication between distributed memory domains employs a communication Application Programming Interface (API). OpenSHMEM is a partitioned global address space communication specification that exposes one-sided and synchronization operations. As the threaded semantics of OpenSHMEM are being fleshed out by its standardization committee, it is important to assess the soundness of the proposed concepts. This paper implements and evaluates the "context" extension in relation to threaded operations. We discuss the implementation challenges of contexts and the associated API in OpenSHMEM-X. We then evaluate their performance in threaded situations on an InfiniBand network using micro-benchmarks and the Random Access benchmark, and see that adding communication contexts significantly improves the message rate achievable by multi-threaded PEs. |
|
|
3:30 PM |
Coffee Break |
|
Session 3: Invited Talks |
3:45 PM |
Mellanox Technologies Title: InfiniBand Enhancements in Support of the OpenSHMEM Specification Mellanox Technologies develops and supports an implementation of the OpenSHMEM specification, currently at version 1.3. As part of this work, Mellanox has introduced capabilities that enhance the quality and performance of its implementation. This talk will discuss the latest InfiniBand optimizations used to enhance support for the OpenSHMEM specification. |
4:45 PM |
NVIDIA Title: A Deep Dive into NVIDIA's Volta Architecture NVIDIA continues to push the performance boundaries of GPU architectures. Last year NVIDIA released the Pascal architecture, which was hailed as an impressive leap forward for high-performance computing. This year NVIDIA announced the Volta architecture, which pushes the boundaries even further. This talk will touch on new features of the Volta architecture and how the new Tensor Cores can achieve 120 TFLOPS of performance from a single GPU. |
Tuesday, August 8th
7:30 AM |
Working Breakfast |
8:30 AM |
Registration |
9:00 AM |
Welcome and First Day Recap: Manjunath Gorentla Venkata, ORNL |
Session 4: Invited Talk |
|
9:15 AM |
AMD Title: OpenCAPI, Gen-Z, CCIX: Technology Overview, Trends, and Alignments The past year has seen the formation of three major system interconnect consortia: OpenCAPI, Gen-Z, and CCIX. These organizations share similar motivations to enable more efficient transfer and sharing of data among various components in a system, such as CPUs, GPUs, FPGAs, non-volatile storage technologies, attached memory, etc. All three organizations are promoting adoption of their designs as industry standards. Although there are clear areas of differentiation among the three, there are also areas of overlap. All three specifications are less than a year old at this point and are still evolving. The makeup of the organizations themselves is also in flux. In times past, there have been similar points where multiple technology standards were proposed that, ultimately, were resolved into a single common standard, such as the merger of the Future I/O and Next Generation I/O efforts resulting in InfiniBand. Will the same happen with these three? This talk will concentrate on an overview and technology comparison of the three approaches, discuss trends that are starting to clarify, and suggest possible alignments moving forward. |
10:00 AM |
Coffee Break |
|
Session 5: Evaluations and Implementations |
10:15 AM |
Application-Level Optimization of On-Node Communication in OpenSHMEM The OpenSHMEM community is actively exploring threading support extensions to the OpenSHMEM communication interfaces. Among the motivations for these extensions are the optimization of on-node data sharing and reduction of memory pressure, both of which are problems that hybrid programming has successfully addressed in other programming models. We observe that OpenSHMEM already supports inter-process shared memory for processes within the same node. In this work, we assess the viability of this existing API to address the on-node optimization problem, which is of growing importance. We identify multiple on-node optimizations that are already possible with the existing interface, propose a layered library that extends the functionality of these interfaces, and measure performance improvement when using these techniques. |
10:45 AM |
Portable SHMEMCache: A High-Performance Key-Value Store on OpenSHMEM and MPI The integration of Big Data frameworks and HPC capabilities has drawn enormous interest in recent years. SHMEMCache is a distributed key-value store built on the OpenSHMEM global address space. It has solved several practical issues in leveraging OpenSHMEM's one-sided operations for a distributed key-value store and provides efficient key-value operations on both commodity machines and supercomputers. However, being based solely on OpenSHMEM, SHMEMCache cannot leverage one-sided operations from a variety of software packages. This results in several limitations. First, SHMEMCache is not available to a wider range of platforms. Second, an opportunity for potential performance improvement is missed. Third, there is a lack of deep understanding of how different one-sided operations can fit in with SHMEMCache and other distributed key-value stores in general. For example, the one-sided operations in OpenSHMEM and MPI have many differences in their interfaces, memory semantics, and synchronization methods, all of which can have distinct implications and also increase the complexity of supporting both OpenSHMEM and MPI for SHMEMCache. Therefore, we have undertaken an effort to leverage various one-sided operations for SHMEMCache and propose a design for a portable SHMEMCache. Based on this new framework, we support both OpenSHMEM and MPI. We have also conducted an extensive set of experiments to compare the performance of the two versions on both commodity machines and the Titan supercomputer. |
11:15 AM |
Balancing Performance and Portability with Containers in HPC: An OpenSHMEM Example There is a growing interest in using Linux containers to streamline software development and application deployment. A container enables the user to bundle the salient elements of the software stack from an application's perspective. In this paper, we discuss initial experiences in using the Open MPI implementation of OpenSHMEM with containers on HPC resources. We provide a brief overview of two container runtimes (Docker & Singularity), highlighting elements that are of interest to HPC users. The Docker platform offers a rich set of services that are widely used in enterprise environments, whereas Singularity is an emerging container runtime that is specifically written for use on HPC systems. We describe our procedure for container assembly and deployment, which strives to maintain the portability of the container-based application. We show performance results for the Graph500 benchmark running along the typical continuum from development testbed up to a full production supercomputer (ORNL's Titan). The results show consistent performance between the native and Singularity (container) tests. The results also show an unexplained drop in performance when using the Cray Gemini network with Open MPI's OpenSHMEM, which was unrelated to the container usage. |
|
11:45 AM |
Exploiting and Evaluating OpenSHMEM on KNL Architecture |
|
Manycore processors such as Intel Xeon Phi (KNL) with on-package Multi-Channel DRAM (MCDRAM) are making a paradigm shift in the High Performance Computing (HPC) industry. PGAS programming models such as OpenSHMEM, with their lightweight synchronization primitives and shared memory abstractions, are considered a good fit for irregular communication patterns. While regular programming models such as MPI/OpenMP have started utilizing systems with KNL processors, it is still not clear whether PGAS models can easily adopt and fully utilize such systems. In this paper, we conduct a comprehensive performance evaluation of the OpenSHMEM runtime on many-/multi-core processors. We also explore the performance benefits offered by the highly multithreaded KNL, along with the AVX-512 extensions and MCDRAM, for the OpenSHMEM programming model. We evaluate intra- and inter-node performance of OpenSHMEM primitives on different application kernels. Our evaluation of application kernels such as the NAS Parallel Benchmarks and 3D-Stencil kernels shows that OpenSHMEM with the MVAPICH2-X runtime is able to take advantage of AVX-512 extensions and MCDRAM to exploit the architectural features provided by KNL processors. |
12:15 PM |
Working Lunch |
|
Session 6: OpenSHMEM Applications |
1:30 PM |
Parallelizing Single Source Shortest Path with OpenSHMEM Single Source Shortest Path (SSSP) is one of the widely occurring graph problems, in which paths are discovered from an origin vertex to all other vertices in the graph. In this paper, we discuss our experience parallelizing SSSP using OpenSHMEM. We start with the serial Dijkstra and Bellman-Ford algorithms, parallelize them, and adapt them to the Partitioned Global Address Space (PGAS) programming model. We implement the parallel algorithms using OpenSHMEM and introduce a series of optimizations to achieve higher scaling and performance. The implementation is evaluated on Titan with various graphs, including synthetic Recursive Matrix (R-MAT) and small-world network graphs as well as real-world graphs from Facebook, Twitter, LiveJournal, and the road maps of California and Texas. |
2:00 PM |
Efficient Breadth First Search on Multi-GPU Systems using GPU-centric OpenSHMEM NVSHMEM is an implementation of OpenSHMEM for NVIDIA GPUs which allows communication to be issued from inside CUDA kernels. In this work, we present an implementation of Breadth First Search for multi-GPU systems using NVSHMEM. We analyze the benefits and bottlenecks of moving fine-grained communication into CUDA kernels. Using our implementation of BFS, we achieve up to 60% improvement in performance compared to a CUDA-aware MPI-based implementation in the best case. We see around 19% improvement in peak GTEPS achieved on a system with 8 GPUs. |
|
Session 7: Invited Talks |
2:30 PM |
ARM Title: OpenSHMEM on ARM Applications, programming languages, and libraries that leverage sophisticated network hardware capabilities have a natural advantage when used in today's and tomorrow's high-performance and data center computing environments. Modern RDMA-based network interconnects provide incredibly rich functionality (RDMA, atomics, OS-bypass, etc.) that enables low-latency and high-bandwidth communication services. This functionality is supported by a variety of interconnect technologies such as InfiniBand, RoCE, iWARP, Intel OPA, Cray's Aries/Gemini, and others. With the emerging availability of HPC solutions based on the ARM CPU architecture, it is important to understand how ARM integrates with the RDMA hardware and the OpenSHMEM programming model. In this talk, we will give an overview of the ARM architecture and the OpenSHMEM software stack. We will discuss how the ARM CPU interacts with network devices and accelerators. In addition, we will share our experience in enabling OpenSHMEM on ARM and present preliminary evaluation results. |
3:30 PM |
Coffee Break |
3:45 PM |
Cray Inc. Title: Cray SHMEM: Current State and Future Directions Global supercomputer leader Cray Inc. provides HPC customers with a high-performance, high-quality implementation of the OpenSHMEM API in the company's flagship line of Cray XC supercomputers. Cray's implementation of OpenSHMEM is called Cray SHMEM. This vendor presentation will provide an overview of the roadmap to extend the support of Cray SHMEM to future platforms. It also highlights recent performance enhancements and the Cray-specific extensions added to meet the requirements of a broader set of OpenSHMEM users. |
4:45 PM |
Closing Remarks: Neena Imam |
Wednesday, August 9th
7:30 AM |
Working Breakfast |
8:30 AM |
Registration |
9:00 AM |
Voting and Reading of Tickets |
10:00 AM |
Coffee Break |
10:15 AM |
Voting, Reading, and Discussion of Tickets |
12:15 PM |
Working Lunch |
1:30 PM |
Reading and Discussion of Tickets (cont.) |
3:30 PM |
Coffee Break |
3:45 PM |
Reading and Discussion of Tickets (cont.) |
Voting
Reading
- PR #89 Update shmem_test() example (Nick)
- PR #92 Remove expanded type support for the newly-deprecated shmem wait (Nick)
- PR #90 Query Available Symmetric Heap Memory (Manju)
- PR #63 Communication Contexts Extension (Jim)
- PR #64 Add sync routines (Jim)
Discussion
- Explicit RMA and Merged Request Handles (Swen)
- GitHub Issues (authors of the issues)
- Release of v1.4 and schedule for future Meetings (Steve/Manju/Jim)
- Update on specification language improvement/rewrite (Mike Culhane/Steve)