OpenSHMEM 2016

AGENDA

Tuesday | Wednesday | Thursday

Tuesday, August 2nd

7:30 AM – 8:30 AM

Working Breakfast
Agenda: Meet and Greet with Steve Oberlin

8:30 AM

Registration

8:50 AM

Welcome: Neena Imam, Oak Ridge National Laboratory

9:00 AM

Keynote: GPUs, NVLink, and the Dawn of a New SHMEM Golden Age
Steve Oberlin, NVIDIA

In the early days of distributed HPC, SHMEM was created to enable efficient direct access by programs to the custom-architecture shared global memory of the first Cray Research MPP, the Cray T3D. In subsequent decades, the commoditization of HPC clusters around standard servers and networks contributed to the dominance of coarse-grain message-passing programming models compatible with their modest communications capabilities. Lacking efficient native GAS platforms, SHMEM applications are in the minority today, and most PGAS implementations are built on top of MPI runtimes and primitives.

Recently, a new, inherently parallel processor architecture has emerged that offers, for the first time, the possibility of efficient native GAS implementation in high-volume devices. GPUs have evolved from fixed-function scan-line rendering engines into powerful general-purpose parallel coprocessors with unprecedented memory bandwidth, latency hiding, and fine-grain synchronization capabilities. NVIDIA's latest generation "Pascal" GPU architecture introduces NVLink, a new interconnect interface that extends the GPU memory model across multiple directly-connected processors, enabling native loads/stores/atomics to another device's memory at very high speed with low overhead.

The success and cost-effectiveness of GPUs have driven performance and productivity in several areas of science and technology, most recently emerging as the de facto platform for the exploding field of machine learning. Native support for global shared memory across multiple NVLink-connected GPUs could herald the arrival of a new golden age of SHMEM adoption and applications growth.

This talk will sparsely review the history of architecture support for GAS and SHMEM in distributed HPC, discuss the necessary elements of HW support for efficient implementations, compare and contrast CPU and GPU microarchitectures and their ability to provide such support, introduce NVLink and Pascal's first implementation of it, and describe early NVLink-connected multi-GPU systems and some initial performance results using NVSHMEM, a native GPU-initiated-communications SHMEM implementation.

Session 1: OpenSHMEM Extensions
Manjunath Gorentla Venkata, Session Chair

10:00 AM

Integrating Asynchronous Task Parallelism with OpenSHMEM
Max Grossman, Vivek Kumar, Zoran Budimlic and Vivek Sarkar

Computing systems are rapidly moving toward exascale, requiring highly scalable means of specifying the communication and computation to be carried out by the machine. Because of the complexity of these systems, existing communication models for High Performance Computing (HPC) are running into performance limitations, as they can make it difficult to identify and exploit opportunities for computation-communication overlap. Existing communication models also lack integration with multi-threaded programming models, reducing programmability and often requiring overly coarse synchronization between the separate communication and computation programming models.

Partitioned Global Address Space (PGAS) programming models combine shared and distributed memory features, providing the basis for high-performance and high-productivity parallel programming environments. Most current PGAS approaches use complex compiler transformations to translate the user code to native code. OpenSHMEM is a very widely used PGAS programming model that offers a library-based approach. Currently, OpenSHMEM relies on other libraries (e.g., OpenMP) for harnessing node-level parallelism. This OpenSHMEM+X approach requires the expertise of a hero-level programmer and typically encounters bottlenecks on shared resources, long wait times due to load imbalance, and data locality problems.

In this paper, we introduce AsyncSHMEM, a PGAS library that supports a tighter integration of shared and distributed memory parallelism than approaches based on OpenSHMEM+X. AsyncSHMEM integrates the OpenSHMEM library with a thread-pool-based work-stealing runtime. AsyncSHMEM aims to prepare OpenSHMEM for the next generation of HPC systems by making it more adaptive and by taking advantage of asynchronous computation to hide data transfer latencies, interoperate with tasks, improve load balancing (of both communication and computation), and improve locality. In this paper we present the design and implementation of AsyncSHMEM, and demonstrate its performance through a scalability analysis of two benchmarks on the Titan supercomputer. Our experiments show that AsyncSHMEM is competitive with the OpenSHMEM+OpenMP model when executing highly regular workloads, while it significantly outperforms it on highly event-driven applications.
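
For context, the sketch below illustrates the coarse-grained OpenSHMEM+OpenMP hybrid style that the paper contrasts with: threads compute inside a parallel region, and communication is funneled through a single thread at a bulk-synchronous boundary. It uses only standard OpenSHMEM 1.3 and OpenMP calls and is illustrative only; it is not taken from the AsyncSHMEM implementation.

    /* Coarse-grained OpenSHMEM+OpenMP hybrid: compute with threads, then
       communicate from a single thread at a global synchronization point. */
    #include <shmem.h>
    #include <omp.h>

    #define N 4096
    static double local[N], remote[N];   /* symmetric (static) arrays */

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Node-level computation: all OpenMP threads participate. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            local[i] = 0.5 * i + me;

        /* Communication is issued by one thread and globally synchronized;
           this coarse coupling is what an integrated task-parallel runtime
           aims to relax by overlapping tasks with communication. */
        shmem_double_put(remote, local, N, (me + 1) % npes);
        shmem_barrier_all();

        shmem_finalize();
        return 0;
    }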

10:30 AM

Evaluating OpenSHMEM Explicit Remote Memory Access Operations and Merged Requests
Swen Boehm, Swaroop Pophale and Manjunath Gorentla Venkata

The OpenSHMEM Library Specification has evolved considerably since version 1.0. Recently, non-blocking implicit RMA operations were introduced in OpenSHMEM 1.3. These provide a way to achieve overlap between communication and computation. The implicit non-blocking operations do not have a separate handle to track and complete individual RMA operations. They are completed by shmem_quiet(), shmem_barrier() or shmem_barrier_all(), which are global completion and synchronization operations. Though this semantic is expected to achieve a higher message rate for applications, one drawback is that it does not allow fine-grained control over the completion of the operations.

In this paper, first, we introduce non-blocking RMA operations with requests, where each operation has an explicit request to track and complete it. Second, we introduce interfaces for merging requests. A merged request tracks multiple user-selected RMA operations, which provides the flexibility of combining related communication operations into the same handle. In the last part, we explore the implications in terms of performance, productivity, usability, and the possibility of defining different patterns of communication via the merging of requests. Our experimental results show that a well designed and implemented OpenSHMEM stack can hide the overhead of allocating and managing the requests. The latency of RMA operations with requests is similar to that of blocking and implicit non-blocking RMA operations. Further, we observe that using RMA operations with requests and merging these requests can improve the performance of the Scalable Synthetic Compact Applications (SSCA) SSCA1 application: it outperforms the implementations using blocking RMA operations and implicit non-blocking operations by 49% and 73%, respectively.
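
For background, the sketch below shows the implicit non-blocking interface that the paper starts from, using standard OpenSHMEM 1.3 calls; note that completion can only be enforced collectively for all outstanding operations with shmem_quiet(). The request-based and merged-request interfaces proposed in the paper are not reproduced here, since their exact signatures are defined by the paper itself.

    #include <shmem.h>

    #define N 1024
    static double src[N], dst[N];   /* symmetric (static) arrays */

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        for (int i = 0; i < N; i++) src[i] = me + i;

        /* Implicit non-blocking put (OpenSHMEM 1.3): no per-operation handle
           is returned, so this individual transfer cannot be tracked. */
        shmem_double_put_nbi(dst, src, N, (me + 1) % npes);

        /* ... independent computation may overlap the transfer here ... */

        shmem_quiet();        /* completes ALL outstanding implicit operations */
        shmem_barrier_all();

        shmem_finalize();
        return 0;
    }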

11:00 AM – 11:30 AM

Break

11:30 AM

Increasing Computational Asynchrony in OpenSHMEM with Active Messages
Siddhartha Jana, Tony Curtis, Dounia Khaldi and Barbara Chapman

Recent reports on the challenges of programming models at extreme scale suggest a shift from traditional block-synchronous execution models to those that support more asynchronous behavior. The OpenSHMEM programming model enables HPC programmers to exploit underlying network capabilities while designing asynchronous communication patterns. The strength of its communication model is fully realized when these patterns are characterized by small, low-latency data transfers. However, for cases with large data payloads coupled with insufficient computation overlap, OpenSHMEM programs suffer from underutilized CPU cycles.

In order to tackle the above challenges, this paper explores the feasibility of introducing Active Messages in the OpenSHMEM model. Active Messages is a well-established programming paradigm that enables a process to trigger execution of computation units on remote processes. Using empirical analyses, we show that this approach of moving computation closer to data provides a mechanism for OpenSHMEM applications to avoid the latency costs associated with bulk data transfers. In addition, this programming pattern helps reduce the need for unwanted synchronization among processes, thereby exploiting more asynchrony within an algorithm. As part of this preliminary work, we propose an API that supports the use of Active Messages within the OpenSHMEM execution model. We present a microbenchmark-based performance evaluation of our prototype implementation. We also compare the execution of a Traveling Salesman Problem solver designed with and without Active Messages. Our experiments indicate promising benefits at scale.
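
To make the idea concrete, the sketch below outlines what registering and triggering an active-message handler could look like. The shmemx_am_* names and signatures are purely illustrative placeholders, kept as comments; they are not the API proposed in the paper nor part of the OpenSHMEM specification.

    #include <shmem.h>
    #include <stddef.h>

    #define AM_APPLY_UPDATE 1
    static double table[1024];     /* data owned by each PE */

    /* Handler body: executed on the target PE, next to the data it updates,
       so only the small payload crosses the network. */
    void apply_update(void *payload, size_t nbytes, int src_pe) {
        double *upd = (double *)payload;   /* upd[0] = index, upd[1] = delta */
        table[(int)upd[0]] += upd[1];
    }

    int main(void) {
        shmem_init();

        /* Hypothetical registration and invocation (illustrative names only):
             shmemx_am_attach(AM_APPLY_UPDATE, apply_update);
             double payload[2] = { 42.0, 3.14 };
             shmemx_am_request(1, AM_APPLY_UPDATE, payload, sizeof payload);
             shmemx_am_quiet();    wait for outstanding active messages       */

        shmem_barrier_all();
        shmem_finalize();
        return 0;
    }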

12:00 PM

System-level Transparent Checkpointing for OpenSHMEM
Rohan Garg, Gene Cooperman and Jerome Vienne

Fault tolerance is an active area of research for OpenSHMEM programs. In this work, we present the first approach using system-level transparent checkpointing. This complements an existing approach based on application-level checkpointing. Application-level checkpointing has advantages for algorithm-based fault tolerance, while transparent checkpointing can be invoked by the system at an arbitrary time. Unlike the earlier application-level work of Hao et al., this system-level approach creates checkpoint images in stable storage, thus enabling restart at a later time or even process migration. An experimental evaluation is presented using the NAS NPB benchmarks for OpenSHMEM. To support this work, the design of DMTCP (Distributed MultiThreaded CheckPointing) was extended to support shared memory regions in the absence of virtual memory.

12:30 PM – 1:30 PM

Working Lunch
Agenda: Paper Q&A (cont.) and networking

1:30 PM

Managing Errors with OpenSHMEM
Aurelien Bouteiller, Manjunath Gorentla Venkata and George Bosilca

Unexpected error conditions stem from a variety of underlying causes, including resource exhaustion, network failures, hardware failures, and program errors. With the ever-growing scale of HPC systems, the probability of encountering a condition that requires returning such an error increases; meanwhile, error recovery and run-through failure management are becoming mature, and interoperable HPC programming paradigms are starting to feature advanced error management. Given these evolutions, it becomes increasingly desirable to handle error conditions gracefully in OpenSHMEM. In this paper, we present the design and rationale behind an extension of the OpenSHMEM API that permits 1) notifying user code of unexpected erroneous conditions, 2) customizing the user response to errors without incurring overhead on the error-free execution path, 3) propagating the occurrence of an error condition to all Processing Elements, and 4) consistently closing the erroneous epoch in order to resume the application.
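
As a purely illustrative sketch of the kind of hook such an extension might expose, consider the fragment below; the shmemx_set_error_handler name is a hypothetical placeholder kept in a comment, not the paper's API nor part of any OpenSHMEM specification.

    #include <shmem.h>
    #include <stdio.h>

    /* User-supplied callback: invoked by the library when an error condition
       is raised, enabling an application-specific response (shrink the job,
       recover, or abort) instead of immediate termination. */
    void on_error(int errcode, int pe_involved) {
        fprintf(stderr, "PE %d observed error %d involving PE %d\n",
                shmem_my_pe(), errcode, pe_involved);
    }

    int main(void) {
        shmem_init();
        /* Hypothetical registration (illustrative name only); keeping a default
           abort handler preserves today's behavior at no cost on the error-free
           path:
             shmemx_set_error_handler(on_error);                              */
        shmem_finalize();
        return 0;
    }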

2:00 PM

On Synchronization and Shared Memory Reuse in OpenSHMEM
Aaron Welch, Manjunath Venkata and Barbara Chapman

OpenSHMEM is an open standard for PGAS libraries that provides one-sided communication semantics. Since the standardization process was completed in 2012, the OpenSHMEM API has seen a rapid succession of proposed extensions. Among these extensions is the addition of teams of Processing Elements (PEs) for greater flexibility in defining PE subsets for problem decomposition. Building on this, the spaces extension introduced the ability to manage memory exclusive to teams without the need for global synchronization. However, one problem that affects the usability of teams still remains: the user must manage the memory used internally by the implementation for synchronization in collective operations. This paper explores the possibilities for moving this responsibility from the user to the implementation, as well as the consequences that may arise as a result. To this end, we describe three methods of implementation and discuss the implications of their use compared to traditional user management of synchronization buffers.
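
The user-managed synchronization memory at issue is illustrated by the sketch below, which uses only standard OpenSHMEM 1.3 calls and constants: the caller must allocate, initialize, and carefully reuse the symmetric pSync work array for every collective, which is the burden the paper proposes to move into the implementation.

    #include <shmem.h>

    #define N 8
    static long   pSync[SHMEM_BCAST_SYNC_SIZE];   /* user-managed, symmetric */
    static double src[N], dst[N];

    int main(void) {
        shmem_init();

        /* The user must initialize pSync before first use and must not reuse
           it in another collective until all PEs have finished this one. */
        for (int i = 0; i < SHMEM_BCAST_SYNC_SIZE; i++)
            pSync[i] = SHMEM_SYNC_VALUE;
        shmem_barrier_all();

        /* Broadcast N 64-bit elements from PE 0 to all PEs. */
        shmem_broadcast64(dst, src, N, 0, 0, 0, shmem_n_pes(), pSync);

        shmem_barrier_all();
        shmem_finalize();
        return 0;
    }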

Session 2: Vendor Presentations
Nicholas Park, Session Chair

2:30 PM

Mellanox Technologies
Richard Graham

Exascale by Co-Design Architecture:
High performance computing has begun scaling beyond Petaflop performance towards the Exaflop mark. One of the major concerns throughout the development toward such performance capability is scalability - at the component level, system level, middleware and the application level. A Co-Design approach between the development of the software libraries and the underlying hardware can help to overcome those scalability issues and to enable a more efficient design approach towards the Exascale goal.

3:00 PM

Cray Inc.
Cray SHMEM and OpenSHMEM
David Knaak

Cray Inc. is committed to supporting its HPC customers with a high-performance, high-quality implementation of the OpenSHMEM API. Our customers' needs often require that we pioneer enhancements to the API. We usually design and implement these enhancements as soon as possible to meet those needs, but we also work with the OpenSHMEM Committee to define the APIs so that they also meet the needs of the broader set of OpenSHMEM users.

This vendor presentation will give an overview of recent OpenSHMEM extensions and the Cray specific extensions available in Cray SHMEM. Audience participation will be encouraged for discussion of these and other possible extensions.

3:30 PM – 3:45 PM

Break

3:45 PM

Intel
To Exascale and Beyond: Intel's Scalable System Framework and OpenSHMEM
James Dinan

To Exascale and Beyond!
OpenSHMEM has earned a reputation as a highly scalable, high-performance parallel programming model. The goal of this session is to generate a community discussion around recent developments and upcoming trends in system architecture, and around the technical innovations that need to be driven into OpenSHMEM to ensure our continued success.

4:15 PM

Allinea Software, Ltd.
Ryan Hulguin

What's new in Allinea's development tools for OpenSHMEM? In this talk, the OpenSHMEM capabilities of Allinea Forge will be explored, including the debugger DDT and the recently added support for OpenSHMEM and Cray SHMEM within the MAP profiler. Other recent additions to the tools will also be covered. Additionally, an outline of what can be achieved by developers looking to solve software problems and to get more performance from OpenSHMEM code will be presented.

4:45 PM

Paratools
Profiling Production OpenSHMEM Applications
John C. Linford

Developing high performance OpenSHMEM applications routinely involves gaining a deeper understanding of software execution, yet there are numerous hurdles to gathering performance metrics in a production environment. Most OpenSHMEM performance profilers rely on the PSHMEM interface but PSHMEM is an optional and often unavailable feature. We present a tool that generates direct measurement performance profiles of OpenSHMEM applications even when PSHMEM is unavailable. The tool operates on dynamically-linked and statically-linked application binaries, does not require debugging symbols, and functions regardless of compiler optimization level. Integrated in the TAU Performance System, the tool uses automatically generated wrapper libraries that intercept OpenSHMEM API calls to gather performance metrics with minimal overhead. Dynamically-linked applications may use the tool without modifying the application binary in any way. We demonstrate the tool in a production OpenSHMEM environment and present performance profiles of applications executing on Titan and Stampede.
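
As an illustration of the general interception technique (not TAU's actual implementation), a preloaded wrapper for a dynamically-linked application might look like the sketch below: the wrapper records its measurements and then forwards the call to the real library through dlsym(RTLD_NEXT, ...).

    /* Illustrative only: build as a shared library and load with LD_PRELOAD
       (link with -ldl on Linux). */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stddef.h>

    void shmem_putmem(void *dest, const void *source, size_t nelems, int pe) {
        static void (*real_putmem)(void *, const void *, size_t, int) = NULL;
        if (!real_putmem)
            real_putmem = (void (*)(void *, const void *, size_t, int))
                              dlsym(RTLD_NEXT, "shmem_putmem");

        /* start timer, record bytes and target PE here ... */
        real_putmem(dest, source, nelems, pe);
        /* stop timer and update the profile here ... */
    }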

 

Wednesday, August 3rd

7:30 AM – 8:50 AM

Working Breakfast
Agenda: Meet and Greet with Jim Sexton

8:50 AM – 9:00 AM

Opening Remarks and Introduction
Manjunath Gorentla Venkata

9:00 AM

Keynote: IBM's Directions for Data Centric Systems
James C Sexton, IBM

Recent years have seen a very significant inflection point in computer systems design to tackle the new complexities that arise in the modeling, simulation, and analysis of complex data. IBM has adopted a direction for future systems design that is data centric in approach and that seeks to develop, through co-design, solutions that can deliver extreme performance for big data analytics. This presentation will describe IBM's data centric systems approach and discuss the critical challenge that all emerging system designs must address: providing a usable, portable, performance-oriented programming approach for systems that are, due to technology constraints, complex and heterogeneous in makeup.

Session 3: OpenSHMEM Implementation and Use Cases
Steve Poole, Session Chair

10:00 AM

Design and Implementation of OpenSHMEM using OFI on the Aries interconnect
Kayla Seager, Sung-Eun Choi, Jim Dinan, Howard Pritchard and Sayantan Sur

Sandia OpenSHMEM (SOS) is an implementation of the OpenSHMEM specification that has been designed to provide portability, scalability, and performance on high-speed RDMA fabrics. Libfabric is an implementation of the newly proposed Open Fabrics Interfaces (OFI), designed to provide a tight semantic match between HPC programming models and various underlying fabric services.

In this paper, we present the design and evaluation of the SOS OFI transport on Aries, a contemporary, high-performance RDMA interconnect. The implementation of Libfabric on Aries uses uGNI as the lowest-level software interface to the interconnect. uGNI is a generic interface that can support both message passing and one-sided programming models. We compare the performance of our work with that of the Cray SHMEM library and demonstrate that our implementation provides performance and scalability comparable to that of a highly tuned, production SHMEM library. Additionally, the Libfabric message injection feature enabled SOS to achieve a performance improvement over Cray SHMEM for small messages in bandwidth and random access benchmarks.

10:30 AM

OpenSHMEM-UCX: Evaluation of UCX for implementing OpenSHMEM Programming Model
Matthew Baker, Ferrol Aderholdt, Manjunath Gorentla Venkata and Pavel Shamis

The OpenSHMEM reference implementation was developed towards the goal of providing an open-source and high-performing OpenSHMEM implementation. To achieve portability and performance across various networks, the OpenSHMEM reference implementation uses GASNet and UCCS for network operations. Recently, new network layers have emerged with the promise of providing high performance, scalability, and portability for HPC applications. In this paper, we extend the OpenSHMEM reference implementation to use the UCX framework for network operations. Then, we evaluate its performance and scalability on Cray XK systems to understand UCX's suitability for developing the OpenSHMEM programming model. Further, we develop a benchmark called SHOMS for evaluating the OpenSHMEM implementation.

Our experimental results show that OpenSHMEM-UCX outperforms the vendor-supplied OpenSHMEM implementation in most cases on the Cray XK system, with a 40% higher message rate and 70% faster execution of the application kernel. However, the vendor-supplied OpenSHMEM has better latency than the OpenSHMEM-UCX implementation.

11:00 AM – 11:30 AM

Break

11:30 AM

SHMemCache: Enabling Memcached on the OpenSHMEM Global Address Model
Huansong Fu, Kunal Singharoy, Manjunath Gorentla Venkata, Yue Zhu and Weikuan Yu

Memcached is a popular key-value memory store for big data applications. Its performance and scalability are directly related to the underlying run-time systems, including the communication protocols. OpenSHMEM is a strong run-time system that supports data access to both local and remote memory through a simple shared-memory addressing model. In view of the communication compatibility between Memcached and OpenSHMEM, we propose to leverage the programmability and portability of OpenSHMEM to support Memcached on a wide variety of HPC systems. In this paper, we present the design and implementation of SHMemCache, an OpenSHMEM-based communication conduit for Memcached, which can expand the deployment scope of Memcached to various leadership facilities with an OpenSHMEM run-time. Our experimental results show that SHMemCache achieves similar performance for SET and GET operations compared to the existing Memcached.

12:00 PM

An OpenSHMEM Implementation for the Adapteva Epiphany Coprocessor
James Ross and David Richie

This paper reports the implementation and performance evaluation of the OpenSHMEM 1.3 specification for the Adapteva Epiphany architecture within the Parallella single-board computer. The Epiphany architecture exhibits massive many-core scalability with a physically compact 2D array of RISC CPU cores and a fast network-on-chip (NoC). While the architecture is fully capable of MPMD execution, the physical topology and memory-mapped capabilities of the cores and network translate well to Partitioned Global Address Space (PGAS) programming models and SPMD execution with SHMEM.

12:30 PM – 1:30 PM

Working Lunch
Agenda: Paper Q&A (cont.) and networking

1:30 PM

Invited Talk: Designing OpenSHMEM and Hybrid MPI+OpenSHMEM Libraries for Exascale Systems: MVAPICH2-X Experience
Dhabaleswar K. (DK) Panda (Introduction by Barney Maccabe)

This talk will focus on challenges in designing scalable and high-performance OpenSHMEM and hybrid MPI+OpenSHMEM libraries for exascale systems. Motivations, features, and design guidelines for supporting OpenSHMEM and hybrid MPI and PGAS (including OpenSHMEM, UPC and CAF) programming models with the MVAPICH2-X library will be presented. The role of a unified communication runtime in supporting OpenSHMEM and hybrid programming models on InfiniBand, NVIDIA GPGPUs (while exploiting GPUDirect RDMA and CUDA-aware OpenSHMEM), and Intel Xeon Phi will be outlined. Unique capabilities of the hybrid MPI+PGAS model to re-design HPC applications to harness performance and scalability will also be presented through a set of case studies.

Session 4: Hybrid Programming and Benchmarking
Barney Maccabe, Session Chair

2:30 PM

A Comprehensive Evaluation of Thread-Safe and Contexts-Domains Features in Cray SHMEM
Naveen Namashivayam, David Knaak, Bob Cernohous, Nick Radcliffe and Mark Pagel

OpenSHMEM is a library interface specification that is the culmination of a unification effort among various implementers and users in the SHMEM programming community. To standardize the interaction between OpenSHMEM calls and threads, "Thread-safe" and "Contexts" are the two major proposals. Cray SHMEM is a vendor-specific OpenSHMEM implementation from Cray Inc. It supports a working model of features from the "Thread-safe" proposal as non-standard SHMEMX-prefixed extensions. As part of the work described in this paper, we developed a prototype version of features from the "Contexts" proposal in Cray SHMEM. In this paper, we use the existing thread-safe features and the prototyped context features in Cray SHMEM to provide a comprehensive design analysis of the two proposals. We also analyze the possible co-existence of extensions from these two proposals. For the performance study, we use a modified version of the OSU Microbenchmarks, along with an implementation of an all-to-all collective communication pattern using the new extensions. To the best of our knowledge, this is the first paper to compare and contrast these two proposals in OpenSHMEM.
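
For illustration, the sketch below shows per-thread communication contexts in the form in which the "Contexts" proposal was later standardized (OpenSHMEM 1.4: shmem_init_thread, shmem_ctx_create, shmem_ctx_putmem, shmem_ctx_quiet); the SHMEMX-prefixed prototypes evaluated in this paper may differ in naming and detail.

    #include <shmem.h>
    #include <omp.h>

    #define N 512
    static char dst[64][N];   /* symmetric targets, one slot per thread
                                 (assumes at most 64 OpenMP threads) */

    int main(void) {
        int provided;
        shmem_init_thread(SHMEM_THREAD_MULTIPLE, &provided);
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        #pragma omp parallel
        {
            int  tid = omp_get_thread_num();
            char src[N];
            for (int i = 0; i < N; i++) src[i] = (char)tid;

            /* Each thread drives its own context, so its operations complete
               independently of other threads' traffic. */
            shmem_ctx_t ctx;
            if (shmem_ctx_create(SHMEM_CTX_PRIVATE, &ctx) == 0) {
                shmem_ctx_putmem(ctx, dst[tid], src, N, (me + 1) % npes);
                shmem_ctx_quiet(ctx);       /* complete only this context */
                shmem_ctx_destroy(ctx);
            }
        }

        shmem_barrier_all();
        shmem_finalize();
        return 0;
    }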

3:00 PM

OpenCL + OpenSHMEM Hybrid Programming Model for the Adapteva Epiphany Architecture
David Richie and James Ross

There is interest in exploring hybrid OpenSHMEM + X programming models to extend the applicability of the OpenSHMEM interface to more hardware architectures. We present a hybrid OpenCL + OpenSHMEM programming model for device-level programming of architectures like the Adapteva Epiphany many-core RISC array processor. The Epiphany architecture comprises a 2D array of low-power RISC cores with minimal un-core functionality connected by a 2D mesh Network-on-Chip (NoC). The Epiphany architecture offers high computational energy efficiency for integer and floating point calculations as well as parallel scalability. The Epiphany-III is available as a coprocessor in platforms that also utilize an ARM CPU host. OpenCL provides good functionality for supporting a co-design programming model in which the host CPU offloads parallel work to a coprocessor. However, the OpenCL memory model is inconsistent with the Epiphany memory architecture and lacks support for inter-core communication. We propose a hybrid programming model in which OpenSHMEM provides a better solution by replacing the non-standard OpenCL extensions that were previously introduced to achieve high performance on the Epiphany architecture. We demonstrate the proposed programming model with a matrix-matrix multiplication based on Cannon's algorithm, showing that the hybrid model addresses the deficiencies of using OpenCL alone in achieving good benchmark performance.

3:30 PM – 3:45 PM

Break

3:45 PM

OpenSHMEM Implementation of HPCG Benchmark
Eduardo D'Azevedo, Sarah Powers and Neena Imam

We describe the effort to implement the High Performance Conjugate Gradient (HPCG) benchmark using OpenSHMEM and MPI one-sided communication. Unlike the High Performance LINPACK (HPL) benchmark, which places emphasis on large dense matrix computations, the HPCG benchmark is dominated by sparse operations such as sparse matrix-vector products, sparse triangular solves, and long vector operations. The MPI one-sided implementation is developed using the one-sided OpenSHMEM implementation. Preliminary results comparing the original MPI, OpenSHMEM, and MPI one-sided implementations on an SGI cluster, a Cray XK7, and a Cray XC30 are presented. The results suggest that the MPI, OpenSHMEM, and MPI one-sided implementations all obtain similar overall performance, but the MPI one-sided implementation seems to slightly increase the run time for multigrid preconditioning in HPCG on the Cray XK7 and Cray XC30.

4:15 PM

Using Hybrid Model OpenSHMEM + CUDA to Implement the SHOC Benchmark Suite
Megan Grodowitz, Eduardo D'Azevedo, Sarah Powers and Neena Imam

This work describes the process of porting the Scalable Heterogeneous Computing (SHOC) benchmark suite from the hybrid MPI+CUDA implementation to OpenSHMEM+CUDA. SHOC includes a wide variety of benchmark kernels used to measure accelerator performance in both single node and cluster configurations. The hybrid model implementation attempts to place all major computation on accelerator devices, and uses MPI to synchronize and aggregate results. In some cases, MPI Groups are used to gradually reduce the number of accelerators used for computation as the problem size drops. Porting this behavior to OpenSHMEM required implementing several synchronizing collective operations, and using SHMEM teams to replace MPI Group functionality. Benchmark results on a Cray XK7 system with one GPU per compute node show that SHMEM performance is equal to MPI performance in these hybrid tasks. These results and porting experience show that using OpenSHMEM for accelerator devices benefits from adding functionality for synchronization and teams, and would further benefit from adding support for communication within accelerator kernels.

Session 5: Works In Progress – Lightning Talks
Pavel Shamis, Session Chair

4:45 PM

Benchmarking OpenSHMEM Multi-threaded performance
Hans Weeks, Matthew Dosanjh, Patrick Bridges and Ryan Grant

5:00 PM

Investigating Data Motion Power Trends to Enable Power-Efficient OpenSHMEM Implementations
Tiffany M. Mintz, Eduardo D'Azevedo, Manjunath Gorentla Venkata and Chung-Hsing Hsu

5:15 PM

Closing Remarks: Neena Imam

 

Thursday, August 4th

7:30 AM – 9:00 AM

Working Breakfast
Agenda: Morning Networking - Roundtable

9:00 AM – 9:30 AM

General Announcements 

9:30 AM – 10:00 AM

Voting on Finalized Tickets

10:00 AM – 11:00 AM

Specification Discussion: Reading Tickets/Discussions

11:00 AM – 11:30 AM

Break

11:30 AM – 12:30 PM

Specification Discussion: Reading Tickets/Discussions (cont.)

12:30 PM – 1:30 PM

Working Lunch 
Agenda: Specification Q&A (cont.) and networking

1:30 PM – 3:00 PM

Specification Discussion: Reading Tickets/Discussions (cont.)

3:00 PM – 3:30 PM

Break

3:30 PM – 4:00 PM

Specification Discussion: Reading Tickets/Discussions (cont.)

Finalized Tickets:

#205

Ballot Items:
1. Elect Steve Poole to the post of OpenSHMEM committee chair
2. Elect Manjunath Gorentla Venkata to the post of OpenSHMEM committee secretary

Tickets:

#169 – Pasha Shamis
#202 – Bryant Lam
#212, #217 – Manjunath Gorentla Venkata
#211 – James Dinan
#189, #216, #222, #223 – Nicholas Park

Discussion Topics:

Explicit RMA
Threading Model
OpenSHMEM Teams