OpenSHMEM 2018
AGENDA
Tuesday | Wednesday | Thursday
Tuesday

8:00 AM |
Working Breakfast and Registration |
9:00 AM |
Welcome: Neena Imam, Oak Ridge National Laboratory |
9:15 AM |
Invited Talk: InfiniBand In-Network Computing Technology and Roadmap |
10:15 AM |
Break |
Session 1: OpenSHMEM Implementations and API |
|
10:30 AM |
Design and Optimization of OpenSHMEM 1.4 for the Intel® Omni-Path Fabric 100 Series The OpenSHMEM 1.4 specification recently introduced support for multithreaded hybrid programming and a new communication management API. Together, these features enable users to manage communications performed by multiple threads within an OpenSHMEM process and to overlap communication and computation to hide costly latencies. To realize these benefits, OpenSHMEM implementations must efficiently map this broad new space of usage models to the underlying fabric. In this paper, we present an implementation of OpenSHMEM 1.4 for the Intel® Omni-Path Fabric 100 Series. The OpenFabrics Interfaces (OFI) libfabric is used as the low-level fabric API in conjunction with the Intel® Performance Scaled Messaging 2 (PSM2) fabric provider. We identify strategies for effectively managing shared transmission resources using libfabric, as well as for managing the communication requirements of the PSM2 layer. We study the performance of our implementation, identify design tradeoffs that are influenced by application behavior, and explore application-level optimizations that can be used to achieve the best performance. |
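As a brief, hedged illustration of the two OpenSHMEM 1.4 features this paper builds on (thread support and communication contexts), a minimal sketch using the standard 1.4 API might look like the following; it requires an OpenSHMEM 1.4 runtime to compile and run, and error handling is omitted:

```c
#include <shmem.h>

int main(void) {
    /* Request thread support, new in OpenSHMEM 1.4. */
    int provided;
    shmem_init_thread(SHMEM_THREAD_MULTIPLE, &provided);

    static long src = 42, dst = 0;   /* symmetric variables */

    /* A thread can create a private context so its operations can map
       to dedicated transmit resources on the fabric. */
    shmem_ctx_t ctx;
    if (shmem_ctx_create(SHMEM_CTX_PRIVATE, &ctx) == 0) {
        int target = (shmem_my_pe() + 1) % shmem_n_pes();
        shmem_ctx_long_put(ctx, &dst, &src, 1, target);
        shmem_ctx_quiet(ctx);   /* complete only this context's operations */
        shmem_ctx_destroy(ctx);
    }

    shmem_finalize();
    return 0;
}
```

Because `shmem_ctx_quiet` completes only one context's operations, each thread can overlap its communication with computation independently of the others, which is the usage model the paper maps onto libfabric.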
11:00 AM |
Introducing Cray OpenSHMEMX - A Modular Multi-Communication Layer OpenSHMEM Implementation SHMEM has a long history as a parallel programming model and has been in wide use since 1993, starting with Cray T3D systems. Over the past two decades, the SHMEM library implementation on Cray systems has evolved through several generations. The current generation for Cray XC and XK systems, called Cray SHMEM, is a proprietary SHMEM implementation from Cray Inc. In this work, we provide an in-depth analysis of the need for a new SHMEM implementation and then introduce the next evolution of Cray SHMEM for current and future generations of Cray systems. We call this new implementation Cray OpenSHMEMX. We provide a brief design overview, along with a review of the functional and performance differences between Cray OpenSHMEMX and the existing Cray SHMEM implementation. |
11:30 AM |
An Initial Implementation of Libfabric Conduit for OpenSHMEM-X As a representative of Partitioned Global Address Space models, OpenSHMEM provides a variety of functionalities including one-sided communication, atomic operations, and collective routines. The communication layer of OpenSHMEM-X plays a crucial role in delivering these functionalities. OFI Libfabric is an open-source network library that supports portable low-latency interfaces from different fabric providers while minimizing the semantic gap across API endpoints. In this paper, we present the design and implementation of an OpenSHMEM-X communication conduit using Libfabric. This Libfabric conduit is designed to support a broad range of network providers while achieving excellent network performance and scalability. We have performed an extensive set of experiments to validate the performance of our implementation and compared it with the Sandia OpenSHMEM implementation. Our results show that the Libfabric conduit improves communication bandwidth on the socket provider by up to 42% for put operations and 11% for get operations. In addition, our implementation of atomic operations achieves latency similar to that of the Sandia implementation. |
12:00 PM |
The OpenFAM API: a programming model for disaggregated persistent memory Recent technology advances in high-density, byte-addressable non-volatile memory (NVM) and low-latency interconnects have enabled building large-scale systems with a large disaggregated fabric-attached memory (FAM) pool shared across heterogeneous and decentralized compute nodes. In this model, compute nodes are decoupled from FAM, which allows separate evolution and scaling of processing and fabric-attached memory. The large capacity of the FAM pool means that large working sets can be maintained as in-memory data structures. The fact that all compute nodes share a common view of memory means that data sharing and communication may be done efficiently through shared memory, without requiring explicit messages to be sent over heavyweight network protocol stacks. Additionally, data sets no longer need to be partitioned between compute nodes, as is typically done in clustered environments. Any compute node can operate on any data item, which enables more dynamic and flexible load balancing. The OpenFAM API is an API for programming with persistent FAM that is inspired by the OpenSHMEM partitioned global address space (PGAS) model. Unlike OpenSHMEM, where each node contributes local memory toward a logically shared global address space, FAM isn't associated with a particular node and can be addressed directly from any node without the cooperation or involvement of another node. The OpenFAM API enables programmers to manage memory allocations, access FAM-resident data structures, and order FAM operations. Because state in FAM can survive program termination, the API also provides interfaces for naming and managing data beyond the lifetime of a single program invocation. |
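As a rough sketch of the programming style the abstract describes, the fragment below illustrates allocating, rediscovering, and accessing fabric-attached memory by name. The type and function names here are hypothetical stand-ins invented for illustration, not the actual OpenFAM API:

```c
/* Hypothetical sketch of the disaggregated-FAM style described above.
 * All names below (fam_desc_t, fam_alloc_named, fam_lookup_named,
 * fam_get, fam_put, fam_fence) are illustrative stand-ins, NOT the
 * real OpenFAM API. */
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t region_id; uint64_t offset; } fam_desc_t; /* hypothetical */

fam_desc_t fam_alloc_named(const char *name, size_t nbytes); /* persists past exit */
fam_desc_t fam_lookup_named(const char *name);               /* rediscover by name */
void fam_get(void *local, fam_desc_t d, size_t off, size_t n);
void fam_put(fam_desc_t d, size_t off, const void *local, size_t n);
void fam_fence(void);                                        /* order FAM operations */

void example(void) {
    /* Any compute node can address this allocation directly, without the
     * cooperation of another node -- unlike OpenSHMEM, where memory is
     * contributed by and associated with a particular PE. */
    fam_desc_t table = fam_lookup_named("checkpoint/table");
    double row[128];
    fam_get(row, table, 0, sizeof row);  /* read FAM-resident data     */
    row[0] += 1.0;
    fam_put(table, 0, row, sizeof row);  /* write back; state survives */
    fam_fence();                         /* order before later writes  */
}
```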
12:30 PM |
Working Lunch |
Session 2: Simulators, Emulators, and OpenSHMEM Collectives |
|
1:30 PM |
Towards Lightweight and Scalable Simulation of Large-Scale OpenSHMEM Applications |
2:00 PM |
Scaling OpenSHMEM for Massively Parallel Processor Arrays The use of OpenSHMEM has traditionally focused on supporting a one-sided communication mechanism between networked processors. The US Army Research Laboratory (ARL) OpenSHMEM implementation for the Epiphany architecture has highlighted the utility of OpenSHMEM for the precise control of on-die data movement within arrays of RISC cores connected by a 2D mesh Network on Chip (NoC), and was demonstrated using a 16-core Epiphany-III co-processor. More recently, DARPA has fabricated a much larger 64-bit 1,024-core Epiphany-V device, which ARL is presently evaluating. In support of this effort, we have developed an Epiphany-based RISC SoC device emulator that can be installed as a virtual device on an ordinary x86 platform and utilized with the existing software stack used to support physical devices, thus creating a seamless software development environment capable of targeting new processor designs just as they would be interfaced on a real platform. As massively parallel processor arrays (MPPAs) emerge as a strong contender for future exascale architectures, we investigate the application of OpenSHMEM as a programming model for processors with hundreds to thousands of cores. In this work we report on the initial results from scaling up the ARL OpenSHMEM implementation using virtual RISC processors with much larger core counts than previous physical devices. |
2:30 PM |
Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM Graphics Processing Units (GPUs) are well-known for their massive parallelism and high bandwidth memory for data-intensive applications. In this context, GPU-based In-Memory Key-Value (G-IMKV) Stores have been proposed to take advantage of GPUs' capability to achieve high-throughput data processing. The state-of-the-art frameworks in this area batch requests on the CPU at the server before launching a compute kernel to process operations on the GPU. They also use explicit data movement operations between the CPU and GPU. The startup overhead of compute kernel launches and memory copies limits the throughput of these frameworks unless operations are batched into large groups. In this paper, we propose the use of persistent GPU compute kernels and of OpenSHMEM to maximize GPU and network utilization with smaller batch sizes. This also helps improve the response time observed by clients while still achieving high throughput at the server. Specifically, clients and servers use OpenSHMEM primitives to move data by avoiding copies, and the server interacts with a persistently running compute kernel on its GPU to delegate various key-value store operations efficiently to streaming multi-processors. |
3:00 PM |
SHCOLL - a Standalone Implementation of OpenSHMEM-style Collectives API The performance of collective operations has a large impact on overall performance in many HPC applications. Implementing multiple algorithms and selecting the optimal one based on message size and the number of processes involved in the operation is essential for good performance. In this paper, we present SHCOLL, a collective routines library developed on top of OpenSHMEM point-to-point operations: puts, gets, atomic memory updates, and memory synchronization routines. The library is designed to serve as a plug-in to OpenSHMEM implementations and will be used by the OSSS OpenSHMEM reference implementation to support OpenSHMEM collective operations. We describe the algorithms incorporated in the implementation of each OpenSHMEM collective routine and evaluate them on a Cray XC30 system. For long messages, SHCOLL shows an improvement of up to a factor of 12 over the vendor's implementation. We also discuss future development of the library, as well as how it will be incorporated into the OSSS OpenSHMEM reference implementation. |
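To illustrate the layering the abstract describes, a collective can be expressed purely in terms of OpenSHMEM point-to-point and synchronization primitives. The deliberately naive linear broadcast below is a hedged sketch of that idea, not SHCOLL's actual code (SHCOLL selects among several, more scalable algorithms):

```c
#include <shmem.h>
#include <string.h>

/* A simple linear broadcast built only from OpenSHMEM point-to-point
 * primitives, in the spirit of SHCOLL's layering. Illustrative sketch
 * only; a real library would also offer tree- and ring-based variants. */
void broadcast_linear(void *dest, const void *source, size_t nbytes, int root) {
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    if (me == root) {
        for (int pe = 0; pe < npes; pe++)
            if (pe != root)
                shmem_putmem(dest, source, nbytes, pe);  /* one-sided put */
        shmem_quiet();        /* ensure all puts have been delivered */
        memcpy(dest, source, nbytes);
    }
    shmem_barrier_all();      /* non-root PEs wait for delivery */
}
```

The linear algorithm costs O(npes) puts at the root, which is exactly the kind of tradeoff (message size versus process count) that motivates implementing and selecting among multiple algorithms.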
3:30 PM |
Coffee Break |
Session 3: OpenSHMEM with Data Intensive Computation |
|
4:00 PM |
HOOVER: Distributed, Flexible, and Scalable Streaming Graph Processing on OpenSHMEM Many problems can benefit from being phrased as graph processing or graph analytics problems: infectious disease modeling, insider threat detection, fraud prevention, social network analysis, and more. These problems all share a common property: the relationships between entities in these systems are crucial to understanding the overall behavior of the systems themselves. However, these relationships are rarely, if ever, static. As our ability to collect information on them improves (e.g., on financial transactions in fraud prevention), the value added by large-scale, high-performance, dynamic/streaming (rather than static) graph analysis becomes significant. This paper introduces HOOVER, a distributed software framework for large-scale, dynamic graph modeling and analysis. HOOVER sits on top of OpenSHMEM, a PGAS programming system, and enables users to plug in application-specific logic while handling all runtime coordination of computation and communication. HOOVER has demonstrated scaling out to 24,576 cores and is flexible enough to support a wide range of graph-based applications, including infectious disease modeling and anomaly detection. |
4:30 PM |
Tumbling Down the GraphBLAS Rabbit Hole with SHMEM In this talk we present shgraph, a SHMEM implementation of the GraphBLAS standard, which enables the user to redefine complex graph algorithms in terms of simple linear algebra primitives. It offers many nice features such as type abstraction, the ability to perform generalized matrix/vector operations over a semiring, and executing graph operations out-of-order (non-blocking mode). |
Wednesday

8:00 AM |
Working Breakfast and Registration |
|
9:00 AM |
Welcome and Introduction of Invited Talk: Manjunath Gorentla Venkata, ORNL |
|
Session 4: Invited Talk |
9:15 AM |
Invited Talk: Formalising the ARMv8 Memory Consistency Model Armv8 introduced a radical change to the memory consistency model of the architecture by requiring that a store to memory become visible to all other threads at the same time. This property, known as other-multicopy atomicity, simplifies the memory model definition and supports straightforward, compositional reasoning about concurrent programs. The memory model is now specified such that the architectural text maps directly to an executable, axiomatic model which can be used to verify properties of both concurrent software and processor designs. |
|
10:15 AM |
Break |
|
Session 5: Use of OpenSHMEM Applications and Profilers |
|
|
10:30 AM |
Tracking Memory Usage in OpenSHMEM Runtimes with the TAU Performance System As the exascale era approaches, it is becoming increasingly important that runtimes be able to scale to very large numbers of processing elements. However, by keeping arrays of sizes proportional to the number of PEs, an OpenSHMEM implementation may be limited in its scalability to millions of PEs. In this paper, we describe techniques for tracking memory usage by OpenSHMEM runtimes, including attributing memory usage to runtime objects according to type, maintaining data about hierarchical relationships between objects, and identifying the source lines on which allocations occur. We implement these techniques in the TAU Performance System using atomic and context events and demonstrate their use in OpenSHMEM applications running within the Open MPI runtime, collecting both profile and trace data. We describe how we will use these tools to identify memory scalability bottlenecks in OpenSHMEM runtimes. |
|
11:00 AM |
Lightweight Instrumentation and Analysis using OpenSHMEM Performance Counters |
|
11:30 AM |
Oak Ridge OpenSHMEM Benchmark Suite The assessment of application performance is a fundamental task in high-performance computing (HPC). The OpenSHMEM Benchmark (OSB) suite is a collection of micro-benchmarks and mini-applications/compute kernels that have been ported to use OpenSHMEM. Some, like the NPB OpenSHMEM benchmarks, have been published before, while most others have been used for evaluations but never formally introduced or discussed. This suite brings them together and is useful for assessing the performance of different use cases of OpenSHMEM. It offers system implementers a useful means of measuring performance and assessing the effects of new features as well as implementation strategies. The suite is also useful for application developers to assess the performance of the growing number of OpenSHMEM implementations that are emerging. In this paper, we describe the current set of codes available within the OSB suite, how they are intended to be used, and, where possible, a snapshot of their behavior on one of the OpenSHMEM implementations available to us. We also include detailed descriptions of every benchmark and kernel, focusing on how OpenSHMEM was used. This includes details on the enhancements we made to the benchmarks to support multithreaded variants. We encourage the OpenSHMEM community to use, review, and provide feedback on the benchmarks. |
|
12:00 PM |
OpenSHMEM Sets and Groups: An Approach to Worksharing and Memory Management Collective operations in the OpenSHMEM programming model are defined over an active set, which is a grouping of Processing Elements (PEs) based on a triple of information including the starting PE, a log2 stride, and the size of the active set. In addition to the active set, collectives require Users to allocate and initialize synchronization (i.e., pSync) and scratchpad (i.e., pWrk) buffers for use by the collective operations. While active sets and the user-defined buffers were previously useful based on hardware and algorithmic considerations, future systems and applications require us to re-evaluate these concepts. In this paper, we propose Sets and Groups as abstractions to create persistent, flexible groupings of PEs (i.e., Sets) and couple these groups of PEs with memory spaces (i.e., Groups), which remove the allocation and initialization burden from the User. To evaluate Sets and Groups, we perform multiple micro-benchmarks to determine the overhead of these abstractions and demonstrate their utility by implementing a distributed All-Pairs Shortest Path (APSP) application, which we evaluate using multiple synthetic and real-world graphs. |
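To make concrete the legacy interface this paper re-evaluates, the sketch below uses the standard active-set collective API, where the caller supplies the (PE_start, logPE_stride, PE_size) triple plus a user-allocated and user-initialized pSync buffer. It is a minimal illustration, assuming an even number of PEs, and requires an OpenSHMEM runtime to run:

```c
#include <shmem.h>

/* pSync must be symmetric and initialized by the user before use --
 * exactly the burden that Sets and Groups aim to remove. */
long pSync[SHMEM_BCAST_SYNC_SIZE];

int main(void) {
    shmem_init();

    for (int i = 0; i < SHMEM_BCAST_SYNC_SIZE; i++)
        pSync[i] = SHMEM_SYNC_VALUE;

    static long src = 7, dst = 0;     /* symmetric variables */
    shmem_barrier_all();              /* make pSync init visible everywhere */

    /* Active set = every even-numbered PE: start 0, log2 stride 1
     * (i.e., stride 2), size n_pes/2 (assumes an even PE count). */
    if (shmem_my_pe() % 2 == 0)
        shmem_broadcast64(&dst, &src, 1, /*PE_root=*/0,
                          /*PE_start=*/0, /*logPE_stride=*/1,
                          /*PE_size=*/shmem_n_pes() / 2, pSync);

    shmem_finalize();
    return 0;
}
```

Note that the triple can only express power-of-two-strided groupings and that every collective call site must manage its own pSync lifetime, which motivates the persistent, flexible abstractions the paper proposes.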
|
12:30 PM |
Working Lunch |
|
1:30 PM |
OpenSHMEM Specification and Teams/Collectives Working Group Meeting |
|
|
3:30 PM |
Coffee Break |
|
|
4:00 PM |
OpenSHMEM Specification and Teams/Collectives Working Group Meeting (Continued) |
|
|
5:00 PM |
Adjourn for the day |
|
Thursday

8:00 AM |
Working Breakfast and Registration |
9:00 AM |
OpenSHMEM Specification Meeting |
|
10:30 AM |
Break |
11:00 AM |
OpenSHMEM Specification Meeting |
|
12:00 PM |
Closing Remarks and Adjourn: Neena Imam, Oak Ridge National Laboratory |