OpenSHMEM 2018
AGENDA
Tuesday | Wednesday | Thursday
Tuesday

8:00 AM |
Working Breakfast and Registration |
9:00 AM |
Welcome: Neena Imam, Oak Ridge National Laboratory |
9:15 AM |
Invited Talk: InfiniBand In-Network Computing Technology and Roadmap |
10:15 AM |
Break |
Session 1: OpenSHMEM Implementations and API |
|
10:30 AM |
Design and Optimization of OpenSHMEM 1.4 for the Intel® Omni-Path Fabric 100 Series The OpenSHMEM 1.4 specification recently introduced support for multithreaded hybrid programming and a new communication management API. Together, these features enable users to manage communications performed by multiple threads within an OpenSHMEM process and to overlap communication and computation to hide costly latencies. To realize these benefits, OpenSHMEM implementations must efficiently map this broad new space of usage models to the underlying fabric. In this paper, we present an implementation of OpenSHMEM 1.4 for the Intel® Omni-Path Fabric 100 Series. The OpenFabrics Interfaces (OFI) libfabric is used as the low-level fabric API in conjunction with the Intel® Performance Scaled Messaging 2 (PSM2) fabric provider. We identify strategies for effectively managing shared transmission resources using libfabric, as well as for managing the communication requirements of the PSM2 layer. We study the performance of our implementation, identify design tradeoffs that are influenced by application behavior, and explore application-level optimizations that can be used to achieve the best performance. |
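As a brief, hedged illustration of the two OpenSHMEM 1.4 features this paper builds on (thread support and communication contexts), a minimal sketch using the standard 1.4 API might look like the following; it requires an OpenSHMEM 1.4 runtime to compile and run, and error handling is omitted:

```c
#include <shmem.h>

int main(void) {
    /* Request thread support, new in OpenSHMEM 1.4. */
    int provided;
    shmem_init_thread(SHMEM_THREAD_MULTIPLE, &provided);

    static long src = 42, dst = 0;   /* symmetric variables */

    /* A thread can create a private context so its operations can map
       to dedicated transmit resources on the fabric. */
    shmem_ctx_t ctx;
    if (shmem_ctx_create(SHMEM_CTX_PRIVATE, &ctx) == 0) {
        int target = (shmem_my_pe() + 1) % shmem_n_pes();
        shmem_ctx_long_put(ctx, &dst, &src, 1, target);
        shmem_ctx_quiet(ctx);   /* complete only this context's operations */
        shmem_ctx_destroy(ctx);
    }

    shmem_finalize();
    return 0;
}
```

Because `shmem_ctx_quiet` completes only one context's operations, each thread can overlap its communication with computation independently of the others, which is the usage model the paper maps onto libfabric.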
11:00 AM |
Introducing Cray OpenSHMEMX - A Modular Multi-Communication Layer OpenSHMEM Implementation SHMEM has a long history as a parallel programming model and has been in wide use since 1993, starting with Cray T3D systems. Over the past two decades, the SHMEM library implementation on Cray systems has evolved through several generations. The current generation for Cray XC and XK systems, called Cray SHMEM, is a proprietary SHMEM implementation from Cray Inc. In this work, we provide an in-depth analysis of the need for a new SHMEM implementation and then introduce the next evolution of Cray SHMEM for current and future generations of Cray systems. We call this new implementation Cray OpenSHMEMX. We provide a brief design overview, along with a review of the functional and performance differences between Cray OpenSHMEMX and the existing Cray SHMEM implementation. |
11:30 AM |
An Initial Implementation of Libfabric Conduit for OpenSHMEM-X As a representative of Partitioned Global Address Space models, OpenSHMEM provides a variety of functionalities including one-sided communication, atomic operations, and collective routines. The communication layer of OpenSHMEM-X plays a crucial role in delivering these functionalities. OFI Libfabric is an open-source network library that supports portable low-latency interfaces from different fabric providers while minimizing the semantic gap across API endpoints. In this paper, we present the design and implementation of an OpenSHMEM-X communication conduit using Libfabric. This Libfabric conduit is designed to support a broad range of network providers while achieving excellent network performance and scalability. We have performed an extensive set of experiments to validate the performance of our implementation and compared it with the Sandia OpenSHMEM implementation. Our results show that the Libfabric conduit improves communication bandwidth on the socket provider by up to 42% for put operations and 11% for get operations. In addition, our implementation of atomic operations achieves latency similar to that of the Sandia implementation. |
12:00 PM |
The OpenFAM API: a programming model for disaggregated persistent memory Recent technology advances in high-density, byte-addressable non-volatile memory (NVM) and low-latency interconnects have enabled building large-scale systems with a large disaggregated fabric-attached memory (FAM) pool shared across heterogeneous and decentralized compute nodes. In this model, compute nodes are decoupled from FAM, which allows separate evolution and scaling of processing and fabric-attached memory. The large capacity of the FAM pool means that large working sets can be maintained as in-memory data structures. The fact that all compute nodes share a common view of memory means that data sharing and communication may be done efficiently through shared memory, without requiring explicit messages to be sent over heavyweight network protocol stacks. Additionally, data sets no longer need to be partitioned between compute nodes, as is typically done in clustered environments. Any compute node can operate on any data item, which enables more dynamic and flexible load balancing. The OpenFAM API is an API for programming with persistent FAM that is inspired by the OpenSHMEM partitioned global address space (PGAS) model. Unlike OpenSHMEM, where each node contributes local memory toward a logically shared global address space, FAM isn't associated with a particular node and can be addressed directly from any node without the cooperation or involvement of another node. The OpenFAM API enables programmers to manage memory allocations, access FAM-resident data structures, and order FAM operations. Because state in FAM can survive program termination, the API also provides interfaces for naming and managing data beyond the lifetime of a single program invocation. |
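As a rough sketch of the programming style the abstract describes, the fragment below illustrates allocating, rediscovering, and accessing fabric-attached memory by name. The type and function names here are hypothetical stand-ins invented for illustration, not the actual OpenFAM API:

```c
/* Hypothetical sketch of the disaggregated-FAM style described above.
 * All names below (fam_desc_t, fam_alloc_named, fam_lookup_named,
 * fam_get, fam_put, fam_fence) are illustrative stand-ins, NOT the
 * real OpenFAM API. */
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t region_id; uint64_t offset; } fam_desc_t; /* hypothetical */

fam_desc_t fam_alloc_named(const char *name, size_t nbytes); /* persists past exit */
fam_desc_t fam_lookup_named(const char *name);               /* rediscover by name */
void fam_get(void *local, fam_desc_t d, size_t off, size_t n);
void fam_put(fam_desc_t d, size_t off, const void *local, size_t n);
void fam_fence(void);                                        /* order FAM operations */

void example(void) {
    /* Any compute node can address this allocation directly, without the
     * cooperation of another node -- unlike OpenSHMEM, where memory is
     * contributed by and associated with a particular PE. */
    fam_desc_t table = fam_lookup_named("checkpoint/table");
    double row[128];
    fam_get(row, table, 0, sizeof row);  /* read FAM-resident data     */
    row[0] += 1.0;
    fam_put(table, 0, row, sizeof row);  /* write back; state survives */
    fam_fence();                         /* order before later writes  */
}
```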
12:30 PM |
Working Lunch |
Session 2: Simulators, Emulators, and OpenSHMEM Collectives |
|
1:30 PM |
Towards Lightweight and Scalable Simulation of Large-Scale OpenSHMEM Applications |
2:00 PM |
Scaling OpenSHMEM for Massively Parallel Processor Arrays The use of OpenSHMEM has traditionally focused on supporting a one-sided communication mechanism between networked processors. The US Army Research Laboratory (ARL) OpenSHMEM implementation for the Epiphany architecture has highlighted the utility of OpenSHMEM for the precise control of on-die data movement within arrays of RISC cores connected by a 2D mesh Network on Chip (NoC), and was demonstrated using a 16-core Epiphany-III co-processor. More recently, DARPA has fabricated a much larger 64-bit 1,024-core Epiphany-V device, which ARL is presently evaluating. In support of this effort, we have developed an Epiphany-based RISC SoC device emulator that can be installed as a virtual device on an ordinary x86 platform and utilized with the existing software stack used to support physical devices, thus creating a seamless software development environment capable of targeting new processor designs just as they would be interfaced on a real platform. As massively parallel processor arrays (MPPAs) emerge as a strong contender for future exascale architectures, we investigate the application of OpenSHMEM as a programming model for processors with hundreds to thousands of cores. In this work we report on the initial results from scaling up the ARL OpenSHMEM implementation using virtual RISC processors with much larger core counts than previous physical devices. |
2:30 PM |
Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM Graphics Processing Units (GPUs) are well-known for their massive parallelism and high bandwidth memory for data-intensive applications. In this context, GPU-based In-Memory Key-Value (G-IMKV) Stores have been proposed to take advantage of GPUs' capability to achieve high-throughput data processing. The state-of-the-art frameworks in this area batch requests on the CPU at the server before launching a compute kernel to process operations on the GPU. They also use explicit data movement operations between the CPU and GPU. The startup overhead of compute kernel launches and memory copies limits the throughput of these frameworks unless operations are batched into large groups. In this paper, we propose the use of persistent GPU compute kernels and of OpenSHMEM to maximize GPU and network utilization with smaller batch sizes. This also helps improve the response time observed by clients while still achieving high throughput at the server. Specifically, clients and servers use OpenSHMEM primitives to move data by avoiding copies, and the server interacts with a persistently running compute kernel on its GPU to delegate various key-value store operations efficiently to streaming multi-processors. |
3:00 PM |
SHCOLL - a Standalone Implementation of OpenSHMEM-style Collectives API The performance of collective operations has a large impact on overall performance in many HPC applications. Implementing multiple algorithms and selecting the optimal one based on message size and the number of processes involved in the operation is essential for good performance. In this paper, we present SHCOLL, a collective routines library developed on top of OpenSHMEM point-to-point operations: puts, gets, atomic memory updates, and memory synchronization routines. The library is designed to serve as a plug-in to OpenSHMEM implementations and will be used by the OSSS OpenSHMEM reference implementation to support OpenSHMEM collective operations. We describe the algorithms incorporated in the implementation of each OpenSHMEM collective routine and evaluate them on a Cray XC30 system. For long messages, SHCOLL shows an improvement of up to a factor of 12 over the vendor's implementation. We also discuss future development of the library, as well as how it will be incorporated into the OSSS OpenSHMEM reference implementation. |
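To illustrate the layering the abstract describes, a collective can be expressed purely in terms of OpenSHMEM point-to-point and synchronization primitives. The deliberately naive linear broadcast below is a hedged sketch of that idea, not SHCOLL's actual code (SHCOLL selects among several, more scalable algorithms):

```c
#include <shmem.h>
#include <string.h>

/* A simple linear broadcast built only from OpenSHMEM point-to-point
 * primitives, in the spirit of SHCOLL's layering. Illustrative sketch
 * only; a real library would also offer tree- and ring-based variants. */
void broadcast_linear(void *dest, const void *source, size_t nbytes, int root) {
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    if (me == root) {
        for (int pe = 0; pe < npes; pe++)
            if (pe != root)
                shmem_putmem(dest, source, nbytes, pe);  /* one-sided put */
        shmem_quiet();        /* ensure all puts have been delivered */
        memcpy(dest, source, nbytes);
    }
    shmem_barrier_all();      /* non-root PEs wait for delivery */
}
```

The linear algorithm costs O(npes) puts at the root, which is exactly the kind of tradeoff (message size versus process count) that motivates implementing and selecting among multiple algorithms.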
3:30 PM |
Coffee Break |
Session 3: OpenSHMEM with Data Intensive Computation |
|
4:00 PM |
HOOVER: Distributed, Flexible, and Scalable Streaming Graph Processing on OpenSHMEM Many problems can benefit from being phrased as graph processing or graph analytics problems: infectious disease modeling, insider threat detection, fraud prevention, social network analysis, and more. These problems all share a common property: the relationships between entities in these systems are crucial to understanding the overall behavior of the systems themselves. However, these relationships are rarely, if ever, static. As our ability to collect information on them improves (e.g., on financial transactions in fraud prevention), the value added by large-scale, high-performance, dynamic/streaming (rather than static) graph analysis becomes significant. This paper introduces HOOVER, a distributed software framework for large-scale, dynamic graph modeling and analysis. HOOVER sits on top of OpenSHMEM, a PGAS programming system, and enables users to plug in application-specific logic while handling all runtime coordination of computation and communication. HOOVER has demonstrated scaling out to 24,576 cores and is flexible enough to support a wide range of graph-based applications, including infectious disease modeling and anomaly detection. |
4:30 PM |
Tumbling Down the GraphBLAS Rabbit Hole with SHMEM In this talk we present shgraph, a SHMEM implementation of the GraphBLAS standard, which enables the user to redefine complex graph algorithms in terms of simple linear algebra primitives. It offers many nice features such as type abstraction, the ability to perform generalized matrix/vector operations over a semiring, and executing graph operations out-of-order (non-blocking mode). |
Wednesday

8:00 AM |
Working Breakfast and Registration |
|
9:00 AM |
Welcome and Introduction of Invited Talk: Manjunath Gorentla Venkata, ORNL |
|
Session 4: Invited Talk |
9:15 AM |
Invited Talk: Formalising the ARMv8 Memory Consistency Model Armv8 introduced a radical change to the memory consistency model of the architecture by requiring that a store to memory become visible to all other threads at the same time. This property, known as other-multicopy atomicity, simplifies the memory model definition and supports straightforward, compositional reasoning about concurrent programs. The memory model is now specified such that the architectural text maps directly to an executable, axiomatic model which can be used to verify properties of both concurrent software and processor designs. |
|
10:15 AM |
Break |
|
Session 5: Use of OpenSHMEM Applications and Profilers |
|
|
10:30 AM |
Tracking Memory Usage in OpenSHMEM Runtimes with the TAU Performance System As the exascale era approaches, it is becoming increasingly important that runtimes be able to scale to very large numbers of processing elements. However, by keeping arrays of sizes proportional to the number of PEs, an OpenSHMEM implementation may be limited in its scalability to millions of PEs. In this paper, we describe techniques for tracking memory usage by OpenSHMEM runtimes, including attributing memory usage to runtime objects according to type, maintaining data about hierarchical relationships between objects, and identifying the source lines on which allocations occur. We implement these techniques in the TAU Performance System using atomic and context events and demonstrate their use in OpenSHMEM applications running within the Open MPI runtime, collecting both profile and trace data. We describe how we will use these tools to identify memory scalability bottlenecks in OpenSHMEM runtimes. |
|
11:00 AM |
Lightweight Instrumentation and Analysis using OpenSHMEM Performance Counters |
|
11:30 AM |
Oak Ridge OpenSHMEM Benchmark Suite The assessment of application performance is a fundamental task in high-performance computing (HPC). The OpenSHMEM Benchmark (OSB) suite is a collection of micro-benchmarks and mini-applications/compute kernels that have been ported to use OpenSHMEM. Some, like the NPB OpenSHMEM benchmarks, have been published before, while most others have been used for evaluations but never formally introduced or discussed. This suite brings them together and is useful for assessing the performance of different use cases of OpenSHMEM. It offers system implementers a useful means of measuring performance and assessing the effects of new features as well as implementation strategies. The suite is also useful for application developers to assess the performance of the growing number of OpenSHMEM implementations that are emerging. In this paper, we describe the current set of codes available within the OSB suite, how they are intended to be used, and, where possible, a snapshot of their behavior on one of the OpenSHMEM implementations available to us. We also include detailed descriptions of every benchmark and kernel, focusing on how OpenSHMEM was used. This includes details on the enhancements we made to the benchmarks to support multithreaded variants. We encourage the OpenSHMEM community to use, review, and provide feedback on the benchmarks. |
|
12:00 PM |
OpenSHMEM Sets and Groups: An Approach to Worksharing and Memory Management Collective operations in the OpenSHMEM programming model are defined over an active set, which is a grouping of Processing Elements (PEs) based on a triple of information including the starting PE, a log2 stride, and the size of the active set. In addition to the active set, collectives require Users to allocate and initialize synchronization (i.e., pSync) and scratchpad (i.e., pWrk) buffers for use by the collective operations. While active sets and the user-defined buffers were previously useful based on hardware and algorithmic considerations, future systems and applications require us to re-evaluate these concepts. In this paper, we propose Sets and Groups as abstractions to create persistent, flexible groupings of PEs (i.e., Sets) and couple these groups of PEs with memory spaces (i.e., Groups), which remove the allocation and initialization burden from the User. To evaluate Sets and Groups, we perform multiple micro-benchmarks to determine the overhead of these abstractions and demonstrate their utility by implementing a distributed All-Pairs Shortest Path (APSP) application, which we evaluate using multiple synthetic and real-world graphs. |
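To make concrete the legacy interface this paper re-evaluates, the sketch below uses the standard active-set collective API, where the caller supplies the (PE_start, logPE_stride, PE_size) triple plus a user-allocated and user-initialized pSync buffer. It is a minimal illustration, assuming an even number of PEs, and requires an OpenSHMEM runtime to run:

```c
#include <shmem.h>

/* pSync must be symmetric and initialized by the user before use --
 * exactly the burden that Sets and Groups aim to remove. */
long pSync[SHMEM_BCAST_SYNC_SIZE];

int main(void) {
    shmem_init();

    for (int i = 0; i < SHMEM_BCAST_SYNC_SIZE; i++)
        pSync[i] = SHMEM_SYNC_VALUE;

    static long src = 7, dst = 0;     /* symmetric variables */
    shmem_barrier_all();              /* make pSync init visible everywhere */

    /* Active set = every even-numbered PE: start 0, log2 stride 1
     * (i.e., stride 2), size n_pes/2 (assumes an even PE count). */
    if (shmem_my_pe() % 2 == 0)
        shmem_broadcast64(&dst, &src, 1, /*PE_root=*/0,
                          /*PE_start=*/0, /*logPE_stride=*/1,
                          /*PE_size=*/shmem_n_pes() / 2, pSync);

    shmem_finalize();
    return 0;
}
```

Note that the triple can only express power-of-two-strided groupings and that every collective call site must manage its own pSync lifetime, which motivates the persistent, flexible abstractions the paper proposes.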
|
12:30 PM |
Working Lunch |
|
1:30 PM |
OpenSHMEM Specification and Teams/Collectives Working Group Meeting |
|
|
3:30 PM |
Coffee Break |
|
|
4:00 PM |
OpenSHMEM Specification and Teams/Collectives Working Group Meeting (Continued) |
|
|
5:00 PM |
Adjourn for the day |
|
Thursday

8:00 AM |
Working Breakfast and Registration |
9:00 AM |
OpenSHMEM Specification Meeting |
|
10:30 AM |
Break |
11:00 AM |
OpenSHMEM Specification Meeting |
|
12:00 PM |
Closing Remarks and Adjourn: Neena Imam, Oak Ridge National Laboratory |