

# To Exascale And Beyond: Intel's Scalable System Framework and OpenSHMEM

James Dinan, Extreme Scale Software Pathfinding Team

OpenSHMEM Workshop, August 2016

### Legal Notices and Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit <a href="http://www.intel.com/performance">http://www.intel.com/performance</a>.

Intel, the Intel logo, Xeon and Xeon Phi and others are trademarks of Intel Corporation in the U.S. and/or other countries. \*Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.



# **INTEL'S SCALABLE SYSTEM FRAMEWORK**

A design foundation enabling a wide range of highly workload-optimized solutions



- Small Clusters Through Supercomputers
- Compute and Data-Centric Computing
- Standards-Based Programmability
- On-Premise and Cloud-Based

| Intel <sup>®</sup> Xeon <sup>®</sup> Processors | Intel <sup>®</sup> True Scale Fabric |
|-------------------------------------------------|--------------------------------------|
| Intel® Xeon Phi™<br>Coprocessors                | Intel® Omni-Path<br>Architecture     |
| Intel® Xeon Phi™ Processors                     | Intel® Ethernet                      |

- Intel<sup>®</sup> SSDs
- Intel<sup>®</sup> Lustre-based Solutions
- Intel<sup>®</sup> Silicon Photonics Technology
- Intel<sup>®</sup> Software Tools
- HPC Scalable Software Stack
- Intel<sup>®</sup> Cluster Ready Program



Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit






## Intel<sup>®</sup> Omni-Path Architecture

Evolutionary Approach, Revolutionary Features, End-to-End Solution



#### Building on the industry's best technologies

- Builds heavily on existing Aries and Intel<sup>®</sup> True Scale Fabric technologies
- Adds innovative new features and capabilities to improve performance, reliability, and QoS
- Re-uses existing OpenFabrics Alliance\* software

#### **Robust product offerings and ecosystem**

- End-to-end Intel product line
- >100 OEM designs<sup>1</sup>
- Strong ecosystem with 70+ Fabric Builders members

<sup>1</sup> Source: Intel internal information. Design win count based on OEM and HPC storage vendors who are planning to offer either Intel-branded or custom switch products, along with the total number of OEM platforms that are currently planned to support custom and/or standard Intel® OPA adapters. Design win count as of November 1, 2015 and subject to change without notice based on vendor product plans. \*Other names and brands may be claimed as property of others.



### Intel® Omni-Path Architecture Network Layers

#### Layer 1 – Physical Layer

- Leverages existing Ethernet and InfiniBand\* PHY standards

#### Layer 1.5 – Link Transfer Protocol

- Provides reliable delivery of Layer 2 packets, flow control, and link control across a single link

#### Layer 2 – Data Link Layer

- Provides fabric addressing, switching, resource allocation, and partitioning support

#### Layers 4-7 – Transport to Application Layers

- Provide interfaces between software libraries and HFIs




### Intel<sup>®</sup> Omni-Path Architecture Link Transfer Layer



#### **PiP (Packet Integrity Protection) – Link error detection/correction in units of LTPs**


# **New Intel® OPA Fabric Features:** Fine-grained Control Improves Resiliency and Optimizes Traffic Movement

|                                | Description                                                                                                                                                                                                                                              | Benefits                                                                                                                                                                                                         |
|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Traffic Flow<br>Optimization   | <ul> <li>Optimizes Quality of Service (QoS) in mixed<br/>traffic environments, such as storage and MPI</li> <li>Transmission of lower-priority packets can be<br/>paused so higher priority packets can be<br/>transmitted</li> </ul>                    | <ul> <li>Ensures high priority traffic is not<br/>delayed →Faster time to solution</li> <li>Deterministic latency → Lowers run-to-<br/>run timing inconsistencies</li> </ul>                                     |
| Packet Integrity<br>Protection | <ul> <li>Allows for rapid and transparent recovery<br/>of transmission errors on an Intel<sup>®</sup> OPA link<br/>without additional latency</li> <li>Resends 1056-bit bundle w/errors only instead<br/>of entire packet (based on MTU size)</li> </ul> | <ul> <li>Fixes happen at the link level rather<br/>than end-to-end level</li> <li>Much lower latency than Forward<br/>Error Correction (FEC) defined in the<br/>InfiniBand* specification<sup>1</sup></li> </ul> |
| Dynamic Lane<br>Scaling        | <ul> <li>Maintains link continuity in the event of a failure<br/>of one or more physical lanes</li> <li>Operates with the remaining lanes until the<br/>failure can be corrected at a later time</li> </ul>                                              | <ul> <li>Enables a workload to continue to<br/>completion. Note: InfiniBand will shut<br/>down the entire link in the event of a<br/>physical lane failure</li> </ul>                                            |

<sup>1</sup> Lower latency based on the use of InfiniBand with Forward Error Correction (FEC) Mode A or C in the public presentation titled "Option to Bypass Error Marking (supporting comment #205)," authored by Adee Ran (Intel) and Oran Sela (Mellanox), January 2013. Mode A modeled to add as much as 140ns latency above baseline, and Mode C can add up to 90ns latency above baseline. Link: www.ieee802.org/3/bj/public/jan13/ran\_3bj\_01a\_0113.pdf
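To give a rough sense of the retransmission saving described for Packet Integrity Protection, the sketch below compares the 1056-bit bundle that is resent on error with a full MTU-sized packet. The MTU values are illustrative assumptions and do not come from this deck. Packet headers and link framing are ignored, so the numbers are only an approximation.

```python
LTP_BITS = 1056  # size of the bundle resent on error, from the table above


def resend_fraction(mtu_bytes: int) -> float:
    """Fraction of an MTU-sized packet that a single 1056-bit resend represents.

    Simplification (assumption): ignores packet headers and link-level framing.
    """
    if mtu_bytes <= 0:
        raise ValueError("mtu_bytes must be positive")
    return LTP_BITS / (mtu_bytes * 8)


if __name__ == "__main__":
    # Hypothetical MTU sizes, chosen only for illustration.
    for mtu in (2048, 4096, 8192):
        share = resend_fraction(mtu)
        print(f"MTU {mtu:5d} B: one resend is {share:.1%} of the packet")
```

The larger the MTU, the smaller the share of the packet that has to cross the link again, which is the effect the table attributes to link-level recovery.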



### Intel<sup>®</sup> Omni-Path Fabric Link Level Innovation: Dynamic Lane Scaling (DLS) Traffic Protection



#### **User Setting (per Fabric):**

- Set the maximum allowable degrade option:
  - 4x: any lane failure causes a link reset or takedown
  - 3x: still operates at degraded bandwidth (75 Gbps)
  - 2x: still operates at degraded bandwidth (50 Gbps)
  - 1x: still operates at degraded bandwidth (25 Gbps)
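The degrade setting above can be read as a minimum lane width that the link is allowed to fall back to. The sketch below models that reading. The 25 Gbps per-lane figure is inferred from the 75/50/25 Gbps values on the slide, and the function name and the "0 means link down" convention are hypothetical choices made for illustration.

```python
LANE_GBPS = 25   # per-lane rate implied by the 75/50/25 Gbps figures above
FULL_WIDTH = 4   # a healthy link runs 4 lanes


def link_bandwidth_gbps(active_lanes: int, min_width: int) -> int:
    """Usable bandwidth in Gbps, or 0 if the link would be reset or taken down.

    min_width is the user's degrade setting (4, 3, 2 or 1): the smallest
    lane count at which the link may keep running.
    """
    if not 1 <= min_width <= FULL_WIDTH:
        raise ValueError("min_width must be between 1 and 4")
    if not 0 <= active_lanes <= FULL_WIDTH:
        raise ValueError("active_lanes must be between 0 and 4")
    if active_lanes < min_width:
        return 0  # below the allowed degrade level: link goes down
    return active_lanes * LANE_GBPS


if __name__ == "__main__":
    for setting in (4, 3, 2, 1):
        row = [link_bandwidth_gbps(n, setting) for n in (4, 3, 2, 1)]
        print(f"setting {setting}x -> bandwidth with 4/3/2/1 lanes: {row}")
```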

#### Link Recovery:

- PIP is used to recover the link without a reset (an Intel<sup>®</sup> OPA innovation)
- Intel<sup>®</sup> OPA keeps passing data at reduced bandwidth, with link recovery via PIP




#### Intel<sup>®</sup> Omni-Path Fabric Link Level Innovation: **Traffic Flow Optimization (TFO) - Preemption**





### Knights Landing: Next Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor



First **self-boot** Intel® Xeon Phi<sup>™</sup> processor that is **binary compatible** with main line IA. Boots standard OS.

Significant improvement in scalar and vector performance

Integration of **Memory on package**: innovative memory architecture for high bandwidth and high capacity

Integration of Fabric on package

Potential future options subject to change without notice. All timeframes, features, products and dates are preliminary forecasts and subject to change without further notification.



### **Knights Landing Overview**





- Chip: 36 tiles interconnected by a 2D mesh
- Tile: 2 cores + 2 VPUs/core + 1 MB L2
- Memory:
  - MCDRAM: 16 GB on-package; high bandwidth
  - DDR4: 6 channels @ 2400, up to 384 GB
- IO: 36 lanes PCIe\* Gen3; 4 lanes of DMI for chipset
- Node: 1-socket only
- Fabric: Omni-Path on-package (not shown)
- Vector<sup>1</sup>: up to 2 TF/s Linpack/DGEMM; 4.6 TF/s SGEMM
- Streams Triad<sup>1</sup>: MCDRAM up to 490 GB/s; DDR4 90 GB/s
- Scalar<sup>2</sup>: up to ~3x over the current Intel® Xeon Phi<sup>™</sup> co-processor 7120 ("Knights Corner")
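A few derived figures follow directly from the numbers on this slide. The sketch below computes the chip-level totals and a simple bytes-per-flop ratio for each memory type. The bytes-per-flop metric is an illustrative addition, not something the slide states. Note that the totals describe the full 36-tile chip, while the measured configuration listed in the footnotes uses a 68-core part.

```python
# Figures taken from the overview slide.
TILES = 36
CORES_PER_TILE = 2
VPUS_PER_CORE = 2
L2_MB_PER_TILE = 1
DGEMM_TFLOPS = 2.0                               # "up to 2 TF/s"
STREAM_GB_PER_S = {"MCDRAM": 490.0, "DDR4": 90.0}


def chip_totals() -> dict:
    """Totals for a full chip with all 36 tiles enabled."""
    cores = TILES * CORES_PER_TILE
    return {
        "cores": cores,
        "vpus": cores * VPUS_PER_CORE,
        "l2_mb": TILES * L2_MB_PER_TILE,
    }


def bytes_per_flop(memory: str) -> float:
    """STREAM bandwidth divided by the double-precision DGEMM rate."""
    return STREAM_GB_PER_S[memory] / (DGEMM_TFLOPS * 1000.0)


if __name__ == "__main__":
    print(chip_totals())
    for name in STREAM_GB_PER_S:
        print(f"{name}: {bytes_per_flop(name):.3f} bytes/flop")
    ratio = STREAM_GB_PER_S["MCDRAM"] / STREAM_GB_PER_S["DDR4"]
    print(f"MCDRAM vs DDR4 bandwidth: {ratio:.1f}x")
```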

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. Configurations:

<sup>1</sup> Intel Xeon Phi processor 7250 (16GB, 1.4 GHz, 68-cores) running LINPACK (score 2000 GFLOPS), DGEMM (score 2070 GFLOPS), SGEMM (4605 GFLOPS), STREAM (DDR4 = 90 GB/s and MCDRAM = 490 GB/s), 96 GB DDR4-2133 memory, BIOS R00.RC085, Cluster Mode = Quad, MCDRAM Flat or Cache, RHEL\* 7.0, MPSP 1.2.2, Intel MKL 11.3.2, Intel MPI 5.1.2, DGEMM 20K x 20K, LINPACK 100K x 100K size

<sup>2</sup> Intel estimates based on estimated 1-user SPECint\*\_rate\_base2006 comparing configuration 1 to Intel Xeon Phi co-processor 7120A hosted on 2x Intel Xeon processor E5-2697 v3.

# **Knights Landing Products**





# Intel ISA

#### **KNL** implements all legacy instructions

#### **AVX-512 Extensions**

- 512-bit FP/Integer Vectors
- 32 vector registers and 8 mask registers
- Gather/Scatter
- Conflict Detection: Improves Vectorization
- Prefetch: Gather and Scatter Prefetch
- Exponential and Reciprocal Instructions
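As a rough illustration of what 512-bit vectors mean for throughput, the sketch below counts elements per vector register and estimates a theoretical peak. The peak formula assumes one fused multiply-add (2 flops) per lane per cycle on every VPU and a clock held at the nominal frequency. Both are assumptions made for this example, so the result is an upper bound and not a figure from the deck. The 68-core, 1.4 GHz inputs come from the measured configuration listed in the footnotes.

```python
VECTOR_BITS = 512


def lanes(element_bits: int) -> int:
    """Number of elements that fit in one 512-bit vector register."""
    if element_bits <= 0 or VECTOR_BITS % element_bits != 0:
        raise ValueError("element size must evenly divide 512 bits")
    return VECTOR_BITS // element_bits


def peak_gflops(cores: int, ghz: float, vpus_per_core: int,
                element_bits: int, flops_per_lane_per_cycle: int = 2) -> float:
    """Theoretical peak in GFLOPS.

    Assumption: each VPU retires one FMA (2 flops) per lane per cycle and
    the clock stays at the nominal frequency under vector load.
    """
    return (cores * ghz * vpus_per_core
            * lanes(element_bits) * flops_per_lane_per_cycle)


if __name__ == "__main__":
    print("doubles per vector:", lanes(64))
    print("floats per vector: ", lanes(32))
    dp = peak_gflops(cores=68, ghz=1.4, vpus_per_core=2, element_bits=64)
    print(f"assumed DP peak: {dp:.0f} GFLOPS")
```

Comparing this bound with the measured DGEMM score in the footnotes gives a sense of achieved efficiency, subject to the assumptions above.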


# Three Memory Modes, Selected at Boot




### KNL Mesh Interconnect – Mesh of Rings



#### **Three Cluster Modes:**

1. All-to-All: No affinity between Tile, Directory and Memory
2. Quadrant: Affinity between Directory and Memory. Default mode. SW transparent
3. Sub-NUMA Clustering: Affinity between Tile, Directory, Memory. SW visible

### **Observations for OpenSHMEM**

On-node communication is growing in importance

- OpenSHMEM can and should address both off-node and on-node communication needs
- PGAS can provide an alternative to coherence

Hybrid processes + threads programming is already important

- Threads have an advantage with sharing on-node resources
- OpenSHMEM is late to the party; let's bring something good!

Heterogeneous memory has arrived

- HBW, DDR, large pages, NUMA, nonvolatile memory, ...
- API should enable access to diverse memory technologies and allow users to control data placement and locality







### Why Thread Safety Is Not Enough

Saturating the fabric with small messages

- Message rate of 160M/sec
- Processor with 72 cores
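- Assuming all cores are sending messages, each core's time budget per message is:
  - $T = T_{\text{inject}} + T_{\text{compute}} = \dfrac{72}{160 \times 10^{6}\ \text{s}^{-1}} = 450\ \text{ns}$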
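The budget above can be reproduced with a few lines of arithmetic. The sketch below is a minimal illustration with hypothetical function names. It shows how little time each core has per message once injection and compute are both counted, which is why any extra per-message cost on the critical path matters.

```python
def per_core_budget_ns(cores: int, fabric_msgs_per_sec: float) -> float:
    """Time each core may spend per message (inject + compute), in ns,
    if all cores together are to saturate the fabric message rate."""
    if cores <= 0 or fabric_msgs_per_sec <= 0:
        raise ValueError("cores and message rate must be positive")
    return cores * 1e9 / fabric_msgs_per_sec


def per_core_rate(cores: int, fabric_msgs_per_sec: float) -> float:
    """Messages per second each core must sustain."""
    if cores <= 0 or fabric_msgs_per_sec <= 0:
        raise ValueError("cores and message rate must be positive")
    return fabric_msgs_per_sec / cores


if __name__ == "__main__":
    cores, rate = 72, 160e6  # figures from the slide
    print(f"budget per message: {per_core_budget_ns(cores, rate):.0f} ns")
    print(f"rate per core: {per_core_rate(cores, rate) / 1e6:.2f} M msgs/s")
```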



- OpenSHMEM threading extensions must not burden critical paths with overheads such as:
  - Taking a mutex
  - Accessing thread-local storage
  - Issuing an atomic operation

#### Contexts were designed to integrate threads while avoiding these overheads

Opinions expressed are those of the speaker and do not necessarily reflect the views of Intel Corp.






### **Contexts: You Want Them**



