# A Perspective on The Path Forward (Why I'm not too worried about Exascale)

Steve Scott Cray CTO

SOS20 March 23, 2016

COMPUTE | STORE | ANALYZE

# **Technology Drives Architecture**



Cray 1, 1976

- ECL 5/4 NAND gate ICs (95%)
- 75K gates (3400 PCBs!)
- RISC design
- Vector ISA
- Memory latency 11 clocks



Intel Pentium, 1993

- CMOS VLSI IC
- 3M transistors
- CISC design
- Scalar ISA
- Deep pipelines, complex predictions




#### Intel Pentium 4 Cedar Mill, 2006

- 184M transistors!
- Very CISC design
- 31-stage pipeline
- 3.6 GHz in 65nm
- Last of its breed....

Copyright 2016 Cray Inc.


# And then Dennard scaling ended...

**Power constrained** 

**Communication much more expensive than Computation** 


#### New Processor Landscape: Driven by Power Efficiency



# **On-Package Memory Can Restore Balance**

- Standard DDR memory BW has not kept pace with CPUs
- Expect processors to adopt stacked, on-package memory
- HBM:
  - 10x higher BW, 10x less energy/bit
  - Much lower latency
  - Costs ~2x DDR4 per bit
  - JEDEC standard with multiple sources





# May drive us to smaller, simpler nodes that are balanced with on-package memory


# **Deeper Memory and Storage Hierarchy**



# **Storage Will Scale**

#### • APEX requirement: Time to checkpoint 80% memory < (0.005)\*JMTTI

- Extrapolate to an Exascale system:
  - Assume saving 80% of 32 PB of memory and a JMTTI of 10 hours $\Rightarrow$ requires checkpoint bandwidth of ~150 TB/s (doable with distributed Flash)
- Primary resiliency issue is dealing with *undetected* errors...
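
The ~150 TB/s figure can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation from the numbers quoted above, reading JMTTI as job mean time to interrupt; the function name `checkpoint_bandwidth` and the use of decimal units (1 PB = 10^15 bytes) are assumptions made for illustration.

```python
# Checkpoint-bandwidth estimate from the figures quoted on the slide.
PB = 1e15  # bytes per petabyte (decimal units assumed)
TB = 1e12  # bytes per terabyte


def checkpoint_bandwidth(memory_bytes, fraction, jmtti_seconds, budget=0.005):
    """Bandwidth (bytes/s) needed to write `fraction` of memory
    within `budget` * JMTTI seconds."""
    window = budget * jmtti_seconds  # allowed checkpoint time in seconds
    return fraction * memory_bytes / window


# 80% of 32 PB, JMTTI of 10 hours, checkpoint budget of 0.5% of JMTTI.
bw = checkpoint_bandwidth(32 * PB, 0.80, 10 * 3600)
print(f"checkpoint window:  {0.005 * 10 * 3600:.0f} s")
print(f"required bandwidth: {bw / TB:.0f} TB/s")
```

The window works out to 180 seconds and the bandwidth to roughly 142 TB/s, which the slide rounds to ~150 TB/s.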

#### Storage latencies dropping faster than compute increasing

- Flash is O(100x) faster than disk
- NVRAM is O(100x) faster than Flash

#### • But there's lots of work to do on storage architecture...

- Reducing software overheads for Flash and NVRAM timescales
- Metadata scaling and resiliency (relax POSIX consistency?)
- Namespace flexibility
- Support for non-POSIX file systems (KVS, NoSQL, Spark RDDs, etc.)





# **Cost- and Power-Efficient Networks**

- Cray pioneered the use of high radix routers in HPC
  - Became optimal due to technology shift
    - Faster signaling permits narrower links
  - Reduced network diameter (number of hops)
    - $\Rightarrow$  Lower latency and cost
  - But... higher-radix networks require longer cable lengths
- Optics enables longer cable lengths
  - Now cost-effective above a few meters (and dropping)
  - Cost, bandwidth and power are insensitive to cable length
- Future systems will be based on hybrid, electrical-optical networks
  - Cost-effective, scalable global bandwidth
  - Very low network diameter (small number of hops) ⇒ very energy efficient



First 64-port router: Cray X2 (2005)

# Example Dragonfly Network with a 64-port Switch



- Scales to 279K endpoints, with a network diameter of 3 hops!
- Only a *single* hop over a long (optical) link
- Narrow links allow sliced network for configurable bandwidth
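
The 279K figure is consistent with the usual sizing rules for a fully connected dragonfly: each group of `a` switches reaches every other group over one global link, so a minimal route is local, global, local. The sketch below applies those rules; the split of the 64 ports into 16 endpoint, 32 intra-group, and 16 global ports is an assumption (the slide does not state it), chosen because it reproduces the quoted scale. The function name `dragonfly_size` is likewise illustrative.

```python
def dragonfly_size(p, a, h):
    """Return (radix, groups, endpoints) for a fully connected dragonfly.

    p: endpoints per switch
    a: switches per group (all-to-all connected inside the group)
    h: global (inter-group) links per switch
    """
    radix = p + (a - 1) + h   # endpoint + intra-group + global ports
    groups = a * h + 1        # one global link to every other group
    endpoints = p * a * groups
    return radix, groups, endpoints


# Assumed port split for a 64-port switch (not stated on the slide).
radix, groups, endpoints = dragonfly_size(p=16, a=33, h=16)
print(f"radix={radix}, groups={groups}, endpoints={endpoints}")
```

With this split the switch uses all 64 ports and the network reaches 529 groups and 279,312 endpoints, matching the "279K" on the slide.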


## **Next-Gen Shasta System Infrastructure**

#### • Single system with choice of:

- Cabinet type and cooling infrastructure
- Processor type
- Software stack
- Interconnect

#### • Extensible to Exascale and Beyond

- Power & cooling headroom
- Network and processor configurability





# **Summary of Future Machines**

- Computers are not getting faster... just wider
  - O(EF) with O(GHz) clocks $\rightarrow$ O(B)-way (billion-way) parallelism!
- Vertical locality much more important than horizontal locality

| Dimension    | Latency Hit | Bandwidth Hit | Energy Hit |
|--------------|-------------|---------------|------------|
| Within node  | ~200x       | ~200x         | > 500x     |
| Across nodes | ~25x        | ~8x           | ~5x        |

\* If local NVM is included, the within-node hit grows and the across-node hit shrinks

#### • Parallelism is multi-dimensional (and heterogeneous?)

- Vectorization + threading + multi-node
- Processors optimized for serial performance *or* power efficiency (not both)
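
The billion-way figure in the summary is simply an exaflop divided by a gigahertz clock, and the three dimensions listed here have to multiply up to it. The sketch below shows the arithmetic; the node count, threads per node, and vector width are hypothetical values picked only to illustrate the decomposition, not a description of any real or planned system.

```python
EXAFLOP = 1e18   # operations per second at O(EF)
CLOCK_HZ = 1e9   # O(GHz) clock

# Operations that must complete on every clock cycle.
required = EXAFLOP / CLOCK_HZ

# Hypothetical machine shape (illustrative numbers only).
nodes = 100_000           # multi-node dimension
threads_per_node = 625    # threading dimension
vector_lanes = 16         # vectorization dimension
available = nodes * threads_per_node * vector_lanes

print(f"required concurrency:  {required:.1e}")
print(f"available concurrency: {available:.1e}")
```

Shrinking any one dimension has to be made up in the others, which is why all three levels need to be exploited together.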

#### Interconnects won't look that different than today



# **Implications for Programmers**

#### • May need to move to more threading on the node

- All-MPI often won't deliver maximum performance

#### • Must vectorize low-level loops

- 8-30x performance improvement on array operations

#### Must avoid serial scalar code

- Inherently slower and less power-efficient
- On "accelerated" nodes, either
  - creates traffic between accelerator and host, or
  - runs 3-4x slower than on a serial-optimized core
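
A rough Amdahl-style model shows why even a little serial scalar code matters. In the sketch below, the 3.5x serial slowdown is taken from the middle of the 3-4x range above; the 20x gain on the parallel work and the serial fractions are hypothetical, and the function name `accelerated_speedup` is illustrative.

```python
def accelerated_speedup(serial_fraction, serial_slowdown, parallel_gain):
    """Overall speedup relative to a serial-optimized core.

    serial_fraction: share of baseline runtime spent in serial scalar code
    serial_slowdown: how much slower that code runs on the accelerated node
    parallel_gain:   speedup of the remaining (parallel) work
    """
    t = serial_fraction * serial_slowdown + (1 - serial_fraction) / parallel_gain
    return 1 / t


# Assumed: 3.5x serial slowdown (slide range 3-4x), 20x parallel gain (hypothetical).
for s in (0.0, 0.01, 0.05, 0.10):
    print(f"serial fraction {s:4.0%}: speedup {accelerated_speedup(s, 3.5, 20):.1f}x")
```

Under these assumptions, a 5% serial fraction already cuts the 20x gain to under 5x.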

#### • Must pay a lot more attention to locality within node

- Think about data placement and movement
- Consider "sub-optimal" algorithms that limit data motion



# Would like to code for future machines in a portable way

- Spatial and temporal portability



#### Separation of labor

- Programmer exposes parallelism and locality
- Compiler, tools, and runtime map onto specific hardware
- Optimized libraries for various platforms

## **Bold Prediction:**

- Future HPC Programming Model: MPI + OpenMP
- Can we make this easier?
  - Threading, vectorization, data placement
- A recent poll at NERSC found that 80% of apps use a single level of parallelism


#### • Why & when to convert to hybrid programming model?

- When code becomes network bound
- When load-balancing and synchronization overheads become large
- When straight MPI uses excessive memory
- To take advantage of hybrid compute nodes

#### Programming tools are going to be critical

- Exposing parallelism (especially higher in call chain)
- Data placement and movement in the memory hierarchy



#### **Beyond Classic HPC**












# **Merging of HPC and Data Analytics**







#### HPC + Analytics workflows



#### HPC underneath the covers


# **Thank You**

# **Questions?**


