Recent trends in high-performance computing (HPC) systems have clearly
indicated that future increases in performance, beyond those resulting
from improvements in single-processor performance, will be achieved through
corresponding increases in system scale, i.e., through a significantly larger
component count. As the raw computational performance of the world's fastest
HPC systems grows from today's tera-scale to next-generation
peta-scale capability and beyond, their number of computational, networking,
and storage components will increase from the ten to one hundred thousand
compute nodes of today's systems to several hundred thousand compute
nodes and more in the foreseeable future. This substantial growth in system
scale, and the resulting component count, poses a challenge for HPC system
and application software with respect to fault tolerance and resilience.
Furthermore, recent experiences on extreme-scale HPC systems with
non-recoverable soft errors, i.e., bit flips in memory, cache, registers,
and logic, have added another major source of concern. The probability of
such errors grows not only with system size, but also with increasing
architectural vulnerability caused by employing accelerators, such as FPGAs
and GPUs, and by shrinking nanometer technology. Reactive fault tolerance
technologies, such as checkpoint/restart, are unable to handle high failure
rates due to their associated overheads, while proactive resiliency
technologies, such as preemptive migration, simply fail because random soft
errors cannot be predicted. Moreover, soft errors may even remain
undetected, resulting in silent data corruption.
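The checkpoint/restart overhead argument can be made concrete with Young's first-order approximation for the optimal checkpoint interval (later refined by Daly), sketched here assuming a checkpoint commit time \(\delta\) and a system mean time between failures \(M\):

```latex
% Young's first-order approximation of the optimal checkpoint interval:
% the interval grows only with the square root of the MTBF M, so the
% overhead fraction \delta / \tau_{opt} rises as M shrinks.
\tau_{\mathrm{opt}} \approx \sqrt{2\,\delta\,M}
```

Since the MTBF of a system of \(N\) independent components scales roughly as the per-component MTBF divided by \(N\), the fraction of time spent checkpointing grows on the order of \(\sqrt{N}\), which is why purely reactive checkpoint/restart struggles at extreme scale.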
The goal of the HPC Resiliency Summit is to bring together experts in the
area of fault tolerance and resiliency for high-performance computing from
national laboratories and universities to present their achievements and to
discuss the challenges ahead. The secondary goal is to raise awareness in
the HPC community about existing solutions, ongoing and planned work, and
future research and development needs. The workshop program consists of a
series of invited talks by experts and a round table discussion.
Workshop general co-chairs:
Stephen L. Scott, Computer Science and Mathematics Division, Oak Ridge National Laboratory
Chokchai (Box) Leangsuksun, eXtreme Computing Research Group, Louisiana Tech University

Program co-chairs:
Christian Engelmann, Computer Science and Mathematics Division, Oak Ridge National Laboratory
7:30 - 8:15AM : Breakfast
8:15 - 8:30AM : Welcome
Stephen L. Scott, Oak Ridge National Laboratory
8:30 - 9:30AM : Keynote: Resilience Challenges
John T. Daly, U.S. Department of Defense [slides]
9:30 - 10:00AM : Increasing Fault Resiliency in a Message-Passing Environment
Rolf Riesen, Sandia National Laboratories [slides]
10:00 - 10:30AM : Coffee Break
10:30 - 11:00AM : Transparent Process-level Fault Tolerance for MPI: Challenges and Solutions
Frank Mueller, North Carolina State University [slides]
11:00 - 11:30AM : Overview of the Scalable Checkpoint/Restart (SCR) Library
Adam Moody, Lawrence Livermore National Laboratory [slides]
11:30AM - 12:00PM : Designing Fault Resilient and Fault Tolerant Systems with InfiniBand
D.K. Panda, The Ohio State University [slides]
12:00 - 1:30PM : Lunch Break
1:30 - 2:00PM : Adaptive Runtime Support for Fault Tolerance
Esteban Meneses, Celso Mendes, and Laxmikant Kale, University of Illinois at Urbana-Champaign [slides]
2:00 - 2:30PM : Data Fusion and Statistical Analysis: Piercing the Darkness of the Black Box
Jim Brandt, Sandia National Laboratories
2:30 - 3:00PM : Reliability-Aware Scalability Models for High Performance Computing
Ziming Zheng, Illinois Institute of Technology [slides]
3:00 - 3:30PM : Coffee Break
3:30 - 4:00PM : Highly Scalable Fault Tolerance for Exascale HPC
Zizhong (Jeffrey) Chen, Colorado School of Mines [slides]
4:00 - 4:30PM : Fault Tolerant Algorithms for Heat Transfer Problems
Hatem Ltaief, University of Tennessee - Knoxville [slides]
4:30 - 5:00PM : Discussion & Closing