Recent trends in high-performance computing (HPC) systems have clearly
indicated that future increases in performance, in excess of those resulting
from improvements in single-processor performance, will be achieved through
corresponding increases in system scale, i.e., using a significantly larger
component count. As the raw computational performance of the world's fastest
HPC systems increases from today's tera-scale to next-generation
peta-scale capability and beyond, their number of computational, networking,
and storage components will grow from the ten to one hundred thousand
compute nodes of today's systems to several hundred thousand compute
nodes and more in the foreseeable future. This substantial growth in system
scale, and the resulting component count, poses a challenge for HPC system
and application software with respect to fault tolerance and resilience.
Furthermore, recent experiences on extreme-scale HPC systems with
non-recoverable soft errors, i.e., bit flips in memory, cache, registers,
and logic, have added another major source of concern. The probability of such
errors grows not only with system size, but also with increasing
architectural vulnerability caused by employing accelerators, such as FPGAs
and GPUs, and by shrinking nanometer technology. Reactive fault tolerance
technologies, such as checkpoint/restart, are unable to handle high failure
rates due to their associated overheads, while proactive resiliency technologies,
such as preemptive migration, fail because random soft errors cannot be
predicted. Moreover, soft errors may even remain undetected, resulting in
silent data corruption.
The goal of the Workshop on Resiliency for Petascale HPC is to bring
together experts in the area of fault tolerance and resiliency for
high-performance computing from national laboratories and universities to
present their achievements and to discuss the challenges ahead. The
secondary goal is to raise awareness in the HPC community about existing
solutions, ongoing and planned work, and future research and development
needs. The workshop program consists of a series of invited talks by experts
and a round table discussion.
Workshop general co-chairs:
Stephen L. Scott, Computer Science and Mathematics Division, Oak Ridge National Laboratory, USA, scottsl@ornl.gov |
Chokchai (Box) Leangsuksun, eXtreme Computing Research Group, Computer Science Program, Louisiana Tech University, USA, box@latech.edu |
Program co-chairs:
Mihaela Paun, Mathematics and Statistics Program, Louisiana Tech University, USA, mpaun@latech.edu |
Christian Engelmann, Computer Science and Mathematics Division, Oak Ridge National Laboratory, USA, engelmannc@ornl.gov |
7:30 - 8:15AM : | Breakfast |
8:15 - 8:30AM : |
Welcome
Stephen L. Scott, Oak Ridge National Laboratory |
8:30 - 9:00AM : |
Resilience: Sacrificing Previous Convictions About Physical Laws
John T. Daly, Los Alamos National Laboratory |
9:00 - 9:30AM : |
Failure in Supercomputers and Supercomputer Storage
Garth Gibson, Carnegie Mellon University / Panasas, Inc. [slides] |
9:30 - 10:00AM : |
System-level Checkpoint/Restart with BLCR
Paul Hargrove, Lawrence Berkeley National Laboratory [slides] |
10:00 - 10:30AM : | Coffee Break |
10:30 - 11:00AM : |
Process-Level Fault Tolerance for Job Healing in HPC Environments
Stephen L. Scott, Oak Ridge National Laboratory [slides] |
11:00 - 11:30AM : |
A Coordinated Infrastructure for Fault Tolerant Systems (CIFTS)
Rinku Gupta, Argonne National Laboratory [slides] |
11:30AM - 12:00PM : |
Towards Support for Fault Tolerance in the MPI Standard
Greg Koenig, Oak Ridge National Laboratory [slides] |
12:00 - 1:30PM : | Lunch Break |
1:30 - 2:00PM : |
Studying Systems as Artifacts
Adam J. Oliner, Stanford University [slides] |
2:00 - 2:30PM : |
Combining System Characterization and Novel Execution Models to Achieve Scalable Robust Computing
Jim Brandt, Sandia National Laboratories |
2:30 - 3:00PM : |
Root Cause Analysis
Jon Stearley, Sandia National Laboratories [slides] |
3:00 - 3:30PM : | Coffee Break |
3:30 - 4:00PM : |
Accurate Prediction of Soft Error Vulnerability of Scientific Applications
Greg Bronevetsky, Lawrence Livermore National Laboratory [slides] |
4:00 - 4:30PM : |
Modular Redundancy in HPC Systems: Why, Where, When and How?
Christian Engelmann, Oak Ridge National Laboratory [slides] |
4:30 - 5:00PM : |
Making Resilience a Reality Through a Resilience Consortium
James Elliott, Louisiana Tech University [slides] |
5:00 - 5:30PM : | Discussion & Closing |