Held in conjunction with the
Los Alamos Computer Science Symposium (LACSS) 2010
Santa Fe, New Mexico, USA,
October 13, 2010.
http://www.csm.ornl.gov/srt/conferences/ResilienceSummit
Recent trends in high-performance computing (HPC) systems have clearly
indicated that future increases in performance, in excess of those
resulting from improvements in single-processor performance, will be
achieved through corresponding increases in system scale, i.e., using a
significantly larger component count. As the raw computational
performance of the world's fastest HPC systems increases from today's
petascale to next-generation exascale capability and beyond, the
number of computational, networking, and storage components will
grow from the ten to one hundred thousand compute nodes of today's
systems to several hundred thousand compute nodes and more in
the foreseeable future. This substantial growth in system scale, and the
resulting component count, poses a challenge for HPC system and
application software with respect to fault tolerance and resilience.
Furthermore, recent experiences on extreme-scale HPC systems with
non-recoverable soft errors, i.e., bit flips in memory, caches,
registers, and logic, have added another major source of concern. The
probability of such errors not only grows with system size, but also
with increasing architectural vulnerability caused by employing
accelerators, such as FPGAs and GPUs, and by shrinking nanometer
technology. Reactive fault tolerance technologies, such as
checkpoint/restart, are unable to handle high failure rates due to
their associated overheads, while proactive resilience technologies,
such as preemptive migration, simply fail because random soft errors
cannot be predicted. Moreover, soft errors may even remain undetected,
resulting in silent data corruption.
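The overhead argument against checkpoint/restart at high failure rates can be made concrete with Young's classic first-order approximation of the optimal checkpoint interval, t_opt = sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint. The sketch below (a minimal illustration with hypothetical numbers, not any system's actual parameters) shows how the useful-work fraction collapses as the system-level MTBF shrinks with growing component count:

```python
import math

def young_interval(checkpoint_cost, mtbf):
    # Young's first-order approximation of the optimal checkpoint
    # interval (seconds), given checkpoint cost C and system MTBF.
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def efficiency(checkpoint_cost, mtbf):
    # Fraction of wall-clock time spent on useful work, ignoring
    # restart cost: each interval pays the checkpoint cost once, and
    # each failure loses on average half an interval of rework.
    t = young_interval(checkpoint_cost, mtbf)
    waste = checkpoint_cost / t + t / (2.0 * mtbf)
    return max(0.0, 1.0 - waste)

# Hypothetical numbers: a 10-minute checkpoint cost and a system MTBF
# that drops as the machine grows from today's to exascale node counts.
for mtbf_hours in (24.0, 6.0, 1.0):
    e = efficiency(600.0, mtbf_hours * 3600.0)
    print(f"MTBF {mtbf_hours:4.1f} h -> efficiency {e:.0%}")
```

With these assumed numbers the useful-work fraction falls from roughly 88% at a 24-hour MTBF to under half at a 1-hour MTBF, which is the scaling concern the workshop addresses.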
The goal of the Workshop on Resilience for Exascale HPC is to bring
together experts in the area of fault tolerance and resilience for
high-performance computing from national laboratories and universities
to present their achievements and to discuss the challenges ahead. The
secondary goal is to raise awareness in the HPC community about existing
solutions, ongoing and planned work, and future research and development
needs. The workshop program consists of a series of invited talks by
experts and a round table discussion.
Speakers:

Imran Haque, Stanford University, USA
"Hard Data on Soft Errors: A Global-Scale Assessment of GPGPU Memory Soft Error Rates"

Sarah E. Michalak, Los Alamos National Laboratory, USA
"Soft Errors, Silent Data Corruption, and Exascale Computing"

Christian Engelmann, Oak Ridge National Laboratory, USA
"Scalable HPC System Monitoring"

Jim Brandt, Sandia National Laboratories, USA
"Scalable HPC Monitoring and Analysis for Understanding and Automated Response"

Ana Gainaru, University of Illinois at Urbana-Champaign, USA
"Mining event log patterns in HPC systems"

Rob Aulwes, Los Alamos National Laboratory, USA
"Integrating Fault Tolerance into the Monte Carlo Application Toolkit"

Chokchai (Box) Leangsuksun, Louisiana Tech University, USA
"HPC rejuvenation and GPGPU checkpoint model"

Amina Guermouche, INRIA, France
"An Uncoordinated Checkpoint Protocol for Send-deterministic HPC Application"

Edgar Gabriel, University of Houston, USA
"VolpexMPI: Robust Execution of MPI Applications through Process Replication"
Workshop general co-chairs:

Stephen L. Scott, Computer Science and Mathematics Division, Oak Ridge National Laboratory
Chokchai (Box) Leangsuksun, eXtreme Computing Research Group, Louisiana Tech University
Program chair:

Christian Engelmann, Computer Science and Mathematics Division, Oak Ridge National Laboratory
Program committee:

Sean Blanchard, Los Alamos National Laboratory
Jim Brandt, Sandia National Laboratories, USA
Greg Bronevetsky, Lawrence Livermore National Laboratory
Franck Cappello, UIUC-INRIA Joint Laboratory on PetaScale Computing
Nathan DeBardeleben, Advanced Computing Systems Program, DoD
Ann Gentile, Sandia National Laboratories