Resilience Summit 2008

Held in conjunction with the Los Alamos Computer Science Symposium (LACSS) 2008
Santa Fe, New Mexico, October 15, 2008
http://www.csm.ornl.gov/srt/ResilienceSummit2008/

7:30 - 8:15AM : Breakfast
8:15 - 8:30AM : Welcome
8:30 - 9:00AM : Resilience: Sacrificing Previous Convictions About Physical Laws
John T. Daly, Los Alamos National Laboratory
9:00 - 9:30AM : Failure in Supercomputers and Supercomputer Storage
Garth Gibson, Carnegie Mellon University / Panasas, Inc.
9:30 - 10:00AM : System-level Checkpoint/Restart with BLCR
Paul Hargrove, Lawrence Berkeley National Laboratory
10:00 - 10:30AM : Coffee Break
10:30 - 11:00AM : Process-Level Fault Tolerance for Job Healing in HPC Environments
Stephen L. Scott, Oak Ridge National Laboratory
11:00 - 11:30AM : A Coordinated Infrastructure for Fault Tolerant Systems (CIFTS)
Rinku Gupta, Argonne National Laboratory
11:30AM - 12:00PM : Towards Support for Fault Tolerance in the MPI Standard
Greg Koenig, Oak Ridge National Laboratory
12:00 - 1:30PM : Lunch Break
1:30 - 2:00PM : Studying Systems as Artifacts
Adam J. Oliner, Stanford University
2:00 - 2:30PM : Combining System Characterization and Novel Execution Models to Achieve Scalable Robust Computing
Jim Brandt, Sandia National Laboratories
2:30 - 3:00PM : Root Cause Analysis
Jon Stearley, Sandia National Laboratories
3:00 - 3:30PM : Coffee Break
3:30 - 4:00PM : Accurate Prediction of Soft Error Vulnerability of Scientific Applications
Greg Bronevetsky, Lawrence Livermore National Laboratory
4:00 - 4:30PM : Modular Redundancy in HPC Systems: Why, Where, When and How?
Christian Engelmann, Oak Ridge National Laboratory
4:30 - 5:00PM : Making Resilience a Reality Through a Resilience Consortium
James Elliott, Louisiana Tech University
5:00 - 5:30PM : Discussion & Closing