Workshop on Resiliency in High Performance Computing (RESILIENCE 2008)

http://xcr.cenit.latech.edu/resilience2008/

HAPCW Co-Chairs

Stephen L. Scott, Oak Ridge National Laboratory, and Chokchai (Box) Leangsuksun, Louisiana Tech University

Publication Chair

Hong Ong, Oak Ridge National Laboratory



Title

Authors


10:30-12:30

Welcome

Stephen Scott and Box Leangsuksun

Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations

Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin He

Performance and Availability Tradeoffs in Replicated File Systems

Jiaying Zhang and Peter Honeyman

A Technique for Lock-less Mirroring in Parallel File Systems

Bradley W. Settlemyer and Walter B. Ligon III

14:00-16:00

Application MTTFE vs. Platform MTTF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale

J.T. Daly, L.A. Pritchett-Sheats, and S.E. Michalak

Application Resilience: Making Progress in Spite of Failure

William M. Jones, John T. Daly, and Nathan A. DeBardeleben

Fault Tolerance and Recovery of Scientific Workflows on Computational Grids

Gopi Kandaswamy, Anirban Mandal, and Daniel A. Reed

Fault Tolerance in Cluster Federations with O2P-CF

Thomas Ropars and Christine Morin


16:30-18:30


Reliability-aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

Nichamon Naksinehaboon, Yudan Liu, Chokchai (Box) Leangsuksun, Raja Nassar, Mihaela Paun, and Stephen L. Scott

Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems

Ann Gentile, Jim Brandt, Philippe Pebay, David Thompson, Matthew Wong, Bert Debusschere, and Jackson Mayo

Bad Words: Finding Faults in Spirit's Syslog

Jon Stearley
