Resilience 2008

Overview:
Recent trends in high-performance computing (HPC) systems clearly indicate that future increases in performance, beyond those resulting from improvements in single-processor performance, will be achieved through corresponding increases in system scale, i.e., through a significantly larger component count. As the raw computational performance of the world's fastest HPC systems increases from today's tera-scale to next-generation peta-scale capability and beyond, the number of computational, networking, and storage components will grow from the ten-to-one-hundred thousand compute nodes of today's systems to several hundreds of thousands of compute nodes and more in the foreseeable future. This substantial growth in system scale, and the resulting component count, poses a challenge for HPC system and application software with respect to reliability, availability, and serviceability (RAS). Serviceability concerns the effective means by which corrective and preventive maintenance can be performed on a system. Higher serviceability improves availability and helps retain quality, performance, and continuity of services at expected levels. Together, high availability (HA), serviceability, and HPC will bring further benefits to critical, shared, major high-end computing (HEC) resource environments.

A recent study performed at Los Alamos National Laboratory estimates the System Mean Time To Failure (SMTTF) for a next-generation peta-scale HPC system. Extrapolating from current HPC system performance, scale, and SMTTF, this study suggests that the system mean time between failures (SMTBF), i.e., the actual time spent on useful computation between full system recovery and the next failure, will fall to only 1.25 hours on a petaflop machine. The same study also estimates the overhead of the current state-of-the-art fault tolerance strategy, checkpoint/restart, for such a system. The results of this analysis show that a computational job that could normally complete in 100 hours on a failure-free peta-scale HPC system will actually take 251 hours to complete once the cost of failure recovery is included. What this analysis implies is startling: about 60% of the cycles (and investment) on next-generation peta-scale HPC systems may be lost to the overhead of dealing with reliability issues, unless something happens to drastically change the current course.
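
The 60% figure follows directly from the two run times quoted above. As a minimal sketch, assuming only those two numbers (the function and variable names are illustrative and do not come from the study), the arithmetic is:

```python
def lost_fraction(failure_free_hours: float, actual_hours: float) -> float:
    """Return the fraction of wall-clock time not spent on useful computation.

    Illustrative helper only; it is not part of the cited study.
    """
    if failure_free_hours < 0 or actual_hours <= 0:
        raise ValueError("run times must be positive")
    if failure_free_hours > actual_hours:
        raise ValueError("actual run time cannot be shorter than the failure-free run time")
    return (actual_hours - failure_free_hours) / actual_hours


# Figures quoted in the overview above.
FAILURE_FREE_HOURS = 100.0  # job length on a failure-free system
ACTUAL_HOURS = 251.0        # job length once failure recovery is included

overhead = lost_fraction(FAILURE_FREE_HOURS, ACTUAL_HOURS)
slowdown = ACTUAL_HOURS / FAILURE_FREE_HOURS

print(f"Time lost to fault handling: {overhead:.1%}")  # 151 / 251 of the run
print(f"Slowdown versus failure-free run: {slowdown:.2f}x")
```

Here 151 of the 251 hours go to checkpointing, recovery, and recomputation, which is where the figure of roughly 60% comes from.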

To address the question of computing resiliency, fault tolerance and high availability become critical research topics. The goal of this workshop is to bring together the community in an effort to increase the resiliency of modern computing platforms such that the application mean time to interrupt (MTTI) is significantly greater than the hardware/software mean time between failures (MTBF). More simply put, MTTI >> MTBF, so that applications have an opportunity to run to completion without experiencing a significant impact as a result of a computer failure.

Submission Guidelines:
Original, unpublished work is required. Manuscripts shall be a maximum of 6 pages in IEEE style (two columns, single-spaced, 10-point font), including tables and illustrations. Accepted contributions will be published on the proceedings website and on a CD, which will be available at the workshop. Please send all submissions by email, in PostScript or PDF format, to Dr. Box Leangsuksun, box@latech.edu.
Authors of selected papers will be invited to submit their paper for publication in a special issue of the International Journal of Grid and High Performance Computing (IJGHPC), published by IGI Publishing, by September 15, 2008.

Resilience 2008 topics of interest include, but are not limited to:
• Hardware for fault detection and resiliency
• System-level resiliency for HPC
• Statistical methods to improve system resiliency
• Experiments with fault tolerance mechanisms
• Resource management for system resiliency and availability
• Resilient systems based on hardware probes
• Reliability and robustness in Grid computing
• Failure recovery strategies in Grid and HPC
• Reliable communication in Grid and HPC

Workshop General Co-Chairs:
• Stephen L. Scott
Computer Science & Mathematics Division
Oak Ridge National Laboratory
scottsl@ornl.gov

• Chokchai (Box) Leangsuksun
SWEPCO Endowed Associate Professor of Computer Science,
Louisiana Tech University, USA
box@latech.edu

Publication Chair:
• Dr. Hong Ong, Oak Ridge National Laboratory

Program Committee:
• Ann Gentile, Sandia National Laboratories
• Bill Yurcik, US Army Research Lab
• Box Leangsuksun, Louisiana Tech University
• Christian Engelmann, Oak Ridge National Laboratory
• Daniel S. Katz, Louisiana State University
• Daniel Stanzione, Jr., Arizona State University
• Frank Mueller, North Carolina State University
• Geoffroy Vallee, Oak Ridge National Laboratory
• George Ostrouchov, Oak Ridge National Laboratory
• Hong Ong, Oak Ridge National Laboratory
• Jim Brandt, Sandia National Laboratories
• John Daly, Los Alamos National Laboratory
• John West, ERDC Major Shared Resource Center
• Mihaela Paun, Louisiana Tech University
• Nathan DeBardeleben, Los Alamos National Laboratory
• Stephen Scott, Oak Ridge National Laboratory
• Thomas Naughton, Oak Ridge National Laboratory
• Xian-He Sun, Illinois Institute of Technology
• Xubin (Ben) He, Tennessee Tech University
• Yung-chin Fang, Dell
• Zhiling Lan, Illinois Institute of Technology

Important Dates:

• Paper Submission Deadline: December 9, 2007 (extended)
• Notification Deadline: January 15, 2008
• Camera Ready Deadline: January 30, 2008