Resilience 2009

Overview:

Recent trends in high-performance computing (HPC) systems clearly indicate that future increases in performance, beyond those resulting from improvements in single-processor performance, will be achieved through corresponding increases in system scale, i.e., through a significantly larger component count. As the raw computational performance of the world's fastest HPC systems increases from today's tera-scale to next-generation peta-scale capability and beyond, their number of computational, networking, and storage components will grow from the ten to one hundred thousand compute nodes of today's systems to several hundred thousand compute nodes and more in the foreseeable future. This substantial growth in system scale, and the resulting component count, poses a challenge for HPC system and application software with respect to fault tolerance and resilience.

Furthermore, recent experiences on extreme-scale HPC systems with non-recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic, have added another major source of concern. The probability of such errors grows not only with system size, but also with increasing architectural vulnerability caused by employing accelerators, such as FPGAs and GPUs, and by shrinking nanometer technology. Reactive fault tolerance technologies, such as checkpoint/restart, are unable to handle high failure rates due to their associated overheads, while proactive resiliency technologies, such as preemptive migration, simply fail because random soft errors cannot be predicted. Moreover, soft errors may even remain undetected, resulting in silent data corruption.

The goal of this workshop is to bring together experts in the area of fault tolerance and resiliency for HPC to present the latest achievements and to discuss the challenges ahead. Accepted papers will be included with the HPDC conference proceedings published by ACM. Resilience 2009 is the follow-on to the successful Resilience 2008 workshop held in conjunction with CCGrid in Lyon, France.

Submission Guidelines:
Original, unpublished work is required. Submissions shall be a maximum of 10 ACM SIG style pages (http://www.acm.org/sigs/publications/proceedings-templates), including tables and illustrations. All submitted manuscripts will be reviewed by a distinguished international program committee. Accepted contributions will be published with the HPDC conference proceedings through ACM. Papers should be submitted electronically via https://ssl.linklings.net/conferences/hpdc.

Topics of interest include, but are not limited to:
• Reports on current HPC system and application resiliency
• HPC resiliency metrics and standards
• HPC system and application resiliency analysis
• HPC system and application-level fault handling and anticipation
• HPC system and application health monitoring
• Resiliency for HPC file and storage systems
• System-level checkpoint/restart for HPC
• System-level preemptive migration for HPC
• Algorithm-based resiliency for HPC
• Fault-tolerant MPI concepts and solutions
• Soft error detection and recovery in HPC systems
• HPC system and application log analysis
• Statistical methods to identify failure root causes
• Fault injection studies in HPC environments
• High availability solutions for HPC systems
• Reliability and availability analysis
• Hardware for fault detection and recovery

Workshop General Co-Chairs:
• Stephen L. Scott
Computer Science & Mathematics Division
Oak Ridge National Laboratory
scottsl@ornl.gov

• Chokchai (Box) Leangsuksun
SWEPCO Endowed Associate Professor of Computer Science
Louisiana Tech University, USA
box@latech.edu

Program Chair:
• Christian Engelmann
Computer Science and Mathematics Division
Oak Ridge National Laboratory
engelmannc@ornl.gov

Program Committee:
• Ann Gentile, Sandia National Laboratories, USA
• Aurelien Bouteiller, University of Tennessee, USA
• Chokchai (Box) Leangsuksun, Louisiana Tech University, USA
• Christian Engelmann, Oak Ridge National Laboratory, USA
• Daniel S. Katz, Louisiana State University, USA
• Dan Stanzione, Arizona State University, USA
• Franck Cappello, INRIA, France
• Geoffroy Vallee, Oak Ridge National Laboratory, USA
• George Bosilca, University of Tennessee, USA
• George Ostrouchov, Oak Ridge National Laboratory, USA
• Greg Bronevetsky, Lawrence Livermore National Laboratory, USA
• Gregory M. Thorson, SGI, USA
• Hong Ong, Oak Ridge National Laboratory, USA
• Jim Brandt, Sandia National Laboratories, USA
• John T. Daly, Center for Exceptional Computing, USA
• Jon Stearley, Sandia National Laboratories, USA
• Li Ou, Dell, USA
• Mihaela Paun, Louisiana Tech University, USA
• Nathan DeBardeleben, Los Alamos National Laboratory, USA
• Paul Hargrove, Lawrence Berkeley National Laboratory, USA
• Stephen Poole, Oak Ridge National Laboratory, USA
• Stephen L. Scott, Oak Ridge National Laboratory, USA
• Sudharshan Vazhkudai, Oak Ridge National Laboratory, USA
• Thomas Naughton, Oak Ridge National Laboratory, USA
• Tong Liu, Mellanox, USA
• Xian-He Sun, Illinois Institute of Technology, USA
• Xubin (Ben) He, Tennessee Tech University, USA
• Yung-Chin Fang, Dell, USA
• Zhiling Lan, Illinois Institute of Technology, USA

Important Dates:
• Paper Submission Deadline: March 4, 2009
• Notification Deadline: March 18, 2009
• Camera Ready Deadline: April 2, 2009