6th Workshop on Resiliency in High Performance Computing (Resilience)
in Clusters, Clouds, and Grids in conjunction with the

19th International European Conference on Parallel and Distributed Computing
(Euro-Par 2013), Aachen, Germany, August 26-30, 2013

Overview:

Clusters, Clouds, and Grids are three different computational paradigms with the intent or potential to support High Performance Computing (HPC). Currently, they consist of hardware, management, and usage models particular to different computational regimes; for example, high-performance cluster systems designed to support tightly coupled scientific simulation codes typically utilize high-speed interconnects, while commercial cloud systems designed to support software as a service (SaaS) do not. However, in order to support HPC, all of these paradigms must utilize large numbers of resources, and effective HPC in any of them must therefore address the issue of resiliency at large scale.

Recent trends in HPC systems have clearly indicated that future increases in performance, in excess of those resulting from improvements in single-processor performance, will be achieved through corresponding increases in system scale, i.e., using a significantly larger component count. As the raw computational performance of these HPC systems increases from today's tera- and peta-scale to next-generation multi-peta-scale capability and beyond, their number of computational, networking, and storage components will grow from the ten to one hundred thousand compute nodes of today's systems to several hundred thousand compute nodes and more in the foreseeable future. This substantial growth in system scale, and the resulting component count, poses a challenge for HPC system and application software with respect to fault tolerance and resilience.

Furthermore, recent experiences on extreme-scale HPC systems with non-recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic, have added another major source of concern. The probability of such errors grows not only with system size, but also with increasing architectural vulnerability caused by employing accelerators, such as FPGAs and GPUs, and by shrinking nanometer technology. Reactive fault tolerance technologies, such as checkpoint/restart, are unable to handle high failure rates due to their associated overheads, while proactive resiliency technologies, such as migration, simply fail because random soft errors cannot be predicted. Moreover, soft errors may even remain undetected, resulting in silent data corruption.
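The interplay between checkpoint/restart and silent data corruption described above can be illustrated with a minimal, purely illustrative sketch (not part of any workshop material; all names, the checkpoint path, and the toy computation are assumptions for the example): an iterative computation periodically serializes its state to disk, and a checksum stored with each checkpoint lets a restart detect a silently corrupted checkpoint instead of blindly resuming from bad data.

```python
# Illustrative application-level checkpoint/restart sketch with
# checksum-based detection of silent corruption in the checkpoint file.
# All names here are hypothetical; real HPC codes use far more elaborate
# schemes (e.g., multi-level or asynchronous checkpointing).
import hashlib
import os
import pickle
import tempfile

# Hypothetical checkpoint location for this demo.
CKPT = os.path.join(tempfile.gettempdir(), "resilience_demo.ckpt")

def save_checkpoint(state, path=CKPT):
    """Serialize state and store it together with its SHA-256 digest."""
    blob = pickle.dumps(state)
    digest = hashlib.sha256(blob).hexdigest()
    with open(path, "wb") as f:
        pickle.dump((digest, blob), f)

def load_checkpoint(path=CKPT):
    """Return the saved state, or None if missing or corrupted."""
    try:
        with open(path, "rb") as f:
            digest, blob = pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError, ValueError, TypeError):
        return None  # unreadable checkpoint: treat as lost
    if hashlib.sha256(blob).hexdigest() != digest:
        return None  # silent corruption detected: discard checkpoint
    return pickle.loads(blob)

def run(steps=10, ckpt_interval=3):
    """Toy iterative computation that checkpoints every few steps."""
    state = load_checkpoint() or {"step": 0, "acc": 0}
    while state["step"] < steps:
        state["acc"] += state["step"]  # stand-in for real computation
        state["step"] += 1
        if state["step"] % ckpt_interval == 0:
            save_checkpoint(state)
    return state["acc"]
```

Note the trade-off the sketch makes visible: every checkpoint costs serialization and I/O time (the overhead the overview refers to), and the checksum only detects corruption of the checkpoint itself; corruption occurring in live memory between checkpoints would still propagate undetected.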


Important websites:

Resilience 2013 at http://xcr.cenit.latech.edu/resilience2013
Euro-Par 2013 at http://www.europar2013.org

Prior conference websites:
Resilience 2012 at http://xcr.cenit.latech.edu/resilience2012

Important Dates:
Paper submission deadline on June 10, 2013
Notification deadline on July 8, 2013
Camera-ready deadline on October 3, 2013

Submission Guidelines:
Authors are invited to submit papers electronically in English in PDF format via EasyChair at <https://www.easychair.org/conferences/?conf=resilience2013>. Submitted manuscripts should be structured as technical papers and may not exceed 10 pages, including figures, tables, and references, using Springer's Lecture Notes in Computer Science (LNCS) format at <http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0>. Submissions should include an abstract, keywords, and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review.

All manuscripts will be reviewed and judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review, and further action may be taken, including (but not limited to) notifications sent to the heads of the authors' institutions and to the sponsors of the conference. Submissions received after the due date, exceeding the length limit, or not appropriately structured may also not be considered.

The proceedings will be published in Springer's LNCS series as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for the paper to be included in the proceedings. Authors may contact the workshop program chair for more information.



Topics of interest include, but are not limited to:
Reports on current HPC system and application resiliency
HPC resiliency metrics and standards
HPC system and application resiliency analysis
HPC system and application-level fault handling and anticipation
HPC system and application health monitoring
Resiliency for HPC file and storage systems
System-level checkpoint/restart for HPC
System-level migration for HPC
Algorithm-based resiliency fundamentals for HPC (not Hadoop)
Fault tolerant MPI concepts and solutions
Soft error detection and recovery in HPC systems
HPC system and application log analysis
Statistical methods to identify failure root causes
Fault injection studies in HPC environments
High availability solutions for HPC systems
Reliability and availability analysis
Hardware for fault detection and recovery
Resource management for system resiliency and availability

General Co-Chairs:

Stephen L. Scott
Stonecipher/Boeing Distinguished Professor of Computing
Senior Research Scientist - Systems Research Team
Tennessee Tech University and Oak Ridge National Laboratory, USA
scottsl@ornl.gov

Chokchai (Box) Leangsuksun
SWEPCO Endowed Associate Professor of Computer Science
Louisiana Tech University, USA
box@latech.edu


Program Co-Chairs:

Patrick G. Bridges
University of New Mexico, USA
bridges@cs.unm.edu

Christian Engelmann
Oak Ridge National Laboratory, USA
engelmannc@ornl.gov


Program Committee:
• Vassil Alexandrov, Barcelona Supercomputing Center, Spain
• Patrick G. Bridges, University of New Mexico, USA
• Greg Bronevetsky, Lawrence Livermore National Laboratory, USA
• Franck Cappello, INRIA/UIUC, France/USA
• Zizhong Chen, University of California, Riverside, USA
• Andrew Chien, University of Chicago, USA
• Nathan DeBardeleben, Los Alamos National Laboratory, USA
• Christian Engelmann, Oak Ridge National Laboratory, USA
• Kurt B. Ferreira, Sandia National Laboratories, USA
• Cecile Germain, University Paris-Sud, France
• Paul Hargrove, Lawrence Berkeley National Laboratory, USA
• Mark Hoemmen, Sandia National Laboratories, USA
• Larry Kaplan, Cray, USA
• Dieter Kranzlmueller, LMU/LRZ Munich, Germany
• Sriram Krishnamoorthy, Pacific Northwest National Laboratory, USA
• Chokchai (Box) Leangsuksun, Louisiana Tech University, USA
• Celso Mendes, University of Illinois at Urbana Champaign, USA
• Christine Morin, INRIA Rennes, France
• Alexander Reinefeld, Zuse Institute Berlin, Germany
• Rolf Riesen, IBM Research, Ireland
• Stephen L. Scott, Oak Ridge National Laboratory, USA

Workshop Program:
Workshop date: Monday August 26, 2013

• 11:15
Introduction
Stephen L. Scott

• 11:30
"Evaluate the Viability of Application-Driven Cooperative CPU/GPU Fault Detection",
Dong Li, Seyong Lee and Jeffrey Vetter

• 12:00
"GPU Behavior on a Large HPC Cluster",
Nathan DeBardeleben, Sean Blanchard, Laura Monroe, Phil Romero, Daryl Grunau, Craig Idler and Cornell Wright

• 12:30 Lunch break

• 14:30
"A Case for Adaptive Redundancy for HPC Resilience",
Saurabh Hukerikar, Pedro C. Diniz and Robert F. Lucas

• 15:00
"Reliable Service Allocation in Clouds with Memory and Capacity Constraints",
Olivier Beaumont, Lionel Eyraud-Dubois, Pierre Pesneau and Paul Renaud-Goud

• 15:30
"Model-Driven Resilience Assessment of Modifications to HPC Infrastructures",
Christian Straube and Dieter Kranzlmüller

• 16:00 Coffee break

• 16:30
"Asking the right questions: benchmarking fault-tolerant extreme-scale systems",
Patrick Widener, Kurt Ferreira, Scott Levy, Patrick Bridges, Dorian Arnold and Ron Brightwell

• 17:00
"Using Performance Tools to Support Experiments in HPC Resilience",
Thomas Naughton, Swen Boehm, Christian Engelmann and Geoffroy Vallee

• 17:30
Discussion & Closing
Stephen L. Scott