4th Workshop on Resiliency in High Performance Computing (Resilience)
in Clusters, Clouds, and Grids in conjunction with the

17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011), Bordeaux, France, August 29 - September 2nd, 2011


Clusters, Clouds, and Grids are three different computational paradigms with the intent or potential to support High Performance Computing (HPC). Currently, they consist of hardware, management, and usage models particular to different computational regimes. For example, high performance cluster systems designed to support tightly coupled scientific simulation codes typically utilize high-speed interconnects, whereas commercial cloud systems designed to support software as a service (SaaS) do not. However, in order to support HPC, all must at least utilize large numbers of resources, and hence effective HPC in any of these paradigms must address the issue of resiliency at large scale.

Recent trends in HPC systems have clearly indicated that future increases in performance, in excess of those resulting from improvements in single-processor performance, will be achieved through corresponding increases in system scale, i.e., using a significantly larger component count. As the raw computational performance of these HPC systems increases from today's tera- and peta-scale to next-generation multi-peta-scale capability and beyond, their number of computational, networking, and storage components will grow from the ten-to-one-hundred thousand compute nodes of today's systems to several hundreds of thousands of compute nodes and more in the foreseeable future. This substantial growth in system scale, and the resulting component count, poses a challenge for HPC system and application software with respect to fault tolerance and resilience.

Furthermore, recent experiences on extreme-scale HPC systems with non-recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic, have added another major source of concern. The probability of such errors grows not only with system size, but also with increasing architectural vulnerability caused by employing accelerators, such as FPGAs and GPUs, and by shrinking nanometer technology. Reactive fault tolerance technologies, such as checkpoint/restart, are unable to handle high failure rates due to associated overheads, while proactive resiliency technologies, such as migration, simply fail because random soft errors cannot be predicted. Moreover, soft errors may even remain undetected, resulting in silent data corruption.
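
To give a rough feel for why checkpoint/restart overheads grow with system scale, the following minimal sketch combines two textbook approximations that are not part of this call: system MTBF as node MTBF divided by node count (assuming independent, exponentially distributed node failures), and Young's first-order estimate of the optimal checkpoint interval. All function names and numeric values are hypothetical and chosen only for illustration. The waste estimate is only meaningful when the checkpoint cost is much smaller than the system MTBF.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """System MTBF, assuming independent, exponentially distributed node failures."""
    return node_mtbf_hours / node_count


def young_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def waste_fraction(checkpoint_cost_hours: float,
                   interval_hours: float,
                   mtbf_hours: float) -> float:
    """First-order fraction of time lost: checkpoint writes plus expected rework."""
    checkpoint_share = checkpoint_cost_hours / interval_hours
    rework_share = interval_hours / (2.0 * mtbf_hours)
    return checkpoint_share + rework_share


if __name__ == "__main__":
    # Hypothetical inputs: 5-year node MTBF, 5-minute checkpoint cost.
    NODE_MTBF_HOURS = 5 * 365 * 24
    CHECKPOINT_COST_HOURS = 5.0 / 60.0

    for nodes in (10_000, 100_000):
        mtbf = system_mtbf_hours(NODE_MTBF_HOURS, nodes)
        interval = young_interval_hours(CHECKPOINT_COST_HOURS, mtbf)
        waste = waste_fraction(CHECKPOINT_COST_HOURS, interval, mtbf)
        print(f"{nodes:>7} nodes: system MTBF {mtbf:.2f} h, "
              f"checkpoint every {interval:.2f} h, "
              f"estimated waste {waste:.0%}")
```

Under these assumptions, a tenfold increase in node count cuts the system MTBF tenfold and raises the estimated waste by roughly a factor of the square root of ten, which is one way to read the scaling concern described above.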

Important websites:

Euro-Par 2011 at

Prior workshop websites:
Resilience 2010 at
Resilience 2009 at
Resilience 2008 at

Important Dates:
• Full paper submission deadline on June 27, 2011, 11:59PM US EDT (final extension, hard deadline)
• Notification deadline on July 12, 2011
• Early registration deadline on July 19, 2011
• Resilience Workshop on August 30, 2011
• Euro-Par conference on August 29 - September 2nd, 2011
• Camera ready deadline is after the workshop

Submission Guidelines:
Authors are invited to submit papers electronically in English in PDF format via EasyChair at <>. Submitted manuscripts should be structured as technical papers and may not exceed 10 pages, including figures, tables and references, using Springer's Lecture Notes in Computer Science (LNCS) format at <>. Submissions should include abstract, key words and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees.

Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding the length limit, or not appropriately structured may also not be considered.

The proceedings will be published in Springer's LNCS as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for inclusion in the proceedings. Authors may contact the workshop program chair for more information.

Technical Program:

  • Tuesday, August 30, 2011 - 09:30-11:00 - Session 1: Understanding Failures
    • Welcome
      Stephen L. Scott
      Tennessee Tech University (USA) and Oak Ridge National Laboratory (USA)
    • The Malthusian Catastrophe is Upon Us! Are the Largest HPC Machines Ever Up?
      Patricia Kovatch, Matthew Ezell and Ryan Braby
      National Institute for Computational Sciences - The University of Tennessee (USA)
    • Simulating Application Resilience at Exascale
      Rolf Riesen, Kurt Ferreira, Maria Ruiz Varela, Michela Taufer and Arun Rodrigues
      IBM (Ireland), Sandia National Laboratories (USA) and University of Delaware (USA)
    • Framework for Enabling System Understanding
      James Brandt, Frank Chen, Ann Gentile, Chokchai Leangsuksun, Jackson Mayo, Philippe Pebay, Diana Roe, Narate Taerat, David Thompson and Matthew Wong
      Sandia National Laboratories (USA) and Louisiana Tech University (USA)
  • Tuesday, August 30, 2011 - 11:00-11:30 - Coffee Break
  • Tuesday, August 30, 2011 - 11:30-13:00 - Session 2: Soft-Error Resilience
    • Cooperative Application/OS DRAM Fault Recovery
      Patrick Bridges, Mark Hoemmen, Kurt Ferreira, Michael Heroux, Philip Soltero and Ron Brightwell
      University of New Mexico (USA) and Sandia National Laboratories (USA)
    • A Tunable, Software-based DRAM Error Detection and Correction Library for HPC
      David Fiala, Kurt Ferreira, Frank Mueller and Christian Engelmann
      North Carolina State University (USA), Sandia National Laboratories (USA) and Oak Ridge National Laboratory (USA)
    • Reducing the Impact of Soft Errors on Fabric-based Collective Communications
      Jose Carlos Sancho and Jesus Labarta
      Barcelona Supercomputing Center (Spain)
  • Tuesday, August 30, 2011 - 13:00-14:30 - Lunch Break
  • Tuesday, August 30, 2011 - 14:30-16:00 - Session 3: Fault Injection, and Resilience in the Cloud
    • Evaluating application vulnerability to soft errors in multi-level cache hierarchy
      Zhe Ma, Trevor Carlson, Wim Heirman and Lieven Eeckhout
      Imec (Belgium) and Ghent University (Belgium)
    • Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience
      Nathan DeBardeleben, Sean Blanchard, Qiang Guan, Ziming Zhang and Song Fu
      Los Alamos National Laboratory (USA) and University of North Texas (USA)
    • High Availability on Cloud with HA-OSCAR
      Thanadech Thanakornworakij, Rajan Sharma, Blaine Scroggs, Chokchai (Box) Leangsuksun, Zeno Dixon Greenwood, Pierre Riteau and Christine Morin
      Louisiana Tech University (USA), University of Rennes 1 - IRISA (France) and INRIA Rennes - Bretagne Atlantique (France)
  • Tuesday, August 30, 2011 - 16:00-16:30 - Coffee Break
  • Tuesday, August 30, 2011 - 16:30-18:00 - Session 4: Checkpoint/Restart
    • On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance
      Dewan Ibtesham, Dorian Arnold, Kurt Ferreira and Patrick Bridges
      University of New Mexico (USA) and Sandia National Laboratories (USA)
    • Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging?
      Raghunath Rajachandrasekar, Xiangyong Ouyang, Xavier Besseron, Vilobh Meshram and Dhabaleswar K. Panda
      The Ohio State University (USA)
    • Impact of over-decomposition on coordinated checkpoint/rollback protocol
      Xavier Besseron and Thierry Gautier
      The Ohio State University (USA) and INRIA (France)
    • Closing
      Chokchai (Box) Leangsuksun
      Louisiana Tech University (USA)
  • Tuesday, August 30, 2011 - 18:30-20:00 - Euro-Par Welcome Reception

Topics of interest include, but are not limited to:
Reports on current HPC system and application resiliency
HPC resiliency metrics and standards
HPC system and application resiliency analysis
HPC system and application-level fault handling and anticipation
HPC system and application health monitoring
Resiliency for HPC file and storage systems
System-level checkpoint/restart for HPC
System-level migration for HPC
Algorithm-based resiliency fundamentals for HPC (not Hadoop)
Fault tolerant MPI concepts and solutions
Soft error detection and recovery in HPC systems
HPC system and application log analysis
Statistical methods to identify failure root causes
Fault injection studies in HPC environments
High availability solutions for HPC systems
Reliability and availability analysis
Hardware for fault detection and recovery
Resource management for system resiliency and availability

General Co-Chairs:

Stephen L. Scott
Stonecipher/Boeing Distinguished Professor of Computing
Tennessee Tech University, USA
Oak Ridge National Laboratory, USA

Chokchai (Box) Leangsuksun
SWEPCO Endowed Associate Professor of Computer Science
Louisiana Tech University, USA

Program Chair:

Christian Engelmann
Oak Ridge National Laboratory, USA

Publication Co-Chairs:

James Brandt
Sandia National Laboratories, USA

Ann Gentile
Sandia National Laboratories, USA

Program Committee:
• Vassil Alexandrov, Barcelona Supercomputing Center, Spain
• David E. Bernholdt, Oak Ridge National Laboratory, USA
• George Bosilca, University of Tennessee, USA
• Jim Brandt, Sandia National Laboratories, USA
• Patrick G. Bridges, University of New Mexico, USA
• Greg Bronevetsky, Lawrence Livermore National Laboratory, USA
• Franck Cappello, INRIA/UIUC, France/USA
• Kasidit Chanchio, Thammasat University, Thailand
• Zizhong Chen, Colorado School of Mines, USA
• Nathan DeBardeleben, Los Alamos National Laboratory, USA
• Jack Dongarra, University of Tennessee, USA
• Christian Engelmann, Oak Ridge National Laboratory, USA
• Yung-Chin Fang, Dell, USA
• Kurt B. Ferreira, Sandia National Laboratories, USA
• Ann Gentile, Sandia National Laboratories, USA
• Cecile Germain, University Paris-Sud, France
• Rinku Gupta, Argonne National Laboratory, USA
• Paul Hargrove, Lawrence Berkeley National Laboratory, USA
• Xubin He, Virginia Commonwealth University, USA
• Larry Kaplan, Cray, USA
• Daniel S. Katz, University of Chicago, USA
• Thilo Kielmann, Vrije Universiteit Amsterdam, Netherlands
• Dieter Kranzlmueller, LMU/LRZ Munich, Germany
• Zhiling Lan, Illinois Institute of Technology, USA
• Chokchai (Box) Leangsuksun, Louisiana Tech University, USA
• Xiaosong Ma, North Carolina State University, USA
• Celso Mendes, University of Illinois at Urbana Champaign, USA
• Christine Morin, INRIA Rennes, France
• Thomas Naughton, Oak Ridge National Laboratory, USA
• George Ostrouchov, Oak Ridge National Laboratory, USA
• DK Panda, The Ohio State University, USA
• Mihaela Paun, Louisiana Tech University, USA
• Alexander Reinefeld, Zuse Institute Berlin, Germany
• Rolf Riesen, IBM Research, Ireland
• Eric Roman, Lawrence Berkeley National Laboratory, USA
• Stephen L. Scott, Oak Ridge National Laboratory, USA
• Jon Stearley, Sandia National Laboratories, USA
• Gregory M. Thorson, SGI, USA
• Geoffroy Vallee, Oak Ridge National Laboratory, USA
• Sudharshan Vazhkudai, Oak Ridge National Laboratory, USA