8th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids

held in conjunction with the

21st International European Conference on Parallel and Distributed Computing (Euro-Par) 2015

Vienna, Austria, August 24-28, 2015

<https://www.csm.ornl.gov/srt/conferences/Resilience/2015>

Overview

Clouds, Grids, and Clusters are three different computational paradigms with the potential to support High Performance Computing (HPC) and enterprise IT infrastructure. Currently, they consist of hardware, management, and usage models particular to different computational regimes. For example, high performance cluster systems designed to support tightly coupled scientific simulation codes typically utilize high-speed interconnects, whereas commercial cloud systems designed to support software as a service (SaaS) typically do not. However, to support HPC, all three must utilize large numbers of resources, and hence effective HPC in any of these paradigms must address the same issue of resiliency at very large scale.

Recent trends in high-performance computing (HPC) systems have clearly indicated that future increases in performance, in excess of those resulting from improvements in single-processor performance, will be achieved through corresponding increases in system scale, i.e., using a significantly larger component count. As the raw computational performance of the world's fastest HPC systems increases from today's multi-petascale to next-generation exascale capability and beyond, their number of computational, networking, and storage components will grow from the ten to one hundred thousand compute nodes of today's systems to several hundreds of thousands of compute nodes in the foreseeable future. This substantial growth in system scale, and the resulting component count, poses a challenge for HPC system and application software with respect to reliability, availability, and serviceability (RAS).

The expected total component count of these HPC systems calls into question many of today's HPC RAS assumptions. Although the mean time to failure (MTTF) for each individual component, e.g., processor, memory module, and network interface, may be above typical consumer product standards, the probability of failure for the overall system scales proportionally to the number of interdependent components and their combined probabilities of failure. Thus, the enormous number of individual components results in a much lower system mean time to failure (SMTTF), causing more frequent system-wide interruptions than experienced by current HPC systems. This effect is not limited to hardware components, but also extends to software components, e.g., operating system, system software, and applications. Although software components, unlike hardware components, do not become less reliable with age, they do contain other sources of failure, such as design and implementation errors. Furthermore, the health of software components also involves resource utilization, such as processor, memory, and network usage.
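
As a rough illustration of the scaling argument above, the following sketch estimates SMTTF under a simplifying model that is an assumption here, not something stated by the workshop: components fail independently with exponentially distributed times to failure, and the failure of any single component interrupts the whole system. Under that model the component failure rates add, so SMTTF is the reciprocal of the summed rates. The function name and the example figures (a 5-year component MTTF and 100,000 nodes) are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def system_mttf_hours(component_mttfs_hours):
    """Estimate system MTTF for a series system.

    Assumes independent components with exponentially distributed
    times to failure, where any single failure interrupts the system.
    Failure rates (1 / MTTF) then add up across components.
    """
    total_failure_rate = sum(1.0 / mttf for mttf in component_mttfs_hours)
    return 1.0 / total_failure_rate


# Hypothetical figures: 100,000 identical nodes, each with a 5-year MTTF.
node_mttf_hours = 5 * HOURS_PER_YEAR
smttf_hours = system_mttf_hours([node_mttf_hours] * 100_000)
print(f"Estimated SMTTF: {smttf_hours * 60:.1f} minutes")
```

With these assumed figures the estimate comes out at roughly 26 minutes between system-wide interruptions, even though each node on its own would run for years. Real systems deviate from this model (failures are often correlated and not exponentially distributed), so the numbers indicate a trend rather than a prediction.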

To address the issue of computing resiliency, fault tolerance and high availability have become critical research topics. The goal of this workshop is to bring together the community in an effort to facilitate resilient HPC in each of these three computational paradigms -- Clouds, Grids, and Clusters. Their respective differences in architecture, management, and usage models may lend themselves to different approaches to resiliency. Knowledge of these approaches in one may be used to enable resiliency in the others or to define new usage models to enable HPC. This workshop targets fundamental solutions and issues in resiliency for HPC.

Submission Guidelines

Authors are invited to submit papers electronically in English in PDF format. Submitted manuscripts should be structured as technical papers and may not exceed 12 pages, including figures, tables, and references, using Springer's Lecture Notes in Computer Science (LNCS) format at <http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0>. Submissions should include an abstract, keywords, and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding the length limit, or not appropriately structured may also not be considered. The proceedings will be published in Springer's LNCS as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for inclusion in the proceedings. Authors may contact the workshop program chairs for more information.

Important Websites

Resilience 2015 Website: <https://www.csm.ornl.gov/srt/conferences/Resilience/2015>
Resilience 2015 Submissions: <https://easychair.org/conferences/?conf=europar2015ws>
Euro-Par 2015 Website: <http://www.europar2015.org>

Topics

Topics of interest include, but are not limited to:

Hardware for fault detection and resiliency
System-level resiliency for HPC, Grid, Cluster, and Cloud
Algorithm-based resiliency - Generic, fundamental advances (not Hadoop)
Statistical methods to improve system resiliency
Experiments with fault tolerance mechanisms
Resource management for system resiliency and availability
Resilient systems based on hardware probes
Monitoring mechanisms to support fault prediction and fault mitigation
Application-level fault tolerance
Fault prediction and failure modeling

Important Dates

Workshop papers due: June 5, 2015 (extended)
Workshop author notification: June 30, 2015
Workshop early registration: July 17, 2015
Workshop paper (for informal workshop proceedings): July 31, 2015
Workshop camera-ready papers: October 2, 2015

Workshop Chairs

Stephen L. Scott
Senior Research Scientist - Systems Research Team
Tennessee Tech University and Oak Ridge National Laboratory, USA
scottsl@ornl.gov
Chokchai (Box) Leangsuksun
SWEPCO Endowed Associate Professor of Computer Science
Louisiana Tech University, USA
box@latech.edu

Workshop Program Chairs

Patrick G. Bridges
University of New Mexico, USA
bridges@cs.unm.edu
Christian Engelmann
Oak Ridge National Laboratory, USA
engelmannc@ornl.gov

Program Committee

Ferrol Aderholdt, Tennessee Tech University, USA
Dorian Arnold, University of New Mexico, USA
Wesley Bland, Intel Corporation, USA
Greg Bronevetsky, Lawrence Livermore National Laboratory, USA
Franck Cappello, Argonne National Laboratory and University of Illinois at Urbana-Champaign, USA
Zizhong Chen, University of California at Riverside, USA
Andrew A. Chien, University of Chicago and Argonne National Laboratory, USA
Nathan DeBardeleben, Los Alamos National Laboratory, USA
James Elliott, North Carolina State University, USA
Kurt Ferreira, Sandia National Laboratories, USA
Michael Heroux, Sandia National Laboratories, USA
Larry Kaplan, Cray Inc., USA
Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
Sriram Krishnamoorthy, Pacific Northwest National Laboratory, USA
Ignacio Laguna, Lawrence Livermore National Laboratory, USA
Scott Levy, University of New Mexico, USA
Celso Mendes, University of Illinois at Urbana-Champaign, USA
Kathryn Mohror, Lawrence Livermore National Laboratory, USA
Christine Morin, INRIA Rennes, France
Nageswara Rao, Oak Ridge National Laboratory, USA
Alexander Reinefeld, Zuse Institute Berlin, Germany
Rolf Riesen, Intel Corporation, USA
Martin Schulz, Lawrence Livermore National Laboratory, USA
Marc Snir, Argonne National Laboratory, USA
Keita Teranishi, Sandia National Laboratories, USA

Program

09:00 - 10:30 Session 1:
- 09:00 - 09:30 Opening: Stephen L. Scott.
- 09:30 - 10:30 Keynote: Christian Engelmann. Toward A Fault Model And Resilience Design Patterns For Extreme Scale Systems
10:30 - 11:00 Coffee Break
11:00 - 12:30 Session 2:
- 11:00 - 11:30 Alina Sirbu and Ozalp Babaoglu. A Holistic Approach To Log Data Analysis In High-Performance Computing Systems: The Case Of IBM Blue Gene/Q.
- 11:30 - 12:00 Patrick Widener, Kurt Ferreira, Scott Levy and Nathan Fabian. Canaries In A Coal Mine: Using Application-Level Checkpoints To Detect Memory Failures.
- 12:00 - 12:30 Anshu Dubey, Hajime Fujita, Zachary Rubenstein, Brian Van Straalen and Andrew Chien. A Case Study Of Application Structure Aware Resilience Through Differentiated State Saving And Recovery.
12:30 - 14:30 Lunch
14:30 - 16:00 Session 3:
- 14:30 - 15:00 Tatiana Martsinkevich, Thomas Ropars and Franck Cappello. Addressing The Last Roadblock For Message Logging In HPC: Alleviating The Memory Requirement Using Dedicated Resources.
- 15:00 - 15:30 Aiman Fang, Hajime Fujita and Andrew Chien. Towards Understanding Post-Recovery Efficiency For Shrinking And Non-Shrinking Recovery.
- 15:30 - 16:00 Waleed Aloriny and Chris Guy. An Advanced Fault-Tolerant Architecture For IP Routers.
16:00 - 16:30 Coffee Break
16:30 - 18:00 Session 4:
- 16:30 - 17:30 Discussion: Future Directions For HPC Resilience Research
- 17:30 - 18:00 Closing: Stephen L. Scott.