9th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids

held in conjunction with the

22nd International European Conference on Parallel and Distributed Computing (Euro-Par) 2016

Grenoble, France, August 22-26, 2016

<https://www.csm.ornl.gov/srt/conferences/Resilience/2016>

Overview

Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (for example, due to shrinking process technology and near-threshold voltage (NTV) operation), and software complexity increases. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, are essential to ensure the success of extreme-scale HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services.

A fault (e.g., a bug or stuck bit) is the cause of an error; its manifestation as a state change is the error itself (e.g., a bad value or incorrect execution), and the resulting transition to an incorrect service is observed as a failure (e.g., an application abort or system crash). A failure in a computing system is typically observed through an application abort or a full/partial service or system outage. A detectable correctable error is often transparently handled by hardware, such as a single bit flip in memory that is protected with single-error correction double-error detection (SECDED) error correcting code (ECC). A detectable uncorrectable error (DUE) typically results in a failure, such as multiple bit flips in the same addressable word that escape SECDED ECC correction, but not detection, and ultimately cause an application abort. An undetectable error (UE) may result in silent data corruption (SDC), e.g., an incorrect application output. There are many other types of hardware and software faults, errors, and failures in computing systems.
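The SECDED behavior described above can be illustrated with a toy example. The following is a minimal sketch, assuming an extended Hamming (8,4) code rather than the wider codes (such as 72 bits protecting a 64-bit word) commonly used in real memory. The function names encode and decode are hypothetical and chosen only for illustration. A single bit flip is corrected, a double bit flip is detected but not corrected, and three or more flips may be miscorrected, which corresponds to an undetectable error.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # overall parity bit
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = sum(code) % 2          # 0 when the overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one; the syndrome is its position
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with consistent overall parity: double bit flip.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                     # single bit flip: corrected transparently
    print(decode(word))
    word[2] ^= 1                     # second flip: detected, not corrected (DUE)
    print(decode(word))
```

In this sketch, the "uncorrectable" status plays the role of a DUE: the decoder knows the word is bad but cannot recover it, so a real system would typically raise a fault that leads to an application abort.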

Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.

Submission Guidelines

Authors are invited to submit papers electronically in English in PDF format. Submitted manuscripts should be structured as technical papers and may not exceed 12 pages, including figures, tables, and references, using Springer's Lecture Notes in Computer Science (LNCS) format at <http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0>. Submissions should include an abstract, keywords, and the e-mail address of the corresponding author. Papers not conforming to these guidelines may be returned without review. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding the length limit, or not appropriately structured may also not be considered. The proceedings will be published in Springer's LNCS as post-conference proceedings. At least one author of an accepted paper must register for and attend the workshop for inclusion in the proceedings. Authors may contact the workshop program chairs for more information.

Important Websites

Resilience 2016 Website: <https://www.csm.ornl.gov/srt/conferences/Resilience/2016>
Resilience 2016 Submissions: <https://easychair.org/conferences/?conf=europar2016ws>
Euro-Par 2016 Website: <http://europar2016.inria.fr>

Topics

Topics of interest include, but are not limited to:

Theoretical foundations for resilience:
- Metrics and measurement
- Statistics and optimization
- Simulation and emulation
- Formal methods
- Efficiency modeling and uncertainty quantification
Fault detection and prediction:
- Statistical analyses
- Machine learning
- Anomaly detection
- Data and information collection
- Visualization
Monitoring and control for resilience:
- Platform and application monitoring
- Response and recovery
- RAS theory and performability
- Application and platform knobs
- Tunable fidelity and quality of service
End-to-end data integrity:
- Fault tolerant design
- Degraded modes
- Forward migration and verification
- Fault injection
- Soft errors
- Silent data corruption
Enabling infrastructure for resilience:
- RAS systems
- System software and middleware
- Programming models
- Tools
- Next-generation architectures
Resilient solvers and algorithm-based fault tolerance:
- Algorithmic detection and correction of hard and soft faults
- Resilient algorithms
- Fault tolerant numerical methods
- Robust iterative algorithms
- Scalability of resilient solvers and algorithm-based fault tolerance

Important Dates

Workshop papers due: May 25, 2016
Workshop author notification: June 17, 2016
Workshop early registration: July 4, 2016
Workshop paper (for informal workshop proceedings): July 20, 2016
Workshop date: August 23, 2016
Workshop camera-ready papers: October 3, 2016

Workshop Chairs

Stephen L. Scott
Senior Research Scientist - Systems Research Team
Tennessee Tech University and Oak Ridge National Laboratory, USA
scottsl@ornl.gov
Chokchai (Box) Leangsuksun
SWEPCO Endowed Associate Professor of Computer Science
Louisiana Tech University, USA
box@latech.edu

Workshop Program Chairs

Patrick G. Bridges
University of New Mexico, USA
bridges@cs.unm.edu
Christian Engelmann
Oak Ridge National Laboratory, USA
engelmannc@ornl.gov

Program Committee

Ferrol Aderholdt, Oak Ridge National Laboratory, USA
Vassil Alexandrov, Barcelona Supercomputing Center, Spain
Dorian Arnold, University of New Mexico, USA
Wesley Bland, Intel Corporation, USA
Hans-Joachim Bungartz, Technical University of Munich, Germany
Franck Cappello, Argonne National Laboratory and University of Illinois at Urbana-Champaign, USA
Marc Casas, Barcelona Supercomputing Center, Spain
Zizhong Chen, University of California at Riverside, USA
Robert Clay, Sandia National Laboratories, USA
Miguel Correia, Universidade de Lisboa, Portugal
Nathan DeBardeleben, Los Alamos National Laboratory, USA
James Elliott, Sandia National Laboratories, USA
Kurt Ferreira, Sandia National Laboratories, USA
Michael Heroux, Sandia National Laboratories, USA
Larry Kaplan, Cray Inc., USA
Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
Sriram Krishnamoorthy, Pacific Northwest National Laboratory, USA
Ignacio Laguna, Lawrence Livermore National Laboratory, USA
Scott Levy, University of New Mexico, USA
Kathryn Mohror, Lawrence Livermore National Laboratory, USA
Christine Morin, INRIA Rennes, France
Dirk Pflueger, University of Stuttgart, Germany
Nageswara Rao, Oak Ridge National Laboratory, USA
Alexander Reinefeld, Zuse Institute Berlin, Germany
Rolf Riesen, Intel Corporation, USA
Yves Robert, ENS Lyon, France
Thomas Ropars, Universite Grenoble Alpes, France
Martin Schulz, Lawrence Livermore National Laboratory, USA
Keita Teranishi, Sandia National Laboratories, USA

Program

09:00 - 10:30 Session 1:
- 09:00 - 09:30 Opening: Stephen L. Scott.
- 09:30 - 10:00 Laura Monroe, William Jones, Scott Lavigne, Claude Davis, Qiang Guan and Nathan DeBardeleben. On the Inherent Resilience of Integer Operations.
- 10:00 - 10:30 Mario Heene, Alfredo Parra Hinojosa, Dirk Pflüger and Hans-Joachim Bungartz. A Massively-Parallel, Fault-Tolerant Solver for Time-Dependent PDEs in High Dimensions. (Presentation)
10:30 - 11:00 Coffee Break
11:00 - 12:45 Session 2:
- 11:00 - 11:30 Patrick Widener, Kurt Ferreira and Scott Levy. Horseshoes and Hand Grenades: The Case for Approximate Coordination in Local Checkpointing Protocols. (Presentation)
- 11:30 - 12:00 Pedro Diniz, Chunhua Liao, Daniel Quinlan and Robert Lucas. Pragma-controlled Source-to-Source Code Transformations for Robust Application Execution. (Presentation)
- 12:00 - 12:30 Thomas Naughton, Christian Engelmann, Geoffroy Vallee, Ferrol Aderholdt and Stephen Scott. A Cooperative Approach to Virtual Machine Based Fault Injection. (Presentation)
- 12:30 - 12:45 Closing: Stephen L. Scott.
12:45 - 14:00 Lunch