5th Workshop on Resiliency in High Performance Computing (Resilience)
in Clusters, Clouds, and Grids
in conjunction with the
18th International European Conference on Parallel and
Distributed Computing (Euro-Par 2012),
Rhodes Island, Greece, August 27th - August 31st, 2012
Overview:
Clusters, Clouds, and Grids are three different computational paradigms
with the intent or potential to support High Performance Computing (HPC).
Currently, their hardware, management, and usage models are tailored to
different computational regimes; for example, high performance cluster
systems designed to support tightly coupled scientific simulation codes
typically utilize high-speed interconnects, while commercial cloud systems
designed to support software as a service (SaaS) do not. However, in order
to support HPC, all of them must utilize large numbers of resources, and
effective HPC in any of these paradigms must therefore address the issue
of resiliency at large scale.
Recent trends in HPC systems have clearly indicated that future increases in
performance, in excess of those resulting from improvements in single-processor
performance, will be achieved through corresponding increases in system scale,
i.e., using a significantly larger component count. As the raw computational
performance of these HPC systems increases from today's tera- and peta-scale to
next-generation multi peta-scale capability and beyond, their number of computational,
networking, and storage components will grow from the ten-to-one-hundred thousand
compute nodes of today's systems to several hundreds of thousands of compute nodes
and more in the foreseeable future. This substantial growth in system scale, and the
resulting component count, poses a challenge for HPC system and application software
with respect to fault tolerance and resilience.
Furthermore, recent experiences on extreme-scale HPC systems with non-recoverable soft
errors, i.e., bit flips in memory, cache, registers, and logic, have added another major
source of concern. The probability of such errors grows not only with system size, but
also with increasing architectural vulnerability caused by employing accelerators, such
as FPGAs and GPUs, and by shrinking nanometer technology. Reactive fault tolerance
technologies, such as checkpoint/restart, are unable to handle high failure rates due to
their associated overheads, while proactive resiliency technologies, such as migration,
simply fail because random soft errors cannot be predicted. Moreover, soft errors may
even remain undetected, resulting in silent data corruption.
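To make the reactive approach mentioned above concrete, the sketch below shows a minimal application-level checkpoint/restart loop in Python. It is a hypothetical illustration, not tied to any particular HPC library; the file name `state.ckpt`, the `interval` parameter, and the toy computation are all assumptions for the example. The checkpoint interval embodies the overhead trade-off the paragraph describes: checkpointing more often costs more I/O, while checkpointing less often means more lost work per failure.

```python
import json
import os

CHECKPOINT = "state.ckpt"  # hypothetical checkpoint file name


def save_checkpoint(step, state):
    # Write to a temporary file first, then rename atomically, so a crash
    # mid-write cannot corrupt the previously committed checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CHECKPOINT)


def load_checkpoint():
    # On restart, resume from the last committed checkpoint, if one exists.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            data = json.load(f)
        return data["step"], data["state"]
    return 0, 0  # fresh start


def run(total_steps=100, interval=10):
    step, state = load_checkpoint()
    while step < total_steps:
        state += step  # stand-in for one unit of real computation
        step += 1
        if step % interval == 0:
            # Reactive fault tolerance: pay I/O overhead now so that,
            # after a failure, at most `interval` steps must be redone.
            save_checkpoint(step, state)
    return state
```

At high failure rates the time spent writing checkpoints (plus the rework since the last one) can exceed the time spent on useful computation, which is why the text notes that checkpoint/restart alone cannot handle such rates.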
Important websites:
• Euro-Par 2012 at http://europar2012.cti.gr/
Prior conference websites:
• Resilience 2011 at http://xcr.cenit.latech.edu/resilience2011
• Resilience 2010 at http://xcr.cenit.latech.edu/resilience2010
• Resilience 2009 at http://xcr.cenit.latech.edu/resilience2009
• Resilience 2008 at http://xcr.cenit.latech.edu/resilience2008
Important Dates:
• Full paper submission deadline on June 15, 2012, 11:59PM US EDT
• Notification deadline on July 8, 2012
• Euro-Par conference on August 27 - August 31, 2012
• Camera ready deadline is after the workshop
Submission Guidelines:
Authors are invited to submit papers electronically in English in PDF
format via EasyChair at <https://www.easychair.org/conferences/?conf=resilience2012>.
Submitted manuscripts should be structured as technical papers and may
not exceed 10 pages, including figures, tables, and references, using
Springer's Lecture Notes in Computer Science (LNCS) format at
<http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0>.
Submissions should include abstract, key words and the e-mail address
of the corresponding author. Papers not conforming to these guidelines
may be returned without review. All manuscripts will be reviewed and
will be judged on correctness, originality, technical strength,
significance, quality of presentation, and interest and relevance to the
conference attendees. Submitted papers must represent original unpublished
research that is not currently under review for any other conference or
journal. Papers not following these guidelines will be rejected without
review and further action may be taken, including (but not limited to)
notifications sent to the heads of the institutions of the authors and
sponsors of the conference. Submissions received after the due date,
exceeding length limit, or not appropriately structured may also not be
considered. The proceedings will be published in Springer's LNCS as
post-conference proceedings. At least one author of an accepted paper must
register for and attend the workshop for inclusion in the proceedings.
Authors may contact the workshop program chair for more information.
Technical Program:
- 09:30-10:00:
  - "Welcome"
    *Stephen L. Scott*
    Tennessee Tech University/Oak Ridge National Laboratory (USA)
- 10:00-11:00 Invited Talks:
  - "Chaotic-identity Maps for Robustness Estimation of Exascale Computations"
    *Nageswara Rao*
    Oak Ridge National Laboratory (USA)
  - "Programming Model Extensions for Resilience at Extreme Scale"
    *Saurabh Hukerikar*
    University of Southern California (USA)
- 11:30-13:00 Papers:
  - "High Performance Reliable File Transfers Using Automatic Many-to-Many Parallelization"
    *Paul Kolano*
    NASA Ames Research Center (USA)
  - "A Reliability Model for Cloud Computing for High Performance Computing Applications"
    Thanadech Thanakornworakij, Raja Nassar, *Chokchai Leangsuksun*, and Mihaela Paun
    Louisiana Tech University (USA)
  - "The Viability of Using Compression to Decrease Message Log Sizes"
    Kurt Ferreira, Rolf Riesen, *Dorian Arnold*, Dewan Ibtesham, and Ron Brightwell
    Sandia National Laboratories (USA), IBM Research (Ireland), and University of New Mexico (USA)
- 14:30-16:00 Invited Talks:
  - "Recent Efforts in Fault Tolerant MPI Standardization"
    *Wesley Bland*
    University of Tennessee (USA)
  - "Does Partial Replication Pay Off?"
    *Dorian Arnold*
    University of New Mexico (USA)
  - "Resiliency: Going Forward"
    *Chokchai Leangsuksun*
    Louisiana Tech University (USA)
Topics of interest include, but are not limited to:
• Reports on current HPC system and application resiliency
• HPC resiliency metrics and standards
• HPC system and application resiliency analysis
• HPC system and application-level fault handling and anticipation
• HPC system and application health monitoring
• Resiliency for HPC file and storage systems
• System-level checkpoint/restart for HPC
• System-level migration for HPC
• Algorithm-based resiliency fundamentals for HPC (not Hadoop)
• Fault tolerant MPI concepts and solutions
• Soft error detection and recovery in HPC systems
• HPC system and application log analysis
• Statistical methods to identify failure root causes
• Fault injection studies in HPC environments
• High availability solutions for HPC systems
• Reliability and availability analysis
• Hardware for fault detection and recovery
• Resource management for system resiliency and availability
General Co-Chairs:
• Stephen L. Scott
Stonecipher/Boeing Distinguished Professor of Computing
Senior Research Scientist - Systems Research Team
Tennessee Tech University and Oak Ridge National Laboratory, USA
scottsl@ornl.gov
• Chokchai (Box) Leangsuksun
SWEPCO Endowed Associate Professor of Computer Science
Louisiana Tech University, USA
box@latech.edu
Program Chair:
• Christian Engelmann
Oak Ridge National Laboratory, USA
engelmannc@ornl.gov
Publication Co-Chairs:
• James Brandt
Sandia National Laboratories, USA
brandt@sandia.gov
• Ann Gentile
Sandia National Laboratories, USA
gentile@sandia.gov
Program Committee:
• Vassil Alexandrov, Barcelona Supercomputing Center, Spain
• David E. Bernholdt, Oak Ridge National Laboratory, USA
• George Bosilca, University of Tennessee, USA
• Jim Brandt, Sandia National Laboratories, USA
• Patrick G. Bridges, University of New Mexico, USA
• Greg Bronevetsky, Lawrence Livermore National Laboratory, USA
• Franck Cappello, INRIA/UIUC, France/USA
• Zizhong Chen, Colorado School of Mines, USA
• Nathan DeBardeleben, Los Alamos National Laboratory, USA
• Christian Engelmann, Oak Ridge National Laboratory, USA
• Yung-Chin Fang, Dell, USA
• Kurt B. Ferreira, Sandia National Laboratories, USA
• Ann Gentile, Sandia National Laboratories, USA
• Cecile Germain, University Paris-Sud, France
• Rinku Gupta, Argonne National Laboratory, USA
• Paul Hargrove, Lawrence Berkeley National Laboratory, USA
• Xubin He, Virginia Commonwealth University, USA
• Daniel S. Katz, University of Chicago, USA
• Larry Kaplan, Cray, USA
• Thilo Kielmann, Vrije Universiteit Amsterdam, Netherlands
• Dieter Kranzlmueller, LMU/LRZ Munich, Germany
• Chokchai (Box) Leangsuksun, Louisiana Tech University, USA
• Xiaosong Ma, North Carolina State University, USA
• Celso Mendes, University of Illinois at Urbana Champaign, USA
• Christine Morin, INRIA Rennes, France
• Thomas Naughton, Oak Ridge National Laboratory, USA
• George Ostrouchov, Oak Ridge National Laboratory, USA
• Mihaela Paun, Louisiana Tech University, USA
• Alexander Reinefeld, Zuse Institute Berlin, Germany
• Rolf Riesen, IBM Research, Ireland
• Stephen L. Scott, Oak Ridge National Laboratory, USA
• Gregory M. Thorson, SGI, USA
• Geoffroy Vallee, Oak Ridge National Laboratory, USA
• Sudharshan Vazhkudai, Oak Ridge National Laboratory, USA
• Chao Wang, Oak Ridge National Laboratory, USA