3rd Workshop on Resiliency in High Performance Computing (Resilience)
in Clusters, Clouds, and Grids

Overview:

Clusters, Clouds, and Grids are three different computational paradigms with the intent or potential to support High Performance Computing (HPC). Currently, they consist of hardware, management, and usage models particular to different computational regimes; e.g., high-performance cluster systems designed to support tightly coupled scientific simulation codes typically utilize high-speed interconnects, while commercial cloud systems designed to support software as a service (SaaS) do not. However, in order to support HPC, all must at least utilize large numbers of resources, and hence effective HPC in any of these paradigms must address the issue of resiliency at large scale.

Recent trends in HPC systems have clearly indicated that future increases in performance, in excess of those resulting from improvements in single-processor performance, will be achieved through corresponding increases in system scale, i.e., using a significantly larger component count. As the raw computational performance of these HPC systems increases from today's tera- and peta-scale to next-generation multi-petascale capability and beyond, their number of computational, networking, and storage components will grow from the ten-to-one-hundred thousand compute nodes of today's systems to several hundreds of thousands of compute nodes and more in the foreseeable future. This substantial growth in system scale, and the resulting component count, poses a challenge for HPC system and application software with respect to fault tolerance and resilience.
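
To put the scaling concern in rough numbers, the short sketch below (Python, illustrative only) assumes independent, exponentially distributed node failures, under which the system-level mean time between failures (MTBF) shrinks roughly in proportion to the node count. The five-year node MTBF used here is a placeholder assumption, not a measured figure.

    # Illustration only: system MTBF under independent node failures.
    def system_mtbf_hours(node_mtbf_hours, num_nodes):
        """System MTBF when num_nodes identical nodes fail independently."""
        return node_mtbf_hours / num_nodes

    node_mtbf = 5.0 * 365 * 24   # assumed node MTBF of about 5 years (placeholder)

    for nodes in (10000, 100000, 500000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        print("%7d nodes -> system MTBF of about %.2f hours" % (nodes, mtbf))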

Furthermore, recent experiences on extreme-scale HPC systems with non-recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic, have added another major source of concern. The probability of such errors grows not only with system size, but also with increasing architectural vulnerability caused by employing accelerators, such as FPGAs and GPUs, and by shrinking nanometer technology. Reactive fault tolerance technologies, such as checkpoint/restart, are unable to handle high failure rates due to their associated overheads, while proactive resiliency technologies, such as migration, simply fail because random soft errors cannot be predicted. Moreover, soft errors may even remain undetected, resulting in silent data corruption.
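
To make the overhead argument concrete, the sketch below applies Young's first-order approximation for the optimal checkpoint interval, t_opt ≈ sqrt(2 * C * M), where C is the time to write one checkpoint and M is the system MTBF. The 10-minute checkpoint cost and the MTBF values are illustrative assumptions only, not measurements from any particular system.

    import math

    def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
        """Young's approximation: optimal interval and fraction of time lost."""
        t_opt = math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)
        # Lost time = writing checkpoints (C / t) plus expected rework after
        # a failure (t / (2 * M)); the two terms are equal at t_opt.
        overhead = checkpoint_cost_s / t_opt + t_opt / (2.0 * mtbf_s)
        return t_opt, overhead

    checkpoint_cost = 600.0   # assumed 10 minutes per checkpoint (placeholder)
    for mtbf_hours in (24.0, 4.0, 1.0):
        t_opt, overhead = optimal_checkpoint_interval(checkpoint_cost, mtbf_hours * 3600.0)
        print("MTBF %4.1f h: checkpoint every %5.1f min, about %2.0f%% of time lost"
              % (mtbf_hours, t_opt / 60.0, overhead * 100.0))

Even in this optimistic first-order model, the fraction of machine time lost to checkpointing and rework grows rapidly as the MTBF drops toward the checkpoint interval itself, which is the regime described above.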

Resilience 2010 is the follow-on workshop to the successful Resilience 2009 held with HPDC in Munich, Germany, and the earlier Resilience 2008 held in conjunction with CCGrid in Lyon, France.


Tech Program:

09:00 AM – 09:30 AM   Welcome/Introduction: Christian Engelmann, Workshop Program Chair

09:30 AM – 10:00 AM   "Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU" Imran S. Haque and Vijay S. Pande

10:00 AM – 10:30 AM   "Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems" James Brandt, Ann Gentile, Frank Chen, Vincent De Sapio, Jackson Mayo, Philippe Pebay, Diana Roe, David Thompson and Matthew Wong

10:30 AM – 11:00 AM   Break

11:00 AM – 11:30 AM   "Team-based Message Logging" Esteban Meneses, Celso Mendes and Laxmikant Kale

11:30 AM – 12:00 PM   "Selective Recovery From Failures In A Task Parallel Programming Model" James Dinan, Sriram Krishnamoorthy, Arjun Singri and P. Sadayappan

12:00 PM – 12:50 PM   Discussion: "Towards Resilience Standardization" Chokchai (Box) Leangsuksun, Workshop Co-Chair

12:50 PM – 01:00 PM   Closing: Christian Engelmann, Workshop Program Chair

01:00 PM – 02:00 PM   Lunch


Important websites:

Resilience 2010  : http://xcr.cenit.latech.edu/resilience2010
CCGrid 2010      : http://www.manjrasoft.com/ccgrid2010

Prior workshop websites:
Resilience 2009 : http://xcr.cenit.latech.edu/resilience2009
Resilience 2008 : http://xcr.cenit.latech.edu/resilience2008


Important Dates:

• Paper submission deadline: December 30, 2009 (extended)
• Notification deadline: January 11, 2010
• Camera-ready deadline: February 5, 2010


Submission Guidelines:
Authors are invited to submit papers electronically. Submitted manuscripts should be structured as technical papers and may not exceed 6 letter-size (8.5 x 11 inch) pages, including figures, tables, and references, using the IEEE format for conference proceedings (print area of 6-1/2 inches (16.51 cm) wide by 8-7/8 inches (22.51 cm) high, two-column format with columns 3-1/16 inches (7.85 cm) wide with a 3/8 inch (0.81 cm) space between them, single-spaced 10-point Times fully justified text). Submissions not conforming to these guidelines may be returned without review. Authors should submit the manuscript in PDF format and make sure that the file will print on a printer that uses letter-size (8.5 x 11 inch) paper. The official language of the meeting is English. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees.

Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review, and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and to the sponsors of the conference. Submissions received after the due date, exceeding the length limit, or not appropriately structured may also not be considered. At least one author of an accepted paper must register for and attend the workshop. Authors may contact the workshop program chair for more information. The proceedings will be published through the IEEE Computer Society Press, USA, and will be made available online through the IEEE Digital Library.

Papers should be submitted electronically in the IEEE conference proceedings style as PDF to the workshop submission Web site at <http://www.easychair.org/conferences/?conf=resilience2010>. For manuscript preparation with LaTeX, use either the newer unofficial CTAN package from <http://www.ctan.org/tex-archive/macros/latex/contrib/IEEEconf> or the older official IEEE conference proceedings template available at <ftp://pubftp.computer.org/Press/Outgoing/proceedings/IEEE_CS_Latex8.5x11.zip>.
For Microsoft Word, use the official proceedings template available at <ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.doc>.


Topics of interest include, but are not limited to:
Reports on current HPC system and application resiliency
HPC resiliency metrics and standards
HPC system and application resiliency analysis
HPC system and application-level fault handling and anticipation
HPC system and application health monitoring
Resiliency for HPC file and storage systems
System-level checkpoint/restart for HPC
System-level migration for HPC
Algorithm-based resiliency fundamentals for HPC (not Hadoop)
Fault tolerant MPI concepts and solutions
Soft error detection and recovery in HPC systems
HPC system and application log analysis
Statistical methods to identify failure root causes
Fault injection studies in HPC environments
High availability solutions for HPC systems
Reliability and availability analysis
Hardware for fault detection and recovery
Resource management for system resiliency and availability


General Co-Chairs:
Stephen L. Scott
Computer Science and Mathematics Division
Oak Ridge National Laboratory, USA
scottsl@ornl.gov
 
Chokchai (Box) Leangsuksun
SWEPCO Endowed Associate Professor of Computer Science
Louisiana Tech University, USA
box@latech.edu


Program Chair:
Christian Engelmann
Computer Science and Mathematics Division
Oak Ridge National Laboratory, USA
engelmannc@ornl.gov


Publication Co-Chairs:
James Brandt
Sandia National Laboratories, USA
brandt@sandia.gov

Ann Gentile
Sandia National Laboratories, USA
gentile@sandia.gov



Program Committee:
James Brandt, Sandia National Laboratories, USA
George Bosilca, University of Tennessee, USA
Aurelien Bouteiller, University of Tennessee, USA
Greg Bronevetsky, Lawrence Livermore National Laboratory, USA
Franck Cappello, INRIA Paris, France
Kasidit Chanchio, Thammasat University, Thailand
Zizhong Chen, Colorado School of Mines, USA
Walfredo Cirne, Google / Universidade Federal de Campina Grande, Brazil
John T. Daly, Department of Defense, USA
Nathan DeBardeleben, Los Alamos National Laboratory, USA
Christian Engelmann, Oak Ridge National Laboratory, USA
Yung-Chin Fang, Dell, USA
Ann Gentile, Sandia National Laboratories, USA
Paul Hargrove, Lawrence Berkeley National Laboratory, USA
Xubin He, Tennessee Tech University, USA
Daniel S. Katz, University of Chicago, USA
Thilo Kielmann, Vrije Universiteit Amsterdam, Netherlands
Dieter Kranzlmueller, LMU/LRZ Munich, Germany
Zhiling Lan, Illinois Institute of Technology, USA
Chokchai (Box) Leangsuksun, Louisiana Tech University, USA
Xiaosong Ma, North Carolina State University, USA
Celso Mendes, University of Illinois at Urbana Champaign, USA
Christine Morin, INRIA Rennes, France
Frank Mueller, North Carolina State University, USA
Thomas Naughton, Oak Ridge National Laboratory, USA
George Ostrouchov, Oak Ridge National Laboratory, USA
Li Ou, Dell, USA
DK Panda, The Ohio State University, USA
Mihaela Paun, Louisiana Tech University, USA
Alexander Reinefeld, Zuse Institute Berlin, Germany
Rolf Riesen, Sandia National Laboratories, USA
Stephen L. Scott, Oak Ridge National Laboratory, USA
Dan Stanzione, Texas Advanced Computing Center, USA
Jon Stearley, Sandia National Laboratories, USA
Xian-He Sun, Illinois Institute of Technology, USA
Gregory M. Thorson, SGI, USA
Geoffroy Vallee, Oak Ridge National Laboratory, USA
Sudharshan Vazhkudai, Oak Ridge National Laboratory, USA