3rd Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids

Overview:
Clusters, Clouds, and Grids are three different computational
paradigms with the intent or potential to support High Performance
Computing (HPC). Currently, they consist of hardware, management, and
usage models particular to different computational regimes. For example, high
performance cluster systems designed to support tightly coupled
scientific simulation codes typically utilize high-speed interconnects,
whereas commercial cloud systems designed to support software as a service
(SaaS) do not. However, in order to support HPC, all must at least
utilize large numbers of resources, and hence effective HPC in any of
these paradigms must address the issue of resiliency at large scale.

Resilience 2010 is the follow-on workshop to the successful
Resilience 2009 held with HPDC in Munich, Germany, and the earlier
Resilience 2008 held in conjunction with CCGrid in Lyon, France.

Tech Program:
9:00 AM – 9:30 AM Welcome/Introduction
Christian Engelmann, Workshop Program Chair

Authors are invited to submit papers electronically. Submitted manuscripts should be structured as technical papers and may not exceed 6 letter-size (8.5 x 11 inch) pages including figures, tables and references using the IEEE format for conference proceedings (print area of 6-1/2 inches (16.51 cm) wide by 8-7/8 inches (22.51 cm) high, two-column format with columns 3-1/16 inches (7.85 cm) wide with a 3/8 inch (0.81 cm) space between them, single-spaced 10-point Times fully justified text). Submissions not conforming to these guidelines may be returned without review. Authors should submit the manuscript in PDF format and make sure that the file will print on a printer that uses letter-size (8.5 x 11 inch) paper. The official language of the meeting is English.

All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding the length limit, or not appropriately structured may also not be considered.

At least one author of an accepted paper must register for and attend the workshop. Authors may contact the workshop program chair for more information. The proceedings will be published through the IEEE Computer Society Press, USA and will be made available online through the IEEE Digital Library.

Papers should be submitted electronically in the IEEE conference proceedings style as PDF to the workshop submission Web site at <http://www.easychair.org/conferences/?conf=resilience2010>.

For manuscript preparation with LaTeX, use the newer unofficial template on CTAN at <http://www.ctan.org/tex-archive/macros/latex/contrib/IEEEconf> or the older official IEEE conference proceedings template available at <ftp://pubftp.computer.org/Press/Outgoing/proceedings/IEEE_CS_Latex8.5x11.zip>. For Microsoft Word, use the official proceedings template available at <ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.doc>.

Topics of interest include, but are not limited to:
• Reports on current HPC system and application resiliency
• HPC resiliency metrics and standards
• HPC system and application resiliency analysis
• HPC system and application-level fault handling and anticipation
• HPC system and application health monitoring
• Resiliency for HPC file and storage systems
• System-level checkpoint/restart for HPC
• System-level migration for HPC
• Algorithm-based resiliency fundamentals for HPC (not Hadoop)
• Fault tolerant MPI concepts and solutions
• Soft error detection and recovery in HPC systems
• HPC system and application log analysis
• Statistical methods to identify failure root causes
• Fault injection studies in HPC environments
• High availability solutions for HPC systems
• Reliability and availability analysis
• Hardware for fault detection and recovery
• Resource management for system resiliency and availability

General Co-Chairs:
• Stephen L. Scott, Computer Science and Mathematics Division, Oak Ridge National Laboratory, USA, scottsl@ornl.gov
• Chokchai (Box) Leangsuksun, SWEPCO Endowed Associate Professor of Computer Science, Louisiana Tech University, USA, box@latech.edu

Program Chair:
• Christian Engelmann, Computer Science and Mathematics Division, Oak Ridge National Laboratory, USA, engelmannc@ornl.gov

Publication Co-Chairs:
• James Brandt, Sandia National Laboratories, USA, brandt@sandia.gov
• Ann Gentile, Sandia National Laboratories, USA, gentile@sandia.gov

Program Committee:
James Brandt, Sandia National Laboratories, USA
George Bosilca, University of Tennessee, USA
Aurelien Bouteiller, University of Tennessee, USA
Greg Bronevetsky, Lawrence Livermore National Laboratory, USA
Franck Cappello, INRIA Paris, France
Kasidit Chanchio, Thammasat University, Thailand
Zizhong Chen, Colorado School of Mines, USA
Walfredo Cirne, Google / Universidade Federal de Campina Grande, Brazil
John T. Daly, Department of Defense, USA
Nathan DeBardeleben, Los Alamos National Laboratory, USA
Christian Engelmann, Oak Ridge National Laboratory, USA
Yung-Chin Fang, Dell, USA
Ann Gentile, Sandia National Laboratories, USA
Paul Hargrove, Lawrence Berkeley National Laboratory, USA
Xubin He, Tennessee Tech University, USA
Daniel S. Katz, University of Chicago, USA
Thilo Kielmann, Vrije Universiteit Amsterdam, Netherlands
Dieter Kranzlmueller, LMU/LRZ Munich, Germany
Zhiling Lan, Illinois Institute of Technology, USA
Chokchai (Box) Leangsuksun, Louisiana Tech University, USA
Xiaosong Ma, North Carolina State University, USA
Celso Mendes, University of Illinois at Urbana Champaign, USA
Christine Morin, INRIA Rennes, France
Frank Mueller, North Carolina State University, USA
Thomas Naughton, Oak Ridge National Laboratory, USA
George Ostrouchov, Oak Ridge National Laboratory, USA
Li Ou, Dell, USA
DK Panda, The Ohio State University, USA
Mihaela Paun, Louisiana Tech University, USA
Alexander Reinefeld, Zuse Institute Berlin, Germany
Rolf Riesen, Sandia National Laboratories, USA
Stephen L. Scott, Oak Ridge National Laboratory, USA
Dan Stanzione, Texas Advanced Computing Center, USA
Jon Stearley, Sandia National Laboratories, USA
Xian-He Sun, Illinois Institute of Technology, USA
Gregory M. Thorson, SGI, USA
Geoffroy Vallee, Oak Ridge National Laboratory, USA
Sudharshan Vazhkudai, Oak Ridge National Laboratory, USA