Process-Level Fault Tolerance for Job Healing in HPC Environments
Stephen L. Scott, Oak Ridge National Laboratory

Abstract
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Commonly deployed checkpoint/restart mechanisms generally require a complete restart of the job. Yet some node failures can be anticipated by detecting a deteriorating health status in today's systems, which proactive fault tolerance (FT) can exploit. Our work proposes novel, scalable mechanisms in support of proactive FT and significant enhancements to reactive FT. The contributions are threefold. First, we provide a transparent job pause service that lets live nodes remain active and roll back to the last checkpoint while failed nodes are dynamically replaced by spares; the job then resumes from that checkpoint. Second, we complement reactive FT with proactive FT through a process-level live migration mechanism that supports continued execution of an application during much of the migration. Third, we develop incremental checkpointing techniques that capture only the data changed since the last checkpoint, reducing the cost of reactive FT.
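
As a rough illustration of the third contribution, the sketch below saves only the blocks of a memory image that changed since the previous checkpoint. It is a toy model under stated assumptions, not the mechanism presented in the talk: the function names, the 4096-byte block size, and the use of content hashes to detect changes are all hypothetical choices made for this example, and the image is assumed never to shrink between checkpoints.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block granularity, chosen for illustration


def block_hashes(memory: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each fixed-size block of the memory image."""
    return [
        hashlib.sha256(memory[i:i + block_size]).hexdigest()
        for i in range(0, len(memory), block_size)
    ]


def incremental_checkpoint(memory: bytes, prev_hashes: list[str],
                           block_size: int = BLOCK_SIZE):
    """Return (delta, new_hashes).

    delta maps block index -> block contents for every block whose hash
    differs from the previous checkpoint, or that did not exist before.
    """
    new_hashes = block_hashes(memory, block_size)
    delta = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            delta[idx] = memory[idx * block_size:(idx + 1) * block_size]
    return delta, new_hashes


def restore(full: bytes, deltas: list[dict[int, bytes]],
            block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the latest image from a full checkpoint plus ordered deltas."""
    blocks = [full[i:i + block_size] for i in range(0, len(full), block_size)]
    for delta in deltas:
        for idx, data in sorted(delta.items()):
            if idx < len(blocks):
                blocks[idx] = data      # overwrite a changed block
            else:
                blocks.append(data)     # image grew since the last checkpoint
    return b"".join(blocks)
```

In this sketch, a restart reads the last full checkpoint and replays the deltas in order, so the saving in checkpoint size is paid for with a somewhat longer restore.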

Bio
Dr. Stephen L. Scott is a Senior Research Scientist and team leader of the System Software Research Team in the Computer Science Group of the Computer Science and Mathematics Division at the Oak Ridge National Laboratory (ORNL). Dr. Scott’s research interest is in experimental systems with a focus on high-performance distributed, heterogeneous, and parallel computing. He is a founding member of the Open Cluster Group (OCG) and Open Source Cluster Application Resources (OSCAR). Within this organization, he has served as the OCG steering committee chair, as the OSCAR release manager, and as working group chair. Dr. Scott is the project lead principal investigator for the Reliability, Availability and Serviceability (RAS) for Petascale High-End Computing research team. This multi-institution research effort, funded by the Department of Energy - Office of Science, concentrates on adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale scientific high-end computing (HEC) as part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS). Dr. Scott is also principal investigator of a project investigating techniques in virtualized system environments for petascale computing and is Co-PI of a related storage effort, funded by the National Science Foundation, which is investigating the advantages of storage virtualization in petascale computing environments. Dr. Scott serves on a number of scientific advisory boards and currently chairs the international Scientific Advisory Committee for the European Commission's XtreemOS project. He has published over 100 peer-reviewed papers in the areas of parallel, cluster, and distributed computing and holds both a Ph.D. and an M.S. in computer science.
