Abstract
As the number of nodes in high-performance computing environments keeps
increasing, faults are becoming commonplace. Frequently deployed
checkpoint/restart mechanisms generally require a complete restart. Yet
some node failures can be anticipated by detecting deteriorating health
status in today's systems, an opportunity that proactive fault tolerance
(FT) can exploit. Our work proposes novel, scalable mechanisms in support
of proactive FT and significant enhancements to reactive FT. The
contributions are three-fold. First, we provide a transparent job pause
service that allows live nodes to remain active and roll back to the last
checkpoint while failed nodes are dynamically replaced by spares, after
which the job resumes from that checkpoint. Second, we complement reactive
FT with proactive FT through a process-level live migration mechanism that
supports continued execution of an application during much of the
migration. Third, we
develop incremental checkpointing techniques that capture only the data
changed since the last checkpoint, reducing the cost of reactive FT.
Bio
Dr. Stephen L. Scott is a Senior Research Scientist and team leader of the
System Software Research Team in the Computer Science Group of the
Computer Science and Mathematics Division at the Oak Ridge National
Laboratory (ORNL). Dr. Scott's research interest is in experimental
systems with a focus on high-performance distributed, heterogeneous, and
parallel computing. He is a founding member of the Open Cluster Group
(OCG) and Open Source Cluster Application Resources (OSCAR). Within this
organization, he has served as the OCG steering committee chair, as the
OSCAR release manager, and as working group chair. Dr. Scott is the
project lead principal investigator for the Reliability, Availability and
Serviceability (RAS) for Petascale High-End Computing research team. This
multi-institution research effort, funded by the Department of Energy -
Office of Science, concentrates on adaptive, reliable, and efficient
operating and runtime system solutions for ultra-scale scientific high-end
computing (HEC) as part of the Forum to Address Scalable Technology for
Runtime and Operating Systems (FAST-OS). Dr. Scott is also principal
investigator of a project investigating techniques in virtualized system
environments for petascale computing and is Co-PI of a related storage
effort, funded by the National Science Foundation, which is investigating
the advantages of storage virtualization in petascale computing
environments. Dr. Scott serves on a number of scientific advisory boards
and is presently serving as the chair of the international Scientific
Advisory Committee for the European Commission's XtreemOS project. Dr. Scott
has published over 100 peer-reviewed papers in the areas of parallel,
cluster, and distributed computing and holds both a Ph.D. and an M.S. in
computer science.