Abstract
Berkeley Lab Checkpoint/Restart (BLCR, http://ftg.lbl.gov/checkpoint) is a
DOE-funded effort to produce a production-quality system-level
checkpointing implementation suitable for use
in preemptive scheduling, migration and fault tolerance. The BLCR
implementation work is part of a larger multi-institution effort to define
a "Fault Tolerance Backplane" (FTB) for HPC
platforms, and to provide implementations of the system components that
interact with the FTB (including batch scheduler, checkpointer, and MPI
implementations among others). This talk will describe the goals of BLCR, its status, and its future
directions.
Bio
Paul H. Hargrove has been a full-time Principal Investigator at Lawrence
Berkeley National Laboratory since September 2000, and since June 2005 has
held an appointment in the Computer Science Division at the University of
California, Berkeley. His current research interests include
checkpoint/restart for Linux and high-performance cluster networks such
as InfiniBand. Current projects include Berkeley Lab Checkpoint/Restart
(BLCR) for Linux, Global Address Space Networking (GASNet), and Berkeley
Unified Parallel C (UPC).