System-level Checkpoint/Restart with BLCR
Paul Hargrove, Lawrence Berkeley National Laboratory

Berkeley Lab Checkpoint/Restart (BLCR) is a DOE-funded effort to produce a production-quality system-level checkpointing implementation suitable for use in preemptive scheduling, migration, and fault tolerance. The BLCR implementation work is part of a larger multi-institution effort to define a "Fault Tolerance Backplane" (FTB) for HPC platforms, and to provide implementations of the system components that interact with the FTB (including batch scheduler, checkpointer, and MPI implementations, among others). This talk will describe the goals of BLCR, its status, and its future directions.

Paul H. Hargrove has been a full-time Principal Investigator at Lawrence Berkeley National Laboratory since September 2000, and since June 2005 has held an appointment in the Computer Science Division at the University of California, Berkeley. His current research interests include checkpoint/restart for Linux and high-performance cluster networks such as InfiniBand. Current projects include Berkeley Lab Checkpoint/Restart (BLCR) for Linux, Global Address Space Networking (GASNet), and Berkeley Unified Parallel C (UPC).
