Efficient and Flexible Fault Tolerance and Migration of Scientific Simulations Using CUMULVS

2nd SIGMETRICS Symposium on Parallel and Distributed Tools, Welches, OR

8/3/98



Table of Contents

Efficient and Flexible Fault Tolerance and Migration of Scientific Simulations Using CUMULVS

Motivation

(Collaborative User Migration, User Library for Visualization and Steering)

CUMULVS Approach

Why Instrument (Non-Transparent)?

Why Does the User Need to Help?

Identifying Program State

Checkpoint Consistency (Yuk…)

And For Your Trouble...

Rollback versus Restart…

Run-Time System Architecture

Checkpointing API

Example Instrumentation: CUMULVS Initialization

Example Instrumentation: Data Field Description

Example Instrumentation: Restart from a Checkpoint

Example Instrumentation: Periodic Handling - Restart

Example Instrumentation: Periodic Handling - Rollback

Example Instrumentation: Finished Checkpointing

Case Study I - Seismic Simulation: Finite Difference Approximation

Case Study II - Air Flow Over Wing: Computational Fluid Dynamics (CFD)

Instrumentation Cost

Checkpointing Overhead

Summary

Author: James Arthur Kohl

Email: kohl@msr.epm.ornl.gov

CUMULVS Home Page: http://www.epm.ornl.gov/cs/cumulvs.html