Towards Support for Fault Tolerance in the MPI Standard
Greg Koenig, Oak Ridge National Laboratory

Abstract
As the number of components in computer systems has grown, so has the need for applications to cope with component failure. As system sizes explode, fault tolerance must be considered through the full stack, from the hardware up to the application, if applications are to use the full capabilities of these emerging systems. The MPI Forum is currently considering what changes to make to the MPI standard to deal with failure. This talk will present the direction being taken by the MPI Forum's Fault Tolerance working group for responding to failures.

Bio
Gregory A. Koenig is an R&D Associate at Oak Ridge National Laboratory where his work involves developing scalable runtime systems and parallel tools for ultrascale-class parallel computers. His interests also include middleware for grid and on-demand/utility computing incorporating technologies such as virtualization, fault detection and avoidance, and resource scheduling. He holds a PhD (2007) and MS (2003) in computer science from the University of Illinois at Urbana-Champaign as well as three BS degrees (mathematics, 1996; electrical engineering technology, 1995; computer science, 1993) from Indiana University-Purdue University Fort Wayne.
