As the number of components in computer systems has grown, so has the
need to handle component failure if applications are to use the full
capabilities of these systems. As system sizes grow explosively, fault
tolerance must be considered throughout the full stack, from the hardware
up to the application. The MPI Forum is currently
considering what changes to make to the MPI standard to deal with failure.
This talk will present the direction being taken by the MPI Forum's Fault
Tolerance working group for responding to failures.

Gregory A. Koenig is an R&D Associate at Oak Ridge National Laboratory
where his work involves developing scalable runtime systems and parallel
tools for ultrascale-class parallel computers. His interests also include
middleware for grid and on-demand/utility computing incorporating
technologies such as virtualization, fault detection and avoidance, and
resource scheduling.
He holds a PhD (2007) and MS (2003) in computer science from the
University of Illinois at Urbana-Champaign as well as three BS degrees
(mathematics, 1996; electrical engineering technology, 1995; computer science, 1993) from Indiana
University-Purdue University Fort Wayne.