Resilience: Sacrificing Previous Convictions About Physical Laws
John T. Daly, Los Alamos National Laboratory
Abstract
Without a new paradigm of resilience, fleshed out through methods and tools
that keep the application running in spite of underlying component failures,
the end-to-end performance of extreme scale compute platforms will plateau
and eventually decline as a result of the rising interrupt rates those
failures produce. We will examine these concepts rigorously and demonstrate,
in quantifiable ways, why approaches emerging from traditional paradigms of
reliability will not remain cost-effective or power-efficient means of
keeping large applications running.
Bio
John Daly is a technical staff member in the HPC division at the Los Alamos
National Laboratory. His research interests include application fault
tolerance, system reliability, application resilience, calculational
correctness, and mathematical modeling of application throughput. John has
experience porting large-scale simulations to a variety of architectures,
running them at scale, and developing metrics and methods for measuring and
enhancing HPC utilization. He has accumulated more than 40 million
processor-hours on Red Storm, Purple, and BG/L. John holds degrees
in engineering from Caltech and Princeton University, where he studied
computational fluid dynamics under Antony Jameson. He has also worked as an
application analyst and software developer for Raytheon Intelligence and
Information Systems.