Resilience: Sacrificing Previous Convictions About Physical Laws
John T. Daly, Los Alamos National Laboratory

Abstract
Without a new paradigm of resilience, fleshed out through methods and tools that keep the application running in spite of underlying component failures, the end-to-end performance of extreme-scale compute platforms will plateau and eventually decline as a result of increasing interrupt rates. We will attempt to examine these concepts rigorously and demonstrate in quantifiable ways why approaches emerging from traditional paradigms of reliability will not continue to be cost-effective or power-efficient means of keeping large applications running.

Bio
John Daly is a technical staff member in the HPC division at Los Alamos National Laboratory. His research interests include application fault tolerance, system reliability, application resilience, calculational correctness, and mathematical modeling of application throughput. John is experienced in porting large-scale simulations to a variety of architectures and running them at scale, as well as in developing metrics and methods for measuring and enhancing HPC utilization. He has accumulated in excess of 40 million processor hours of compute time on Red Storm, Purple, and BG/L. John holds degrees in engineering from Caltech and Princeton University, where he studied computational fluid dynamics under Antony Jameson. He has also worked as an application analyst and software developer for Raytheon Intelligence and Information Systems.
