Modular Redundancy in HPC Systems: Why, Where, When and How?
Christian Engelmann, Oak Ridge National Laboratory

To address anticipated high failure rates, resiliency has become an urgent priority for next-generation high-performance computing (HPC) systems. One major source of concern is non-recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic. The probability of such errors grows not only with system size, but also with increasing architectural vulnerability caused by employing accelerators and by shrinking nanometer technology. Reactive fault tolerance technologies, such as checkpoint/restart, are unable to handle high failure rates due to their associated overheads, while proactive resiliency technologies, such as preemptive migration, simply fail because random soft errors cannot be predicted. This talk proposes a new, bold direction for HPC resiliency: targeting next-generation extreme-scale HPC systems at the system software level through computational redundancy strategies, i.e., dual- and triple-modular redundancy.

Christian Engelmann is an R&D Staff Member in the System Research Team of the Computer Science Research Group in the Computer Science and Mathematics Division at the Oak Ridge National Laboratory (ORNL). He holds an MSc in Computer Science from the University of Reading and an MSc in Computer Systems Engineering from the Technical College for Engineering and Economics (FHTW) Berlin. As part of his research activities at ORNL, Christian is currently pursuing a PhD in Computer Science at the University of Reading. His research aims at providing high-level reliability, availability, and serviceability for next-generation supercomputers to improve their resiliency (and ultimately efficiency) with novel high availability and fault tolerance system software solutions. Another research area concentrates on "plug-and-play" supercomputing, where transparent portability eliminates most of the software modifications caused by diverse platforms and system upgrades.
