Abstract
In order to address anticipated high failure rates, resiliency
characteristics have become an urgent priority for next-generation
high-performance computing (HPC) systems. One major source of concern is
non-recoverable soft errors, i.e., bit flips in memory, cache, registers,
and logic. The probability of such errors grows not only with system size,
but also with increasing architectural vulnerability caused by employing
accelerators and by shrinking nanometer technology. Reactive fault
tolerance technologies, such as checkpoint/restart, are unable to handle
high failure rates due to associated overheads, while proactive resiliency
technologies, such as preemptive migration, simply fail because random soft
errors cannot be predicted. This talk proposes a new, bold direction in
resiliency for HPC: it targets next-generation extreme-scale HPC systems
at the system software level through computational redundancy strategies,
i.e., dual- and triple-modular redundancy.
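To make the redundancy idea concrete, the following is a minimal sketch of how replicated results could be compared. It is an illustration only, not the mechanism presented in the talk: the function name `vote` and the status labels are hypothetical. With two replicas (dual-modular redundancy) a mismatch can only be detected, while with three replicas (triple-modular redundancy) a single corrupted result can be outvoted.

```python
from collections import Counter


def vote(results):
    """Compare results from redundant replicas (hypothetical helper).

    Returns (value, status), where status is:
      "agree"     - all replicas produced the same result
      "corrected" - a strict majority agrees; the minority is outvoted
      "detected"  - no strict majority; the error is detected only
    Results must be hashable so they can be counted.
    """
    counts = Counter(results)
    value, n = counts.most_common(1)[0]
    if n == len(results):
        return value, "agree"
    if n > len(results) / 2:
        return value, "corrected"
    return None, "detected"


# Dual-modular redundancy: a mismatch is detected but cannot be corrected.
dmr = vote([42, 43])

# Triple-modular redundancy: one corrupted replica is outvoted.
tmr = vote([42, 7, 42])
```

In this sketch, a "detected" outcome would have to be handled by some other recovery step, such as re-executing the computation.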
Bio
Christian Engelmann is an R&D Staff Member in the System Research Team
of the Computer Science Research Group in the Computer Science and
Mathematics Division at the Oak Ridge National Laboratory (ORNL). He holds
an MSc in Computer Science from the University of Reading and an MSc in
Computer Systems Engineering from the Technical College for Engineering
and Economics (FHTW) Berlin. As part of his research activities at ORNL,
Christian is currently pursuing a PhD in Computer Science at the
University of Reading. His research aims at providing high-level
reliability, availability, and serviceability for next-generation
supercomputers to improve their resiliency (and ultimately efficiency)
with novel high availability and fault tolerance system software
solutions. Another research area concentrates on "plug-and-play"
supercomputing, where transparent portability eliminates most of the
software modifications caused by diverse platforms and system upgrades.