Root Cause Analysis
Jon Stearley, Sandia National Laboratories
[Slides]

Abstract
Because the functional interdependencies among components are numerous, complex, and dynamic, determining the root cause of failures on HPC systems requires extensive knowledge, unwavering tenacity, and often a good "hunch". On future systems, however, the difficulty of this task grows not simply with the increasing number of components, but combinatorially with their interdependencies. Furthermore, as global checkpoint/restart overheads increase, a focused response to faults becomes more important, and such a response requires root cause determination. Consider a supercomputer as a graph where vertices are components (hardware or software), edges are dependencies (physical or functional), and labels are symptomatic factors (text, numeric thresholds, waveforms, etc.). Is this model useful for determining the root cause of failures within HPC systems, to the benefit of human or automated responders?
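
To make the graph model concrete, here is a minimal Python sketch. It is an illustration only, not the method presented in the talk. The class name `SystemGraph`, its methods, the component names, and the symptom strings are all hypothetical. The heuristic is also an assumption: a symptomatic component is treated as a root-cause candidate when nothing it depends on, directly or transitively, is itself symptomatic. The sketch further assumes the dependency graph is acyclic.

```python
from collections import defaultdict


class SystemGraph:
    """Hypothetical model: vertices are components, edges are dependencies,
    and labels are lists of observed symptoms attached to components."""

    def __init__(self):
        self.depends_on = defaultdict(set)  # component -> components it depends on
        self.symptoms = defaultdict(list)   # component -> symptom labels

    def add_dependency(self, component, dependency):
        # Directed edge: `component` depends on `dependency`.
        self.depends_on[component].add(dependency)

    def add_symptom(self, component, symptom):
        # Attach a symptomatic label (log text, threshold violation, ...).
        self.symptoms[component].append(symptom)

    def upstream(self, component):
        """Return every component that `component` depends on, transitively."""
        seen = set()
        stack = [component]
        while stack:
            node = stack.pop()
            for dep in self.depends_on.get(node, ()):
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

    def root_cause_candidates(self):
        """Assumed heuristic: symptomatic components with no symptomatic
        dependency anywhere upstream of them."""
        symptomatic = {c for c, labels in self.symptoms.items() if labels}
        return sorted(
            c for c in symptomatic if not (self.upstream(c) & symptomatic)
        )


# Illustrative data only; none of these names come from a real system.
graph = SystemGraph()
graph.add_dependency("job_1234", "node_17")
graph.add_dependency("node_17", "storage_target_3")
graph.add_dependency("node_17", "switch_2")
graph.add_symptom("job_1234", "killed before completion")
graph.add_symptom("node_17", "I/O timeout in system log")
graph.add_symptom("storage_target_3", "disk latency above threshold")

candidates = graph.root_cause_candidates()
print(candidates)  # ['storage_target_3']
```

In this toy case the job and the node both show symptoms, but each depends on a symptomatic component, so only the storage target remains as a candidate. A real system would need to handle dependencies that change over time and symptoms that are noisy or absent, which is where the open question posed in the abstract begins.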

Bio
Jon Stearley enjoys variety and challenge; his work has ranged from electrical engineering and neuroimaging programming to infrastructure architecture and resilient supercomputing. Having spent the majority of his recent efforts on log analysis (http://www.cs.sandia.gov/sisyphus), he is currently seeking to broaden the scope of system information he computes upon, focusing on novel methods to determine the root cause of failures.
