Abstract
Because the functional interdependencies among components are numerous,
complex, and dynamic, determining the root cause of failures on HPC
systems requires extensive knowledge, unwavering tenacity, and often, a
good "hunch". On future systems, however, the difficulty of this task
grows not simply with the increasing number of components, but
combinatorially with their interdependencies. Furthermore, as global
checkpoint/restart overheads increase, a focussed response to faults
becomes more important, and such a response requires root cause
determination. Consider a
supercomputer as a graph where vertices are components (hardware or
software), edges are dependencies (physical or functional), and labels are
symptomatic factors (text, numeric thresholds, waveforms, etc.). Is this
model useful for determining the root cause of failures within HPC
systems, to the benefit of human or automated responders?
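
As a rough illustration of the graph model described above, the following
Python sketch stores components as vertices, dependencies as directed
edges, and symptoms as labels attached to vertices. Everything in it is an
assumption made for illustration: the class name SystemGraph, the component
names, the symptom strings, and in particular the heuristic that a
symptomatic component with no symptomatic dependency upstream is a
candidate root cause. It is not a method proposed in the talk.

```python
from collections import defaultdict


class SystemGraph:
    """Hypothetical model: components are vertices, dependencies are edges."""

    def __init__(self):
        # component -> set of components it directly depends on
        self.depends_on = defaultdict(set)
        # component -> list of observed symptom labels
        self.symptoms = defaultdict(list)

    def add_dependency(self, component, dependency):
        self.depends_on[component].add(dependency)
        self.depends_on[dependency]  # make sure the dependency is a vertex too

    def add_symptom(self, component, symptom):
        self.symptoms[component].append(symptom)

    def upstream(self, component):
        """Return every component that `component` depends on, transitively."""
        seen = set()
        stack = list(self.depends_on[component])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self.depends_on[node])
        return seen

    def root_cause_candidates(self):
        """Symptomatic components with no symptomatic component upstream.

        Assumed heuristic only; components in a dependency cycle are
        upstream of themselves and so are never reported.
        """
        symptomatic = {c for c, labels in self.symptoms.items() if labels}
        return sorted(
            c for c in symptomatic if not (self.upstream(c) & symptomatic)
        )


# Hypothetical components and symptoms, invented for this example.
g = SystemGraph()
g.add_dependency("job_1234", "node_17")
g.add_dependency("node_17", "lustre_ost_3")
g.add_dependency("node_17", "switch_2")
g.add_dependency("node_18", "switch_2")

g.add_symptom("job_1234", "killed by signal")
g.add_symptom("node_17", "I/O timeout in log")
g.add_symptom("lustre_ost_3", "disk error count above threshold")

candidates = g.root_cause_candidates()
print(candidates)
```

In this toy case the job and the node both depend, directly or indirectly,
on a storage target that is itself showing symptoms, so only the storage
target is reported. A realistic system would need weighted or probabilistic
edges, time ordering of symptoms, and handling of cyclic dependencies, none
of which this sketch attempts.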
Bio
Jon Stearley enjoys variety and challenge, with work ranging from
electrical engineering and neuroimaging programming to infrastructure
architecture and resilient supercomputing. Having spent the majority of
his recent efforts on log analysis (http://www.cs.sandia.gov/sisyphus), he
is currently seeking to broaden the range of system information he computes
upon, focussing on novel methods to determine the root cause of failures.