Imperfections are an unavoidable characteristic of complex systems; the
costs of these imperfections make it imperative for us to devise generic
methods for effectively detecting and isolating them. Toward this end, I
will present a technique that infers the dependency structure of a system
by looking for anomalous behavior that is correlated in time across
components. I'll also show some early results on a supercomputer and an
autonomous vehicle, and give a motivational survey of my work on system
management: job scheduling, quality-of-service guarantees, checkpointing,
and log analysis.

Adam Oliner is a third-year PhD student in the Computer Science Department
at Stanford University, working with Alex Aiken. He is a DOE High
Performance Computer Science Fellow and honorary Stanford Graduate Fellow.
Before coming to Stanford, he earned a Master of Engineering in
electrical engineering and computer science at MIT, where he also received
undergraduate degrees in computer science and mathematics. He interned
several times at IBM with the Blue Gene/L system software team and spent a
summer studying supercomputer logs at Sandia National Labs.