Combining System Characterization and Novel Execution Models to Achieve Scalable Robust Computing
Jim Brandt, Sandia National Laboratories
Abstract
New platforms are growing in size and complexity, both within a node
element and within the high-bandwidth, low-latency networks that provide
the communication paths between node elements. Multi-core architectures
add even more diversity to communication paths and contention for
resources as the core count per socket continues to grow. Furthermore,
the corresponding growth in component count contributes to an
ever-shrinking system-wide mean time to component failure. Understanding the
heterogeneous and hierarchical nature of the platform will allow better
utilization of the underlying platform resources and better handling of
failure or expected failure situations. This talk presents our ongoing
work on using system characterization and resource state monitoring and
analysis in conjunction with intelligent resource management and existing
and new programming models to make applications not only more resilient to
system faults but also more efficient.
Bio
Jim Brandt has been involved in research in high-performance computing
platforms, performance optimization tools, and informatics for over 10
years. He is the lead of Sandia's OVIS (http://ovis.ca.sandia.gov) project,
which is developing an open-source tool for Intelligent Real-time
Monitoring and Analysis of Large HPC clusters. OVIS has been used for
analyzing system data from Sandia's Red Storm, Thunderbird, TLCC, and
Talon clusters as well as chemical sensor data in conjunction with
Sandia's SNIFFER project. Jim's relevant workshop activities include
organizing the 2006 Tri-lab RAS workshop, chairing the 2008 Sandia
Workshop on Data Mining and Data Analysis, and organizing the 2007 Red
Storm performance optimization workshop.