Combining System Characterization and Novel Execution Models to Achieve Scalable Robust Computing
Jim Brandt, Sandia National Laboratories

Abstract
New platforms are growing in size and complexity, both within a node element and within the high-bandwidth, low-latency networks that provide the communication paths between node elements. Multi-core architectures add further diversity to communication paths and increase contention for resources as the core count per socket continues to grow. Furthermore, the corresponding growth in component count contributes to an ever-shrinking system-wide mean time to component failure. Understanding the heterogeneous and hierarchical nature of the platform will allow better utilization of the underlying resources and better handling of actual or anticipated failures. This talk presents our ongoing work on using system characterization and resource state monitoring and analysis, in conjunction with intelligent resource management and both existing and new programming models, to make applications not only more resilient to system faults but also more efficient.

Bio
Jim Brandt has been involved in research in high-performance computing platforms, performance optimization tools, and informatics for over 10 years. He is the lead of Sandia's OVIS (http://ovis.ca.sandia.gov) project, which is developing an open-source tool for Intelligent Real-time Monitoring and Analysis of Large HPC clusters. OVIS has been used for analyzing system data from Sandia's Red Storm, Thunderbird, TLCC, and Talon clusters, as well as chemical sensor data in conjunction with Sandia's SNIFFER project. Jim's relevant workshop organization activities include serving as organizer of the 2006 Tri-lab RAS workshop, chair of the 2008 Sandia Workshop on Data Mining and Data Analysis, and organizer of the 2007 Red Storm performance optimization workshop.
