
2009-...: Soft-Error Resilience for Future-Generation High-Performance Computing Systems

This project aims to develop a soft-error resilience strategy for future-generation high-performance computing (HPC) systems. Soft errors are becoming the predominant source of interruptions in large-scale HPC systems. Double-error detection (DED) events, which normally occur once every 1-2 million hours of operation in a memory module protected by a single-error correction (SEC) error-correcting code (ECC), translate into one such event every 10-20 hours in a system with 100,000 modules. Moreover, vendors have warned that silent data corruption (SDC), i.e., undetected bit flips, is becoming a problem as well. This project targets two different solutions for alleviating the issue of soft errors in large-scale HPC systems: (1) checkpoint storage virtualization to significantly improve checkpoint/restart times, and (2) software dual-modular redundancy (DMR) to eliminate rollback/recovery in HPC. The checkpoint storage virtualization aggregates a variety of back-end resources, such as flash, memory, or both, and uses them in conjunction with traditional parallel file systems. Applications are able to use it seamlessly through the standard file system interface with high read/write throughput. The core concept of the DMR technology relies on software-level replication of computational processes using the state-machine replication approach and on process cloning technology for fast recovery.
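The 10-20 hour figure follows from dividing the per-module interval between DED events by the number of modules, assuming the modules fail independently and at a constant rate. A minimal sketch of that arithmetic (the function name is illustrative and not part of the project):

```python
def system_event_interval_hours(module_interval_hours: float, modules: int) -> float:
    """Mean time between events for the whole system.

    Assumes independent modules with constant event rates, so the
    system-wide rate is simply the sum of the per-module rates.
    """
    if modules <= 0:
        raise ValueError("modules must be positive")
    return module_interval_hours / modules


# One DED event per 1-2 million module-hours, 100,000 modules.
for module_interval in (1_000_000, 2_000_000):
    interval = system_event_interval_hours(module_interval, 100_000)
    print(f"{module_interval:>9} h per module -> {interval:.0f} h per system")
```

With the numbers quoted above, this gives 10 and 20 hours respectively.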

2008-...: Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond

This project aims at scalable technologies for providing high-level RAS for next-generation petascale scientific high-end computing (HEC) resources and beyond as outlined by the U.S. Department of Energy (DOE) Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) and the U.S. National Coordination Office for Networking and Information Technology Research and Development (NCO/NITRD) High-End Computing Revitalization Task Force (HECRTF) activities. Based on virtualized adaptation, reconfiguration, and preemptive measures, the ultimate goal is to provide for non-stop scientific computing on a 24x7 basis without interruption. The technical approach leverages system-level virtualization technology to enable transparent proactive and reactive fault tolerance mechanisms on extreme-scale HEC systems. This effort targets: (1) reliability analysis for identifying pre-fault indicators, predicting failures, and modeling and monitoring component and system reliability, (2) proactive fault tolerance technology based on preemptive migration away from components that are about to fail, (3) reactive fault tolerance enhancements, such as adapting checkpoint interval and placement to actual and predicted system health threats, and (4) holistic fault tolerance through a combination of adaptive proactive and reactive fault tolerance. For more information, please visit www.fastos.org/ras.
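To illustrate what adapting the checkpoint interval to system health can mean, the sketch below uses Young's first-order approximation, interval = sqrt(2 x checkpoint cost x MTBF). This formula is a well-known rule of thumb and is an assumption here; the text above does not say which model the project uses, and the function name and input values are hypothetical.

```python
import math


def young_checkpoint_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's approximation of the checkpoint interval that minimizes lost work.

    Both arguments and the result use the same time unit.
    Assumed model for illustration, not the project's stated method.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Hypothetical numbers: a 6-minute checkpoint, and a predicted MTBF that
# drops from 1200 to 300 minutes as system health degrades.
for predicted_mtbf in (1200.0, 300.0):
    interval = young_checkpoint_interval(6.0, predicted_mtbf)
    print(f"MTBF {predicted_mtbf:.0f} min -> checkpoint every {interval:.0f} min")
```

Under these assumptions the interval shrinks from 120 to 60 minutes, which is the direction of adaptation described in item (3): checkpoint more often when failures are expected sooner.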

Select Publications ( Abstract, Publication, Presentation, Citation, DOI)

2008-...: Scalable Algorithms for Petascale Systems with Multicore Architectures

This work is part of the U.S. Department of Energy's Institute for Advanced Architecture and Algorithms (IAA). The IAA was established in 2008 to facilitate the co-design of architectures and applications in order to create synergy in their respective evolutions and to close the gap between the peak capabilities of the hardware and the performance realized by high-performance computing applications (the application-architecture performance gap). This project focuses on the development of architecture-aware algorithms and the supporting runtime features needed by these algorithms to solve general sparse linear systems common in many scientific applications. Targeted architecture-aware algorithms include (1) multi-precision Krylov solvers, preconditioners, and multi-level smoothers, (2) multi-resolution, multi-precision fast Poisson and Helmholtz solvers, (3) multi-core aware hybrid algorithms for preconditioning, and (4) parallel-in-time algorithms based on Krylov Deferred Correction. Targeted features within an architecture-aware runtime environment include multi-core aware Message Passing Interface (MPI) memory allocation, multi-level MPI communicators, and process-to-core and memory-to-core affinity. This project further focuses on evaluating the algorithmic impact of future architecture choices and determining what architecture changes would have the highest impact. The evaluation includes (1) detailed performance analyses of key computational kernels on different simulated node architectures, (2) analysis and development of new memory access capabilities that may improve use of memory bandwidth and cache memory resources, and (3) simulation of system architectures at full scale to evaluate the scalability and fault tolerance behavior of key science algorithms. For more information, please visit www.csm.ornl.gov/iaa.

2006-09: Harness Workbench: Unified and Adaptive Access to Diverse HPC Platforms

The goal of this project is to enhance the overall productivity of applications science on diverse high performance computing platforms by conducting research in two innovative software environments. The first is a virtualized command toolkit (VCT) for application building and execution that provides a common view across diverse HPC systems. The VCT consists of a software backplane architecture that presents a uniform but extensible interface for preparatory and pre-execution stages of application execution, which interfaces to instance-specific software via customizable plug-in modules. The second is a next generation runtime environment (RTE) that similarly provides a flexible, adaptive framework for plugging in modules optimized for a specific HPC system and allows dynamic interfacing to a variety of user environments. Both these environments will employ platform specific pluggable modules disseminating target-specific knowledge and expertise immediately to all end-users who can continue to interface to a familiar environment. For more information, please visit www.csm.ornl.gov/harness.

Select Publications ( Abstract, Publication, Presentation, Citation, DOI)

2006-08: Virtualized System Environments for Petascale Computing and Beyond

This research project intends to address scalability, manageability, and ease-of-use challenges in petascale system software and application runtime environments through the development of a virtual system environment (VSE). In addition to providing a scalable and reliable "sandbox" environment for scientific application development on desktops and clusters, the VSE will offer an identical production environment for scientific application deployment on terascale and petascale HEC systems. The VSE concept enables "plug-and-play" supercomputing through desktop-to-cluster-to-petaflop computer system-level virtualization based on recent advances in hypervisor virtualization technologies. The overall goal of this effort is to advance the race for scientific discovery through computation by enabling day-one operation capability of newly installed systems and by improving productivity of scientific application development and deployment.

Select Publications ( Abstract, Publication, Presentation, Citation, DOI)

2004-07: MOLAR - Modular Linux and Adaptive Runtime Support for High-End Computing

This project was a multi-institution research effort that concentrated on adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale high-end scientific computing on the next generation of supercomputers. It addressed the challenges outlined by the U.S. Department of Energy (DOE) Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) and the U.S. National Coordination Office for Networking and Information Technology Research and Development (NCO/NITRD) High-End Computing Revitalization Task Force (HECRTF) activities by providing adaptable runtime support for high-end computing operating and runtime systems. This research primarily concentrated on advancing computer reliability, availability, and serviceability (RAS) management systems to run large and long-running applications efficiently on future ultra-scale computers, and on providing advanced monitoring and adaptation mechanisms for improved application performance and predictability. For more information, please visit www.fastos.org/molar.

Select Publications ( Abstract, Publication, Presentation, Citation, DOI)

2004-06: Reliability, Availability, and Serviceability (RAS) for Terascale Computing

The goal of this work was to produce a proof-of-concept solution that will enable the removal of the numerous single points of failure in large systems while improving scalability and access to systems and data. Our research effort focused on efficient redundancy strategies for head and service nodes as well as on a distributed storage infrastructure. We developed replication mechanisms for providing symmetric active/active high availability for services running on head and service nodes in order to offer the highest level of availability without significantly impacting performance. The implemented prototypes for the batch job management system, TORQUE, and the parallel virtual file system (PVFS) metadata server offer 99.9997% service uptime using just 3 redundant nodes. For distributed data storage, the developed FreeLoader solution is built on a contributed desktop storage substrate. We developed parallel I/O mechanisms to store/access data to/from network workstations as well as caching mechanisms to store more recently used datasets. FreeLoader can offer high retrieval rates for large datasets using novel striping strategies. It also may be utilized as a virtual cache, storing only prefixes of datasets and yet delivering the entire dataset by masking the suffix patching.
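A common way to reason about an uptime figure such as the one quoted above is the parallel-redundancy model: a service replicated on n nodes is unavailable only when all n nodes are down at once. The sketch below applies that model. The per-node MTTF of 5,000 hours and MTTR of 72 hours are hypothetical inputs chosen for illustration; the text above does not state the values behind the 99.9997% figure, and the model assumes independent node failures.

```python
def node_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single node: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_availability(single: float, nodes: int) -> float:
    """Availability of a service that stays up while at least one of
    `nodes` independent, identical replicas is up."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return 1.0 - (1.0 - single) ** nodes


# Hypothetical node parameters (assumptions, not taken from the text).
single = node_availability(mttf_hours=5000.0, mttr_hours=72.0)
for n in (1, 2, 3):
    print(f"{n} node(s): {redundant_availability(single, n) * 100:.4f}% uptime")
```

Under these assumed inputs, three nodes yield roughly 99.9997% uptime, which shows how a small number of replicas can move a service from under two nines to five nines.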

Select Publications ( Abstract, Publication, Presentation, Citation, DOI)

2002-04: Super-Scalable Algorithms for Next-Generation High-Performance Cellular Architectures

This research in cellular architectures was part of a Cooperative Research and Development Agreement (CRADA) between IBM and ORNL to develop algorithms for the next generation of supercomputers. It focused on the development of algorithms that are able to use a 100,000-processor machine efficiently and are capable of adapting to or simply surviving faults. Such huge computer systems, like the IBM Blue Gene/L, need to address already existing problems in algorithm scalability and fault tolerance, which continue to grow with processor count. As a first step, the team at ORNL developed a simulator in Java, since a 100,000-processor machine was not available. A prototype of the Java Cellular Architecture Simulator (JCAS) was presented at the ACM/IEEE International Conference on Supercomputing (SC) in 2001 and was able to emulate up to 5000 virtual processors on a single real processor solving Laplace's equation. Another demonstration at the follow-up conference in 2002 was capable of emulating up to 500,000 virtual processors on a cluster with 5 real processors solving Laplace's equation and the global maximum problem. The following software releases are available:

Software Releases ( BZ2 Source TAR, GZ Source TAR, Source RPM, README)

Select Publications ( Abstract, Publication, Presentation, Citation, DOI)
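As a rough illustration of the two test problems mentioned for the simulator, the sketch below runs a Jacobi iteration for the one-dimensional Laplace equation, where each cell stands in for a virtual processor that only reads its two neighbors, and then computes a global maximum over all cells. This is not JCAS code: the function name, the grid size, and the choice of Jacobi iteration are assumptions made for this example.

```python
def jacobi_laplace_1d(cells: int, left: float, right: float, sweeps: int) -> list[float]:
    """Solve u'' = 0 on a line of `cells` points with fixed end values.

    Every interior cell repeatedly replaces its value with the average
    of its two neighbors, using only values from the previous sweep.
    """
    if cells < 2:
        raise ValueError("need at least two cells for the boundaries")
    u = [0.0] * cells
    u[0], u[-1] = left, right
    for _ in range(sweeps):
        interior = [(u[i - 1] + u[i + 1]) / 2.0 for i in range(1, cells - 1)]
        u = [u[0]] + interior + [u[-1]]
    return u


values = jacobi_laplace_1d(cells=11, left=0.0, right=1.0, sweeps=2000)
global_max = max(values)  # stands in for a global maximum reduction
print([round(v, 3) for v in values], global_max)
```

The iteration converges to a straight line between the two boundary values, and the global maximum is the larger boundary value. Because each update depends only on nearest neighbors, the same pattern maps naturally onto very large numbers of simple processors.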

2000-05: Harness - Heterogeneous Distributed Computing

The heterogeneous adaptable reconfigurable networked systems (Harness) research project focused on the design and development of a pluggable lightweight heterogeneous Distributed Virtual Machine (DVM) environment, where clusters of PCs, workstations, and "big iron" supercomputers can be aggregated to form one giant DVM (in the spirit of its widely-used predecessor, Parallel Virtual Machine (PVM)). As part of the Harness project, a variety of experiments and system prototypes were developed to explore lightweight pluggable frameworks, adaptive reconfigurable runtime environments, assembly of scientific applications from software modules, parallel plug-in paradigms, highly available DVMs, fault-tolerant message passing (FT-MPI), fine-grain security mechanisms, and heterogeneous reconfigurable communication frameworks. Three different Harness system prototypes were developed, two C variants and one Java-based alternative, each concentrating on different research issues. The technology developed within the Harness project influenced many other research and development efforts, such as the Open Run-Time Environment (ORTE). For more information, please visit www.csm.ornl.gov/harness. The following software releases are available:

Software Releases ( BZ2 Source TAR, GZ Source TAR, Source RPM, README)

Select Publications ( Abstract, Publication, Presentation, Citation, DOI)