Ongoing Projects

Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond

This project aims to develop scalable technologies that provide high-level RAS for next-generation petascale scientific high-end computing (HEC) resources and beyond, as outlined by the U.S. Department of Energy (DOE) Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) and the U.S. National Coordination Office for Networking and Information Technology Research and Development (NCO/NITRD) High-End Computing Revitalization Task Force (HECRTF) activities. Based on virtualized adaptation, reconfiguration, and preemptive measures, the ultimate goal is to provide non-stop scientific computing on a 24x7 basis. The technical approach leverages system-level virtualization technology to enable transparent proactive and reactive fault tolerance mechanisms on extreme-scale HEC systems. This effort targets: (1) reliability analysis for identifying pre-fault indicators, predicting failures, and modeling and monitoring component and system reliability, (2) proactive fault tolerance technology based on preemptive migration away from components that are about to fail, (3) reactive fault tolerance enhancements, such as adapting checkpoint interval and placement to actual and predicted system health threats, and (4) holistic fault tolerance through a combination of adaptive proactive and reactive fault tolerance. For more information, please visit www.fastos.org/ras.
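To illustrate what adapting the checkpoint interval to system health can mean in practice, the following minimal sketch uses Young's first-order approximation, a widely cited rule of thumb, which is not necessarily the method used in this project. The function name, the checkpoint cost, and the mean time between failures (MTBF) values are illustrative assumptions.

```python
import math


def young_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed value).
    mtbf_s: current estimate of the mean time between failures, in seconds.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    checkpoint_cost_s = 300.0  # hypothetical: 5 minutes per checkpoint

    # Hypothetical health estimates: a healthy system and a degraded one.
    for label, mtbf_hours in (("healthy", 24.0), ("degraded", 6.0)):
        interval_s = young_checkpoint_interval(checkpoint_cost_s, mtbf_hours * 3600.0)
        print(f"{label}: checkpoint every {interval_s / 60.0:.0f} minutes")
```

Under these assumptions, a lower predicted MTBF shortens the interval, so a system that reliability analysis flags as degraded would checkpoint more often.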

Virtualized System Environments for Petascale Computing and Beyond

This research project intends to address scalability, manageability, and ease-of-use challenges in petascale system software and application runtime environments through the development of a virtual system environment (VSE). In addition to providing a scalable and reliable "sandbox" environment for scientific application development on desktops and clusters, the VSE will offer an identical production environment for scientific application deployment on terascale and petascale HEC systems. The VSE concept enables "plug-and-play" supercomputing through desktop-to-cluster-to-petaflop computer system-level virtualization based on recent advances in hypervisor virtualization technologies. The overall goal of this effort is to advance the race for scientific discovery through computation by enabling day-one operation capability of newly installed systems and by improving productivity of scientific application development and deployment.

Harness Workbench: Unified and Adaptive Access to Diverse HPC Platforms

The goal of this project is to enhance the overall productivity of application science on diverse high performance computing platforms by conducting research in two innovative software environments. The first is a virtualized command toolkit (VCT) for application building and execution that provides a common view across diverse HPC systems. The VCT consists of a software backplane architecture that presents a uniform but extensible interface for the preparatory and pre-execution stages of application execution and that interfaces to instance-specific software via customizable plug-in modules. The second is a next-generation runtime environment (RTE) that similarly provides a flexible, adaptive framework for plugging in modules optimized for a specific HPC system and allows dynamic interfacing to a variety of user environments. Both of these environments will employ platform-specific pluggable modules, disseminating target-specific knowledge and expertise immediately to all end users, who can continue to work in a familiar environment. For more information, please visit www.csm.ornl.gov/harness.

Past Projects

MOLAR: Modular Linux and Adaptive Runtime Support for High-End Computing

This project was a multi-institution research effort that concentrated on adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale high-end scientific computing on the next generation of supercomputers. It addressed the challenges outlined by the U.S. Department of Energy (DOE) Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) and the U.S. National Coordination Office for Networking and Information Technology Research and Development (NCO/NITRD) High-End Computing Revitalization Task Force (HECRTF) activities by providing adaptable runtime support for high-end computing operating and runtime systems. This research primarily concentrated on advancing computer reliability, availability, and serviceability (RAS) management systems to run large and long-running applications efficiently on future ultra-scale computers, and on providing advanced monitoring and adaptation mechanisms for improved application performance and predictability. For more information, please visit www.fastos.org/molar.

Reliability, Availability, and Serviceability (RAS) for Terascale Computing

The goal of this work was to produce a proof-of-concept solution that enables the removal of the numerous single points of failure in large systems while improving scalability and access to systems and data. Our research effort focused on efficient redundancy strategies for head and service nodes as well as on a distributed storage infrastructure. We developed replication mechanisms for providing symmetric active/active high availability for services running on head and service nodes in order to offer the highest level of availability without significantly impacting performance. The implemented prototypes for the batch job management system, TORQUE, and the parallel virtual file system (PVFS) metadata server offer 99.9997% service uptime using just 3 redundant nodes. For distributed data storage, the developed FreeLoader solution is built on a storage substrate contributed by desktop workstations. We developed parallel I/O mechanisms to store data on and retrieve it from networked workstations, as well as caching mechanisms that keep more recently used datasets. FreeLoader can offer high retrieval rates for large datasets using novel striping strategies. It may also be used as a virtual cache that stores only the prefixes of datasets and yet delivers entire datasets, masking the patching of the missing suffixes.
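The uptime figure above can be related to node count with a standard parallel-redundancy model. The following is a minimal sketch, not the project's own analysis: it assumes that nodes fail independently and that the service stays up as long as at least one node is up. The mean time to failure (MTTF) and mean time to recover (MTTR) values are illustrative assumptions, not measurements from the prototypes.

```python
def node_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single node."""
    return mttf_hours / (mttf_hours + mttr_hours)


def service_availability(node_avail: float, nodes: int) -> float:
    """Availability of a service that survives while any one node is up.

    Assumes independent node failures (a simplifying assumption).
    """
    return 1.0 - (1.0 - node_avail) ** nodes


if __name__ == "__main__":
    # Hypothetical per-node figures chosen for illustration only.
    single = node_availability(mttf_hours=5000.0, mttr_hours=72.0)
    for nodes in (1, 2, 3):
        combined = service_availability(single, nodes)
        print(f"{nodes} node(s): {combined * 100.0:.4f}% availability")
```

With these assumed inputs, three nodes evaluate to roughly 99.9997%, which shows how a small number of redundant nodes can move a service from about two nines to five nines under the independence assumption.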

Harness: Heterogeneous Distributed Computing

The heterogeneous adaptable reconfigurable networked systems (Harness) research project focused on the design and development of a pluggable lightweight heterogeneous Distributed Virtual Machine (DVM) environment, where clusters of PCs, workstations, and "big iron" supercomputers can be aggregated to form one giant DVM (in the spirit of its widely-used predecessor, Parallel Virtual Machine (PVM)). As part of the Harness project, a variety of experiments and system prototypes were developed to explore lightweight pluggable frameworks, adaptive reconfigurable runtime environments, assembly of scientific applications from software modules, parallel plug-in paradigms, highly available DVMs, fault-tolerant message passing (FT-MPI), fine-grain security mechanisms, and heterogeneous reconfigurable communication frameworks. Three different Harness system prototypes were developed, two C variants and one Java-based alternative, each concentrating on different research issues. The technology developed within the Harness project influenced many other research and development efforts, such as the Open Run-Time Environment (ORTE). For more information, please visit www.csm.ornl.gov/harness. The following software releases are available:

  • Harness Runtime Environment - harness-2.0.b0

Super-Scalable Algorithms for Next-Generation High-Performance Cellular Architectures

This research in cellular architectures was part of a Cooperative Research and Development Agreement (CRADA) between IBM and ORNL to develop algorithms for the next generation of supercomputers. It focused on the development of algorithms that are able to use a 100,000-processor machine efficiently and are capable of adapting to or simply surviving faults. Such huge computer systems, like the IBM Blue Gene/L, need to address already existing problems in algorithm scalability and fault tolerance, which continue to grow with processor count. As a first step, the team at ORNL developed a simulator in Java, since a 100,000-processor machine was not available. A prototype of the Java Cellular Architecture Simulator (JCAS) was presented at the ACM/IEEE International Conference on Supercomputing (SC) in 2001 and was able to emulate up to 5000 virtual processors on a single real processor solving Laplace's equation. Another demonstration at the follow-up conference in 2002 was capable of emulating up to 500,000 virtual processors on a cluster with 5 real processors solving Laplace's equation and the global maximum problem. The following software releases are available:

  • Java Cellular Architecture Simulator - jcas-6.2-b1
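
For readers unfamiliar with the Laplace's equation workload mentioned above, the following is a minimal serial sketch of a Jacobi iteration on a small 2D grid. It is not JCAS code: the function name, grid size, boundary values, and tolerance are illustrative assumptions. In a cellular setting, each interior cell could be mapped to one virtual processor that only exchanges values with its four neighbors, and the convergence check becomes a global maximum reduction.

```python
def jacobi_laplace(grid, tolerance=1e-6, max_iterations=10_000):
    """Relax the interior of `grid` toward a solution of Laplace's equation.

    Boundary cells are held fixed. Each interior cell is repeatedly replaced
    by the average of its four neighbors, using only values from the
    previous sweep (Jacobi iteration). Returns (solution, sweeps_used).
    """
    rows, cols = len(grid), len(grid[0])
    current = [row[:] for row in grid]
    for sweep in range(1, max_iterations + 1):
        updated = [row[:] for row in current]
        max_change = 0.0  # global maximum of the per-cell changes
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                updated[i][j] = 0.25 * (
                    current[i - 1][j] + current[i + 1][j]
                    + current[i][j - 1] + current[i][j + 1]
                )
                max_change = max(max_change, abs(updated[i][j] - current[i][j]))
        current = updated
        if max_change < tolerance:
            return current, sweep
    return current, max_iterations


if __name__ == "__main__":
    # Hypothetical 5x5 plate: top edge held at 100, other edges at 0.
    size = 5
    plate = [[0.0] * size for _ in range(size)]
    plate[0] = [100.0] * size
    solution, sweeps = jacobi_laplace(plate)
    print(f"converged after {sweeps} sweeps")
    for row in solution:
        print(" ".join(f"{value:7.3f}" for value in row))
```

Because every cell update depends only on neighboring values from the previous sweep, the method tolerates a very fine-grained decomposition, which is one reason it is a common test problem for machines with very large processor counts.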

Copyright © 2001-2008, Christian Engelmann. All Rights Reserved. Last Modified: Tuesday, 08-Jul-2008 16:35:32 EDT