Proactive fault tolerance aims at identifying precursors to imminent failures and preemptively moving computation away from components that are about to fail. As a consequence, application failures are avoided and application mean-time to failure (AMTTF) is extended.
Proactive fault tolerance, also often referred to as preemptive migration, moves parts of an application (tasks, processes, or virtual machines) away from unhealthy components. The concept relies on a reliability-aware runtime framework with near-real-time failure prediction based on system health monitoring and statistical modeling. Pre-failure indicators, such as a significant increase in heat, an unusual number of network communication errors, or a fan fault, can be used to avoid an imminent application failure through anticipation and reconfiguration.
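The anticipation-and-reconfiguration step can be illustrated with a minimal sketch. It assumes fixed thresholds in place of the statistical failure-prediction model described above, and every name in it (`HealthSample`, `MAX_TEMPERATURE_C`, `MAX_NETWORK_ERRORS`, `plan_migrations`) is hypothetical rather than part of the prototypes.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """One monitoring reading for a node (hypothetical structure)."""
    node: str
    temperature_c: float
    network_errors: int
    fan_ok: bool


# Assumed thresholds for illustration only; a real framework would
# derive predictions from statistical modeling of health data.
MAX_TEMPERATURE_C = 85.0
MAX_NETWORK_ERRORS = 100


def is_unhealthy(sample: HealthSample) -> bool:
    """Flag a node when any pre-failure indicator is present."""
    return (
        sample.temperature_c > MAX_TEMPERATURE_C
        or sample.network_errors > MAX_NETWORK_ERRORS
        or not sample.fan_ok
    )


def plan_migrations(samples, placement):
    """Map each task on an unhealthy node to a healthy target node.

    `placement` maps task name -> current node. Targets are assigned
    round-robin over the healthy nodes. Returns an empty plan when no
    healthy node is available.
    """
    unhealthy = {s.node for s in samples if is_unhealthy(s)}
    healthy = [s.node for s in samples if s.node not in unhealthy]
    plan = {}
    if not healthy:
        return plan
    next_target = 0
    for task, node in sorted(placement.items()):
        if node in unhealthy:
            plan[task] = healthy[next_target % len(healthy)]
            next_target += 1
    return plan


samples = [
    HealthSample("n0", 92.0, 3, True),    # overheating
    HealthSample("n1", 60.0, 0, True),    # healthy
    HealthSample("n2", 58.0, 500, True),  # many network errors
]
placement = {"rank0": "n0", "rank1": "n1", "rank2": "n2"}
plan = plan_migrations(samples, placement)
```

In this example the tasks on `n0` and `n2` are both scheduled to move to `n1`, the only node without a pre-failure indicator.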
This technology can also tune existing reactive fault recovery solutions to actual and predicted system health threats. For example, the checkpoint frequency may be adjusted based on the current reliability of the system, or checkpoints may be initiated based on predicted failures. Additionally, the overall cost of fault tolerance measures may be significantly reduced by combining reactive and proactive solutions, such that a system preemptively migrates applications, checkpoints them only very infrequently, and restarts them only in the event of an unpredicted failure. The more failures are avoided by proactive fault tolerance mechanisms, the fewer failures need to be handled by reactive fault tolerance mechanisms, such as checkpoint/restart.
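One way to picture reliability-aware checkpoint tuning is Young's first-order approximation for the checkpoint interval, sqrt(2 x checkpoint cost x MTTF). This formula is swapped in here for illustration; the text above does not state which model the framework uses, and the function name and the example values are assumptions.

```python
import math


def checkpoint_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Young's approximation: interval = sqrt(2 * cost * MTTF).

    Both arguments are in seconds. A shorter (predicted) MTTF yields
    a shorter interval, i.e. more frequent checkpoints.
    """
    if checkpoint_cost_s <= 0 or mttf_s <= 0:
        raise ValueError("cost and MTTF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


# Assumed example values: a 5-minute checkpoint, and a system whose
# estimated MTTF drops from 24 hours to 6 hours as health degrades.
healthy_interval = checkpoint_interval(300.0, 24 * 3600.0)   # 7200 s
degraded_interval = checkpoint_interval(300.0, 6 * 3600.0)   # 3600 s
```

Under these assumed values, the interval shrinks from two hours to one hour when the estimated MTTF falls by a factor of four.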
In conjunction with our research in reliability, availability, and serviceability (RAS) for high-performance computing (HPC) systems, we developed proof-of-concept prototypes for proactive fault tolerance using (1) process-level and (2) virtual-machine-level preemptive migration. Further work focused on the reliability-aware runtime framework for proactive fault tolerance.



Symmetric active/active replication provides high availability for any type of networked service using the well-known state-machine replication concept, which relies on a group communication system for totally ordered and reliably delivered messages in a virtually synchronous service group.
In the symmetric active/active replication model, two or more active services offer the same capabilities and maintain a common global state. The model is based on guaranteeing the same initial state and a linear history of state transitions for all active services, i.e., virtual synchrony. State replication is performed by totally ordering all incoming messages and reliably delivering them to all services. A process group communication system is used to ensure total message order, reliable message delivery, and group membership management. The consistent output messages produced by all correct services are unified either by simply ignoring duplicated messages or by using the group communication system for distributed mutual exclusion.
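The replication model can be sketched in a few lines. This is a toy, single-process illustration: the `Sequencer` class stands in for the group communication system and handles neither failures nor group membership, and all class and function names are hypothetical.

```python
class CounterService:
    """A deterministic service replica holding one integer as state."""

    def __init__(self, name):
        self.name = name
        self.state = 0       # identical initial state on every replica
        self.history = []    # sequence numbers applied so far

    def deliver(self, seq, message):
        """Apply one totally ordered message and return the reply."""
        op, value = message
        if op == "add":
            self.state += value
        elif op == "mul":
            self.state *= value
        else:
            raise ValueError(f"unknown operation: {op}")
        self.history.append(seq)
        return (seq, self.state)


class Sequencer:
    """Stand-in for a group communication system (assumption).

    Assigns a global sequence number to each message and delivers it
    to every replica in that order.
    """

    def __init__(self, replicas):
        self.replicas = replicas
        self.next_seq = 0

    def broadcast(self, message):
        seq = self.next_seq
        self.next_seq += 1
        return [replica.deliver(seq, message) for replica in self.replicas]


def unify_outputs(outputs):
    """Collapse duplicate replies from the replicas into one reply."""
    unique = set(outputs)
    if len(unique) != 1:
        raise RuntimeError("replicas diverged")
    return unique.pop()


replicas = [CounterService("a"), CounterService("b")]
group = Sequencer(replicas)
messages = [("add", 2), ("mul", 5), ("add", 1)]
replies = [unify_outputs(group.broadcast(m)) for m in messages]
```

Because both replicas start from the same state and apply the same messages in the same order, they end in the same state, and each request yields a single unified reply.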
Virtual synchrony requires that each replicated service performs the same order of state transitions based on the same order of incoming messages, which are delivered to each service by a group communication system. Adaptation of the service to this programming model can be performed either internally by modifying the service itself or externally by wrapping it into a virtually synchronous environment.
Internal replication allows each service to accept messages individually, while using the group communication system for total message order and reliable message delivery to all members of the group. This method requires modification of existing code, which may be unsuitable for complex and/or large services. However, it may lead to performance enhancements as it implies fine-grain synchronization of state transitions.
External replication avoids modification of existing code by wrapping a service into a virtually synchronous environment. Interaction with dependent services and users is intercepted, totally ordered, and reliably delivered to each service, using the group communication system to mimic the service interface. Besides leaving existing service code unmodified, this method allows the same solution to be reused for different services with the same interface. However, it may lead to performance degradation as it implies coarse-grain synchronization of state transitions.
As part of our research in high availability solutions for high-performance computing (HPC) system services, we developed two symmetric active/active replication proof-of-concept prototypes: (1) for the batch job management system TORQUE, using external replication, and (2) for the parallel virtual file system (PVFS) metadata server, using internal replication. Assuming a mean-time to failure of 5,000 hours for a service node, these prototypes improve service availability from 99.285% to 99.994% in a two-node system, and to 99.99996% with three nodes.
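The availability figures above are consistent with the standard model for redundant nodes that fail independently. The sketch below assumes a mean-time to recover of 36 hours per node; this value is not stated in the text and was chosen only because it reproduces the single-node figure of 99.285%. The function names are hypothetical.

```python
def node_availability(mttf_h: float, mttr_h: float) -> float:
    """Availability of one node: MTTF / (MTTF + MTTR)."""
    return mttf_h / (mttf_h + mttr_h)


def service_availability(mttf_h: float, mttr_h: float, nodes: int) -> float:
    """Availability of a service that is up while any node is up.

    Assumes identical nodes with independent failures, so the service
    is down only when all nodes are down at the same time.
    """
    unavailability = 1.0 - node_availability(mttf_h, mttr_h)
    return 1.0 - unavailability ** nodes


MTTF_HOURS = 5000.0  # from the text
MTTR_HOURS = 36.0    # assumption, not stated in the text

for n in (1, 2, 3):
    percent = 100.0 * service_availability(MTTF_HOURS, MTTR_HOURS, n)
    print(f"{n} node(s): {percent:.5f}%")
```

Under this assumption the model gives roughly 99.2851%, 99.9949%, and 99.99996% for one, two, and three nodes, which agrees with the figures quoted above if the two-node value is truncated rather than rounded.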



A virtual system environment (VSE) is a "sandbox" operating system and runtime (OS/R) environment provided by hypervisor virtualization technology. The capabilities of a VSE range from offering "plug-and-play" supercomputing to on-demand OS/R customization and ad hoc testbeds.
The VSE concept enables "plug-and-play" supercomputing through desktop-to-cluster-to-petaflop computer system-level virtualization based on recent advances in hypervisor virtualization technology. By allowing customization of the OS/R environment to application needs, instead of the traditional adaptation (porting) of applications to system properties, applications can seamlessly move from development environments (desktops and small clusters) to various large-scale production environments by also moving the OS/R environment itself or its key properties. The overall goal is to advance the race for scientific discovery through computation by enabling day-one operation capability of newly installed systems and by improving productivity of scientific application development and deployment.
Additionally, the on-demand deployment capability of a VSE allows the OS/R environment to be changed on a job-by-job basis. This feature, in conjunction with the isolation properties of hypervisor virtualization, permits the on-demand deployment of a testbed system software stack within a VSE for large-scale testing of operating systems, libraries, middleware, and applications on production-type HPC systems. Encapsulating a testbed run in a job on an HPC system that is normally used for production computation eliminates the need for separate large-scale testbed HPC systems and advances computer science in conjunction with computational science.



