A Coordinated Infrastructure for Fault Tolerant Systems (CIFTS)
Rinku Gupta, Argonne National Laboratory
[Slides]

Abstract
The need for leadership-class fault tolerance has increased steadily and continues to grow as emerging high-performance systems move towards petascale-level performance. While most high-end systems do provide mechanisms for detection, notification, and perhaps handling of hardware- and software-related faults, the individual components in the system perform these actions separately. Knowledge about occurring faults is seldom shared between different programs and almost never on a system-wide basis. A typical system contains numerous programs that could benefit from such knowledge, including applications, middleware libraries, job schedulers, file systems, math libraries, monitoring software, operating systems, and checkpointing software. The Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) initiative provides the foundation necessary to enable systems to adapt to faults in a holistic manner. CIFTS achieves this through the Fault Tolerance Backplane (FTB), which provides a unified management and communication framework that any program can use to publish fault-related information. In this talk, I will present some of the work done by the CIFTS group towards the development of FTB and FTB-enabled components.

Bio
Rinku Gupta is a senior scientific developer at Argonne National Laboratory and the lead developer for the Fault Tolerance Backplane project. She received her MS degree in Computer Science from Ohio State University in 2002. She has several years of experience developing systems and infrastructure for enterprise high-performance computing. Her research interests lie primarily in middleware libraries, programming models, and fault tolerance in high-end computing systems.
