Abstract
The need for leadership-class fault tolerance has grown steadily and
continues to grow as emerging high-performance systems move toward
petascale-level performance. While most high-end systems do
provide mechanisms for detection, notification, and perhaps handling of
hardware- and software-related faults, the individual components in
the system perform these actions separately. Knowledge about occurring
faults is seldom shared between different programs and almost never on a
system-wide basis. A typical system contains numerous programs that could
benefit from such knowledge, including applications, middleware libraries,
job schedulers, file systems, math libraries, monitoring software,
operating systems, and checkpointing software. The Coordinated
Infrastructure for Fault Tolerant Systems (CIFTS) initiative provides the
foundation necessary to enable systems to adapt to faults in a holistic
manner. CIFTS achieves this through the Fault Tolerance Backplane (FTB),
which provides a unified management and communication framework that any
program can use to publish fault-related information. In this talk, I
will present some of the work done by the CIFTS group towards the
development of FTB and FTB-enabled components.
Bio
Rinku Gupta is a senior scientific developer at Argonne National
Laboratory and the lead developer for the Fault Tolerance Backplane
project. She received her MS degree in Computer Science from Ohio State
University in 2002. She has several years of experience developing systems
and infrastructure for enterprise high-performance computing. Her research
interests lie primarily in middleware libraries, programming models,
and fault tolerance in high-end computing systems.