The largest computer systems have entered the era of peta-operations per
second and will climb to exa-operations per second over the next decade,
largely on the strength of more cores per chip and more chips per system.
The inevitable consequence of increasing component counts is more parts that
can fail, higher failure rates, more concurrent failures and more effort
devoted to coping with and recovering from failures -- a key role for
storage systems. In this talk I will review historical data on failure rates
in supercomputers to project future failure rates, review growing
limitations on traditional fault tolerance strategies for supercomputers
based on high-speed checkpointing to parallel storage systems, and address
the increasing failure issues in storage components.

Garth Gibson is a professor of Computer Science and Electrical and
Computer Engineering at Carnegie Mellon University (CMU) and co-founder
and Chief Technology Officer at Panasas Inc. Garth received a Ph.D. in
Computer Science from the University of California at Berkeley in 1991.
While at Berkeley he did the groundwork research and co-wrote the seminal
paper on RAID, then Redundant Arrays of Inexpensive Disks, for which he
received the 1999 IEEE Reynold B. Johnson Information Storage Award for
outstanding contributions in the field of information storage. Joining
CMU's faculty in 1991, Garth founded CMU's Parallel Data Laboratory
(www.pdl.cmu.edu), academia's premier storage systems research center,
and co-led the Network-Attached Storage Device (NASD) research project
that became the basis of the recently standardized T10 (SCSI) Object-based
Storage Devices (OSD) command set. At Panasas
(www.panasas.com) Garth led the development of the ActiveScale Storage
Cluster in use in government and commercial high-performance computing
sites, including the world's first Petaflop computer, Roadrunner, at Los
Alamos National Laboratory. Panasas products provide scalable performance
using a simply managed blade server platform. Through Panasas, Garth
co-instigated the IETF's emerging open standard for parallelism in the
next generation of Network File Systems (NFSv4.1). Garth is also principal
investigator of the Department of Energy's Petascale Data Storage
Institute (www.pdsi-scidac.org) in the Scientific Discovery through
Advanced Computing program and co-director of the Institute for Reliable
High Performance Information Technology, a joint effort with Los Alamos.
Garth has sat on a variety of academic and industrial service committees
including the Technical Council of the Storage Networking Industry
Association and the program and steering committees of the USENIX
Conference on File and Storage Technologies (FAST).