Failure in Supercomputers and Supercomputer Storage
Garth Gibson, Carnegie Mellon University / Panasas, Inc.

Abstract
The largest computer systems have entered the era of peta-operations per second and will climb to exa-operations per second over the next decade, largely on the strength of more cores per chip and more chips per system. The inevitable consequence of increasing component counts is more parts that can fail, higher failure rates, more concurrent failures, and more effort devoted to coping with and recovering from failures, a key role for storage systems. In this talk I will review historical data on failure rates in supercomputers to project future failure rates, review the growing limitations of traditional supercomputer fault tolerance strategies based on high-speed checkpointing to parallel storage systems, and address the growing failure problems in storage components.

Bio
Garth Gibson is a professor of Computer Science and Electrical and Computer Engineering at Carnegie Mellon University (CMU) and co-founder and Chief Technology Officer at Panasas Inc. Garth received a Ph.D. in Computer Science from the University of California at Berkeley in 1991. While at Berkeley he did the groundwork research and co-wrote the seminal paper on RAID, then called Redundant Arrays of Inexpensive Disks, for which he received the 1999 IEEE Reynold B. Johnson Information Storage Award for outstanding contributions in the field of information storage. Joining CMU's faculty in 1991, Garth founded CMU's Parallel Data Laboratory (www.pdl.cmu.edu), academia’s premier storage systems research center, and co-led the Network-Attached Storage Device (NASD) research project that became the basis of the recently standardized T10 (SCSI) Object-based Storage Devices (OSD) command set for storage. At Panasas (www.panasas.com) Garth led the development of the ActiveScale Storage Cluster, which is in use at government and commercial high-performance computing sites, including the world’s first Petaflop computer, Roadrunner, at Los Alamos National Laboratory. Panasas products provide scalable performance using a simply managed blade server platform. Through Panasas, Garth co-instigated the IETF's emerging open standard for parallelism in the next generation of Network File Systems (NFSv4.1). Garth is also principal investigator of the Department of Energy's Petascale Data Storage Institute (www.pdsi-scidac.org) in the Scientific Discovery through Advanced Computing program and co-director of the Institute for Reliable High Performance Information Technology, a joint effort with Los Alamos. Garth has sat on a variety of academic and industrial service committees, including the Technical Council of the Storage Networking Industry Association and the program and steering committees of the USENIX Conference on File and Storage Technologies (FAST).
