Participating Institutions: Oak Ridge National Laboratory, North Carolina State University, Virginia Tech.
CenterSS: HPC Center Storage as a System |

Procurement and optimized utilization of Petascale supercomputers and centers are a renewed national priority. Sustaining the performance and availability of such large centers is a key technical challenge that will significantly affect their usability. As recent research shows, storage system faults, data unavailability and I/O bandwidth bottlenecks can cause even today's supercomputers to fail. These problems significantly affect common supercomputing I/O operations such as data staging, offloading, checkpointing and prefetching, leading to sub-optimal HPC center performance, increased job turnaround time, frequent resubmissions and poor use of precious center resources as well as users' allocated time. Solving these issues is critical to scaling to Petascale systems.
Modern HPC centers and users' job workflows offer numerous opportunities for significant improvements along the storage hierarchy that have gone unnoticed. To this end, we propose a fresh look at the HPC storage crisis with an eye toward virtualizing the entire center as a system. In this setting, we propose to perform the following: (i) global coordination and scheduling of data and computational activities, (ii) the construction of novel storage abstractions using untapped storage resources available in the machine room and (iii) the combined use of these abstractions with each other and with traditional storage elements.
The novelty of our research lies in viewing the HPC center as a system: performing scheduling across its resources, building new abstractions from its under-utilized resources and communicating across them in a scalable fashion. The key benefits of our approach are as follows. First, coordinated scheduling aids in the optimal use of precious HPC center resources and can radically reduce job resubmissions due to data errors, leading to better center availability and serviceability. This also improves users' job turnaround time. Second, the new storage abstractions will serve as excellent candidates for intermediate storage for a variety of supercomputing I/O operations. This offers the ability to alleviate the I/O bandwidth bottleneck for key operations such as checkpointing. In addition, these storage abstractions can mask failures and improve the fault tolerance of data activities such as staging and result data offloading.



NEW!: ORNL: Internships available all through the year
