



Originally appeared on November 2, 2005

La. Tech Scores Breakthrough in HPC Clusters

Louisiana Tech University's eXtreme Computing Research (XCR) group has unveiled a breakthrough in its RAS-aware runtime, which provides transparent job queue fault tolerance in HPC cluster environments.

According to Chokchai "Box" Leangsuksun, an associate professor in computer science at Louisiana Tech, XCR's breakthrough consists of high availability, self-configuration and self-healing as enabling solutions. His group of graduate students, led by Anand Tikotekar and Kshitij Limaye, has implemented a proof-of-concept Beowulf cluster based on HA-OSCAR 1.1 and a standard HPC resource management/job queue system (e.g., PBS/TORQUE).

Preliminary results suggest that MPI jobs can continue executing, and that the job queue is preserved, regardless of failures at the head node and compute nodes. The experiment runs standard MPI jobs without any modification under LAM/MPI 7.0.

The breakthrough handles both running and queued jobs transparently, and even the queue order is maintained in the face of a catastrophic failure. The HA-OSCAR multi-head solution provides failover capability and transparently recovers the job queue in the event of a head-node outage.
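
As a rough illustration of the idea of preserving queue order across a head-node failover, here is a toy sketch in Python. It is not HA-OSCAR or PBS/TORQUE code and does not reflect how either is implemented; the HeadNode and Cluster classes, their methods, and the job names are all hypothetical, and the sketch simply assumes every queue change is mirrored to a standby head node.

```python
from collections import deque


class HeadNode:
    """Toy head node holding a FIFO job queue (hypothetical, not HA-OSCAR code)."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()  # queued job IDs, oldest first
        self.alive = True


class Cluster:
    """Assumed active/standby pair; every queue change is mirrored to the standby."""

    def __init__(self):
        self.active = HeadNode("primary")
        self.standby = HeadNode("standby")

    def submit(self, job_id):
        # Mirror the submission so the standby always holds the same order.
        self.active.queue.append(job_id)
        self.standby.queue.append(job_id)

    def dispatch(self):
        # Start the oldest queued job and mirror its removal.
        job_id = self.active.queue.popleft()
        self.standby.queue.popleft()
        return job_id

    def fail_active(self):
        # Simulate a head-node outage: promote the standby, which already
        # holds the mirrored queue, and seed a fresh standby from it.
        self.active.alive = False
        self.active, self.standby = self.standby, HeadNode("replacement")
        self.standby.queue = deque(self.active.queue)


cluster = Cluster()
for job in ["job-1", "job-2", "job-3"]:
    cluster.submit(job)

first = cluster.dispatch()   # job-1 starts running
cluster.fail_active()        # the primary head node goes down
remaining = list(cluster.active.queue)
print(first, cluster.active.name, remaining)
```

In this sketch the promoted node still sees job-2 ahead of job-3 after the outage, which is the property the article describes. A real system would also have to detect the failure, replicate state over the network, and track running jobs, none of which is modeled here.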

"This is very exciting for us," said Leangsuksun. "This marks a major milestone in our overarching goal -- toward non-stop services in HPC environment. We expect that our breakthrough technology is exactly what the community has been waiting for. Our breakthrough is also expected to be part of the next HA-OSCAR release that will have broad impacts in HPC and telecom cluster environments, especially for mission-critical applications."

This RAS-aware runtime breakthrough resulted from the MOLAR project (http://fastos.org/molar/), a collaboration between Louisiana Tech's eXtreme Computing Research group and the Network and Cluster Computing group at Oak Ridge National Laboratory.

A demo will be shown in two weeks at SC05 in booth No. 218.

HA-OSCAR is an open source project. Leangsuksun is the chief architect and project director of the HA-OSCAR research and development program at Louisiana Tech. The program is funded by the Department of Energy's Office of Science under contract DE-FG02-05ER25659.


Copyright 1993-2004, HPCwire. All Rights Reserved.
Mirrored with permission.