SRT's Website

Last Modification: Nov-02, 2009

Geoffroy R. Vallee's ORNL Webpage

Personal Data

  • Last Name: Vallee
  • First Name: Geoffroy
  • Office Address: Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
  • E-mail: valleegr at ornl dot gov

Software

I am the chair and main developer of the OSCAR project: http://oscar.openclustergroup.org/.
I designed and developed the OSCAR-V software (http://www.csm.ornl.gov/srt/oscarv/) and v2m software (http://www.csm.ornl.gov/srt/v2m/).
I am also the technical coordinator for the development of a scalable and fault-tolerant process management infrastructure (STCI project).

Research Interests

Since September 2008, I have been a Research Associate in the Computer Science Research Group at the Oak Ridge National Laboratory (ORNL), USA. My research interests are systems for high-performance computing, including run-time systems, system resilience, operating systems, and system-level virtualization.

My position at the Oak Ridge National Laboratory allows me to work on the following projects:

  • A scalable and fault-tolerant process management infrastructure for petascale and exascale systems (STCI).
  • Enabling Exascale Hardware and Software Development Through Scalable System Virtualization, which focuses on developing new system virtualization techniques to accelerate the design, development, and usability of exascale systems.

In 2005 and 2006, I also served as a technical expert for the European Commission, reviewing a project that it funded.

From March 2004 to September 2005, I was an industrial postdoctoral researcher at INRIA, co-funded by Électricité De France Research and Development (EDF R/D). I worked on methods for building, programming, and using clusters, developing a fully integrated and easy-to-install software bundle designed for high-performance cluster computing. One way to make clusters easier to program and use is to run a Single System Image (SSI) as the cluster operating system. I therefore worked on the creation of an SSI package for OSCAR, a cluster distribution that provides a snapshot of the best known methods for building, programming, and using clusters. To that end, I joined the OSCAR team at the Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA, for one year as part of my postdoctoral position. I also worked on the OSCAR on Debian project, a port of OSCAR to the Debian Linux distribution.

I completed my Ph.D. at the University of Rennes 1, funded by Électricité De France Research and Development (EDF R/D). During my Ph.D., I carried out my research at IRISA/INRIA in the PARIS project-team (http://www.irisa.fr/paris). I worked on global scheduling and global process management in Kerrighed, an operating system for high-performance computing on clusters.

Past Projects

Institute for Advanced Architectures and Algorithms (IAA)

In the next few years, tremendous increases in supercomputer capability will revolutionize the way science is done, and predictive computer simulations will play a critical role in national security, energy, scientific discovery, and national competitiveness. The dramatic increase in computing power at the microprocessor level will be driven by a rapid escalation in the number of cores incorporated into a single chip rather than by increases in clock rate. The transition from massively parallel architectures to multi-core architectures will be as profound and challenging as the change from vector architectures to massively parallel computers in the early 1990s, a change that enabled our Nation and the U.S. Department of Energy to break the teraflop barrier. To use the next generation of computers effectively, the nation must solve a host of architectural challenges in hardware and software.
http://www.csm.ornl.gov/iaa/

Virtualized System Environments for Petascale Computing and Beyond

Description: This research project intends to address scalability, manageability, and ease-of-use challenges in petascale system software and application runtime environments through the development of a virtual system environment (VSE). In addition to providing a scalable and reliable "sandbox" environment for scientific application development on desktops and clusters, the VSE will offer an identical production environment for scientific application deployment on terascale and petascale HEC systems. The VSE concept enables "plug-and-play" supercomputing through desktop-to-cluster-to-petaflop computer system-level virtualization based on recent advances in hypervisor virtualization technologies. The overall goal of this effort is to advance the race for scientific discovery through computation by enabling day-one operation capability of newly installed systems and by improving productivity of scientific application development and deployment.

  • Source: Oak Ridge National Laboratory
  • Program: Laboratory Directed Research and Development (LDRD); Ultrascale Computing Initiative
  • Investigators: Oak Ridge National Laboratory: S. L. Scott, H. Ong, C. Engelmann, G. Vallee, R. A. Kendall; Sandia National Laboratory: R. Brightwell; University of New Mexico: A. B. Maccabe

MOLAR: Modular Linux and Adaptive Runtime Support for High-End Computing

Description: This project is a multi-institution research effort that concentrates on adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale high-end scientific computing on the next generation of supercomputers. It addresses the challenges outlined by the FAST-OS (forum to address scalable technology for runtime and operating systems) and HECRTF (high-end computing revitalization task force) activities by providing adaptable runtime support for high-end computing operating and runtime systems. This research primarily concentrates on advancing computer reliability, availability, and serviceability (RAS) management systems to run large and long-running applications efficiently on future ultra-scale computers, and on providing advanced monitoring and adaptation mechanisms for improved application performance and predictability. For more information, please visit www.fastos.org/molar.

  • Source: Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy
  • Program: Operating/Runtime Systems for Extreme Scale Scientific Computation (LAB 04-13)
  • Investigators: Oak Ridge National Laboratory: S. L. Scott, J. S. Vetter, D. E. Bernholdt, C. Engelmann; Louisiana Tech University: C. Leangsuksun; Ohio State University: P. Sadayappan; North Carolina State University: F. Mueller

PetaScale Single System Image

Description: Peta-scale computers with thousands of times more computational power will be available to scientists by the end of this decade. The research and development of these next generation computing architectures and the corresponding computing environments will accelerate scientific discoveries within the Office of Science as well as within the high-end computing community in general. The overwhelming size and complexity of a peta-scale system will require a computing environment that addresses the scalability of the file system, a network that facilitates communications across the scale of processors, and an aggressive approach for dealing with operating system noise. We believe the best way to reach this environment is to scale a single system image Linux environment to 100,000 processors. We propose fundamental research and development using the Open Single System Image (OpenSSI) project as a baseline to provide a balanced solution to accomplish the goals outlined in this proposal. OpenSSI is an open source project maintained by Bruce Walker (Co-PI of this proposal) which provides a single root file system and single process space across distributed resources. OpenSSI is a full implementation of the single system image and has a more fault tolerant peer-to-peer communication system than other implementations available.

  • Grant Source: Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy
  • Program: Operating/Runtime Systems for Extreme Scale Scientific Computation (LAB 04-13)
  • Investigators: Rice University: Alan Cox; Oak Ridge National Laboratory: R. Scott Studham; HP: Bruce Walker; CFS: Peter Braam; Rice University: Peter Druschel, Scott Rixner

Kerrighed: A Single System Image for High-Performance Computing

Description: Today, clusters are widely used to run a wide range of high-performance scientific applications, which may be sequential or parallel. Previous studies show that different kinds of application workloads and cluster usages imply different needs in resource management, and thus different scheduling policies. In my thesis work, I proposed a modular architecture that provides an adaptive global scheduler and a development framework for new policies, allowing both traditional scheduling policies and any new policy to be integrated into Kerrighed. The proposed global scheduler is based on mechanisms for dynamic configuration and for hot-loading and hot-eviction of scheduling modules. A development toolkit for scheduling policies has also been implemented to simplify the development of new policies. The global scheduler relies on mechanisms to efficiently manage processes in a cluster. I proposed a "ghost process" mechanism that enables process migration, remote process creation, and process checkpoint and restart. This mechanism takes advantage of Kerrighed's distributed shared memory for thread management. Based on these mechanisms, the Kerrighed operating system provides a pthread interface, allowing in particular the execution of OpenMP applications on a cluster. The process checkpoint/restart mechanism has been used in backward error recovery protocols for parallel applications, designed in collaboration with Ramamurthy Badrinath, Associate Professor at the Indian Institute of Technology of Kharagpur. A coordinated checkpointing strategy for shared memory parallel applications has been implemented in the Kerrighed operating system. All the mechanisms and algorithms that I proposed have been integrated into a prototype of the Kerrighed operating system. Kerrighed is open source software available at http://www.kerrighed.org.
My work was validated by running industrial applications provided by EDF R/D on top of Kerrighed. The Kerrighed project is now led and developed by the KerLabs company (http://www.kerlabs.com/).

  • Program: Operating System Research Team, Paris Project-Team, IRISA
  • Investigators: Institut National de Recherche en Informatique et en Automatique: Christine Morin, David Margery; KerLabs: Pascal Gallard, Renaud Lottiaux, Louis Rilling; Électricité de France: Jean-Yves Berthou; Indian Institute of Technology: Ramamurthy Badrinath

Publications

The list of my publications is available here: http://www.csm.ornl.gov/srt/people/gvallee/publications/index.html.

