Computer science at ORNL supports extreme-scale scientific simulation through research and engineering efforts that advance the state of the art in algorithms, programming environments, tools, and system software. ORNL's work is strongly motivated by, and often carried out in direct collaboration with, computational scientists and engineers, and it targets the largest parallel computers.
Engineering of Scientific Software (contact David Bernholdt)
The development and use of modern computational science and engineering software poses complex challenges, especially on high-end massively parallel computers. Within increasingly multidisciplinary research teams, computer scientists help address issues related to the programmability, performance, resilience, usability, and other aspects of scientific software. While each application has many distinctive aspects, there are also many commonalities. Computer scientists interested in the engineering of scientific software therefore operate at multiple levels. "Embedding" in project teams allows us to work deeply with a given application, addressing issues that may be specific to that application or may be more general. By working across multiple applications, often over time, we have the opportunity to transfer ideas and solutions from one domain to another, and we develop the breadth of experience to identify the abstractions that are common to many disciplines and the points where specialization is required. In this way, over time, we can continually improve our approaches and tools for engineering software.
Because the challenges to engineering scientific software are widely varied, so too are the associated computer science research results. A few recent examples in the Computer Science and Mathematics Division include:
- Developing software architecture principles and related software infrastructure to facilitate the flexible assembly of complex modeling and simulation codes from smaller modules, using component and service-oriented architecture (SOA) approaches.
- Simplifying the use of complex software by developing tools that help automate scientific workflows, and graphical front-end environments for the creation, execution, monitoring, and post-analysis of simulations.
- Developing software frameworks and new algorithms that allow applications to expose and utilize greater levels of concurrency in order to scale better and make more effective use of modern massively parallel computers.
- Developing techniques for intelligent restart of failed tasks in coupled multiphysics simulations to provide better resource utilization and reduced turnaround time.
- Developing tools and techniques to increase the portability and performance of applications across a variety of computational platforms, often with very different architectures and performance characteristics.
- Developing tools and techniques to simplify the programming of complex applications on diverse hardware platforms.
Our interactions with scientific applications also often help to motivate and validate many of our research thrusts, such as resilience, program translation, and performance engineering.
Resilience (contact Christian Engelmann)
Hardware and software faults are an unavoidable aspect of any computer system, and their management accounts for a great deal of effort at all levels in the system. While faults occur continuously, they are only significant if they result in an interruption of work or in a wrong answer. Resilience is about keeping a computer system running and producing correct output in a timely manner. As a supercomputer consists of millions of individual components, including hundreds of thousands of processor and memory chips, the probability of faults is much higher than in a consumer product. Supercomputers also constantly push the envelope of what is achievable with today's technology, such as by relying on the latest accomplishments in processor and memory technology, further increasing the potential for faults. The goal of resilient high-performance computing (HPC) is to provide efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery across all levels of HPC hardware and software. Our research and development in HPC resilience focuses on fault characterization, prevention, detection, notification, and handling as part of HPC hardware/software co-design that considers the cost/benefit trade-off between the key system design factors: performance, resilience, and power consumption.
Our efforts target (1) fault injection tools to study the vulnerability, propagation properties, and handling coverage of processors, memory, system software, and science applications, (2) fault detection and notification software frameworks for communicating information across all levels of the system, (3) reactive software mechanisms, like checkpoint/restart and message logging, (4) proactive software approaches, such as migration of work in anticipation of faults, reliability-aware scheduling, and rejuvenation, (5) programming model approaches, like the fault-tolerant Message Passing Interface, (6) algorithm-based fault tolerance with recovery or fault-oblivious approaches embedded in science applications, and (7) resilience co-design tools to study the cost/benefit trade-off between the key system design factors. Our work in HPC resilience, including the knowledge gained and the solutions developed, ensures that DOE's Leadership computing systems continue to enable scientific breakthroughs by operating with acceptable efficiency and productivity.
Program Translation (contact David Bernholdt)
Program translation is central to nearly all computing. In its simplest and most familiar form, a compiler translates a program into executable code. But similar techniques can be used to manipulate programs in other ways, transforming one program into another at the source code level or into other intermediate representations, as well as analyzing them in various ways.
Translation approaches most often target the programmability and the performance of software, but can also impact resilience, verification and debugging, and other important aspects of software development. A common feature of most translation-based approaches is that they are based on compilers or compiler-like tools that can process the code with appropriate awareness of the syntax and semantics of the programming language used.
Program translation research in the Computer Science and Mathematics Division spans a broad range of topics. Research on optimization and code generation techniques helps produce code that runs faster or more efficiently on a target platform with constrained power or other resources. Extending programming languages (including, for example, the OpenMP and OpenACC directive languages) to make new abstractions available allows programmers to express their computations more succinctly, while allowing the compiler to generate better code. Although many programming languages already exist, there are sometimes good reasons to create new ones, especially small languages specialized to address key needs of scientific applications. One example is a directive language that can be embedded in a traditional language, like C, C++, or Fortran, allowing the programmer to express inter-process communications in a simple abstract form, which can be translated and optimized to efficiently target multiple communications libraries, such as one- or two-sided MPI, OpenSHMEM, or others. Another example is domain-specific languages (DSLs), which provide specialized constructs tailored to facilitate expressing complex programs in a given scientific or problem domain. While providing the programmer with potentially huge benefits in programmability and correctness, DSLs also provide a means to capture and convey key domain knowledge to the compiler system, allowing it to perform domain-specific optimization and generate better code; such information is unavailable to, and cannot be inferred by, a general-purpose language environment.
The program translation approach can also be used to help understand programs and to transform them into new programs. For example, maintaining large-scale programs and porting them to new hardware platforms is often tedious and error-prone when done by hand. We are, therefore, studying the use of translation tools to identify code patterns of interest and refactor them in an automated fashion. We are also interested in approaches that will allow us to more effectively capture the experience of porting and tuning one code to a new platform and transfer it to other applications needing to make the same transition.