by James Arthur Kohl

ORNL Review, Special Issue on Advanced Computing, Vol. 30, Nos. 3 & 4, 1997.

ORNL researchers "push the envelope" to put heavy-duty computer power within scientists' grasp using the CUMULVS system.

Scientists today are frequently turning to high-performance computer simulation as a more cost-effective alternative to physical prototyping or experimentation. The on-line nature of computer simulation encourages more flexible collaboration scenarios for researchers across the country, and the interactivity that is possible can often shorten the design cycle. Yet there are many issues to overcome when developing a high-performance computer simulation, especially when many scientists will interact with the simulation over a wide-area network. The scientists must be able to observe the progress of the simulation and coordinate control over it, and the user environment must be capable of withstanding or recovering from system failures. The efficient handling of these issues requires a special expertise in computer science and a level of effort higher than the typical application scientist is willing to expend.

CUMULVS is a new software system developed at ORNL by James Kohl and Philip Papadopoulos that bridges the gap between scientists and their computer simulations. CUMULVS allows a collaborating group of scientists to attach dynamically to an ongoing computer simulation, for viewing the progress of the computation and for controlling the simulation interactively. CUMULVS also provides a framework for integrating fault tolerance into the simulation program, for saving user-directed checkpoints, migrating the simulation program across heterogeneous computer architectures, automatically restarting failed tasks, and reconfiguring the simulation while it is running.

With CUMULVS, each of the collaborators can start up an independent {\em viewer} program that will connect to the running simulation program. This viewer allows the scientists to browse through the various data fields being computed and observe the ongoing convergence toward a solution. Especially for long-running simulations, this can save countless hours waiting for results that might have gone awry in the first few moments of the run. Once connected, the scientists each view the simulation from their own perspective, receiving a steady sequence of "snapshots" of the program data that is of interest to them. CUMULVS supports a variety of visualization systems for viewing these snapshots, from commercial packages, such as AVS, to public domain interfaces, such as Tcl/Tk (developed at Sun Microsystems). CUMULVS takes care of the many complicated tasks of collecting information and coordinating communication between the various viewers and the simulation program. The end result is a single, simple picture of the computation space, presented as a uniform field of data even if the actual data are distributed across a set of parallel tasks.

Scientists can use CUMULVS for collaborative "computational steering," where certain parameters of a physical simulation or of an algorithm can be adjusted while the program is running. In a typical scenario, several scientists would attach to a simulation to view its progress. One of them might discover that something has gone wrong or is heading in the wrong direction. At this point the scientist could adjust or "steer" certain physical or algorithmic features to try to fix the simulation, or the simulation could simply be restarted with a new set of inputs. This type of interaction can save immense amounts of time by shortening the experimentation cycle. The scientist need not wait for the entire simulation to complete before making the next adjustment. CUMULVS provides mechanisms that allow groups of scientists to cooperatively manipulate a simulation, with automatic locking capabilities that are invoked to prevent conflicting steering requests for any single parameter. While only one scientist at a time can adjust the value of any single parameter, any number of different parameters can all be adjusted simultaneously.
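The locking behavior described above can be sketched in a few lines. This is an illustrative model only, not the actual CUMULVS implementation or API: a coordinator grants at most one viewer the right to steer any given parameter, while different parameters can be steered concurrently.

```python
# Sketch of CUMULVS-style steering locks (illustrative; the class and
# method names here are invented for this example, not the real API).

class SteeringCoordinator:
    def __init__(self, params):
        self.values = dict(params)   # parameter name -> current value
        self.locks = {}              # parameter name -> holding viewer

    def request_steering(self, viewer, param):
        """Grant the steering lock unless another viewer holds it."""
        holder = self.locks.get(param)
        if holder is not None and holder != viewer:
            return False             # conflicting request is refused
        self.locks[param] = viewer
        return True

    def steer(self, viewer, param, value):
        """Apply a new value, but only for the current lock holder."""
        if self.locks.get(param) != viewer:
            raise PermissionError(f"{viewer} does not hold {param}")
        self.values[param] = value

    def release(self, viewer, param):
        if self.locks.get(param) == viewer:
            del self.locks[param]

coord = SteeringCoordinator({"mach_number": 0.8, "angle_of_attack": 4.0})
assert coord.request_steering("alice", "mach_number")
assert not coord.request_steering("bob", "mach_number")   # locked out
assert coord.request_steering("bob", "angle_of_attack")   # different parameter is fine
coord.steer("alice", "mach_number", 0.85)
```

The key design point is that the lock granularity is the individual parameter, so cooperating scientists serialize only where their requests would actually conflict.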

To interact with a simulation using CUMULVS, the simulation program must be instrumented to describe the primary computational data fields and the algorithmic or physical parameters. These special declarations consist of the data type, the size and cardinality of arrays, and any distributed data decompositions. CUMULVS needs declarations only for the data that is to be viewed and the parameters that are to be steered. At the point in the simulation where the data values are deemed "valid," a single library call is made to temporarily pass control to CUMULVS. Here, any pending viewer requests are handled and any steering parameter updates are processed. If no viewers are attached, this library call carries only the overhead of a single check for an incoming message, so the intrusion on the simulation is negligible.
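The instrumentation pattern might look something like the following sketch. The function and class names here are hypothetical stand-ins, not the real CUMULVS calls; the point is the shape of the interaction: declare once, then hand control to the library once per iteration, at the moment the data are valid.

```python
# Illustrative sketch of instrumenting a simulation loop for a
# CUMULVS-style monitor. All names (Monitor, declare_field,
# declare_param, pass_control) are invented for this example.

class Monitor:
    """Minimal stand-in for the viewing/steering library."""
    def __init__(self):
        self.fields = {}     # name -> (data reference, metadata)
        self.params = {}     # name -> one-element list holding the value
        self.pending = []    # queued viewer/steering messages

    def declare_field(self, name, data, shape):
        self.fields[name] = (data, {"shape": shape, "type": "float"})

    def declare_param(self, name, cell):
        self.params[name] = cell

    def pass_control(self):
        """Called once per iteration, where the data are 'valid'.
        With no pending messages this is one cheap check."""
        if not self.pending:
            return 0                          # negligible intrusion
        handled = 0
        for kind, name, value in self.pending:
            if kind == "steer" and name in self.params:
                self.params[name][0] = value  # apply parameter update
                handled += 1
        self.pending.clear()
        return handled

mon = Monitor()
pressure = [[0.0] * 4 for _ in range(4)]
mach = [0.8]                                  # boxed so updates are visible
mon.declare_field("pressure", pressure, (4, 4))
mon.declare_param("mach_number", mach)

for step in range(3):
    # ... one iteration of the simulation's real computation here ...
    if step == 1:                             # pretend a viewer steers us
        mon.pending.append(("steer", "mach_number", 0.85))
    mon.pass_control()                        # viewers/steering serviced here
```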

The various communication protocols that CUMULVS uses to coordinate the interactions between the viewers and the simulation are tolerant to computer faults and network failures. In addition, CUMULVS provides a "checkpointing" mechanism for making the simulation program itself fault-tolerant. Using this mechanism, the state of a simulation program can be saved periodically. Then the data stored in the checkpoint can be used to automatically restart the simulation if a computer node should crash or if a network should fail. CUMULVS checkpoints can also be used to migrate parallel simulation tasks across heterogeneous computer architectures on the fly. This is not typically possible with traditional checkpointing schemes, but CUMULVS uses a special "user-directed" checkpointing approach. Because the scientist has precisely described the data in the simulation program, CUMULVS has the additional semantic information necessary to automatically migrate a task using a minimal amount of information. CUMULVS can save the state of a simulation task and then restore it, even if the new task's computer is of a different architecture or data format. Beyond that, CUMULVS can actually reconfigure an entire application by reorganizing the checkpoint data to fit a new data decomposition. So, a checkpoint saved on a cluster of workstations can be restarted to continue executing on a large parallel machine with many more nodes, or vice versa.
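The reconfiguration idea can be sketched concretely. Because a user-directed checkpoint records the logical field plus its declared decomposition, rather than raw memory images, the same data can be re-split for a different number of tasks on restart. This sketch assumes a simple 1-D block decomposition; it illustrates the concept, not the actual CUMULVS file format.

```python
# Sketch of "user-directed" checkpoint reconfiguration: the checkpoint
# stores the logical field, so it can be restored onto a different
# task count. A 1-D block decomposition is assumed for simplicity.

def block_bounds(n, ntasks, rank):
    """Index range [lo, hi) owned by `rank` in a 1-D block split."""
    base, extra = divmod(n, ntasks)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

def save_checkpoint(pieces, n):
    """Assemble per-task pieces into one architecture-neutral record."""
    flat = [x for piece in pieces for x in piece]
    return {"n": n, "data": flat}

def restore_checkpoint(ckpt, ntasks):
    """Re-split the saved field for a new task count."""
    data, n = ckpt["data"], ckpt["n"]
    return [data[lo:hi] for lo, hi in
            (block_bounds(n, ntasks, r) for r in range(ntasks))]

# Saved from 3 workstation tasks...
pieces = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
ckpt = save_checkpoint(pieces, 10)
# ...restarted on 5 nodes of a larger parallel machine.
new_pieces = restore_checkpoint(ckpt, 5)
assert new_pieces == [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```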

To better understand how these CUMULVS features are of use, consider a sample scenario. Suppose you're an engineer on a big project team to design a new high-tech jet airplane. Your job is to make sure that the air flows smoothly over the wings and around the engine intakes. If your design is off by even a small amount, you might bring down a multi-million dollar aircraft, not to mention making a lifelong enemy out of the poor pilot...

What you need is some way to really test out your ideas, try things out a few different ways -- make sure you've got it right before they start forming those expensive prototype airfoils. So you put your expertise to work, sit down at the computer and "whip up" a computational fluid dynamics (CFD) simulation of the air flowing around one of your wings. You sit back and patiently wait for the results. And you wait. And you wait some more. Whew! Finally, after many hours of waiting for your program to converge on a solution, you get the answer. But something went wrong. Your wing design didn't produce the smooth flow you expected. "What happened?"

With CUMULVS at your side, the problems with your wing simulation can easily be revealed. You decide to apply CUMULVS to the simulation program to see what's going wrong. In your CFD air flow program you add a few special declarations so that CUMULVS knows what's in your simulation and where. You give your simulation a name, "flow," so your viewer can find it. You describe the main computational data fields, "pressure" and "density." You also declare a few of the steerable parameters, like the "Mach number" and the wing's "angle of attack." After recompiling your program, you are ready to go!

This time, you start up your simulation and can immediately attach a CUMULVS viewer to see what's happening. You request the main "pressure" data field -- it's huge, but you want to see an overview of the whole array anyway. You tell CUMULVS to view the entire region of computation but at a coarse granularity, showing only every tenth data point. Using this smaller collection of data points for viewing greatly reduces the intrusion to your simulation, and the load on your network, while exploring such an immense dataset. The CUMULVS view of "pressure" appears as requested, and you begin to watch it slowly change as the simulation proceeds. From this high-level view, you can already see that something isn't quite right. It looks like the angle of attack of the wing is off by a mile. But it turns out to be a simple program bug and you fix it in no time.
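The coarse-granularity request amounts to strided subsampling. As a rough sketch (not the CUMULVS interface itself), viewing every tenth point of a 2-D field cuts the data moved to the viewer by about a factor of 100 while preserving the large-scale picture:

```python
# Sketch of a coarse-granularity view: subsample every `stride`-th
# point of a 2-D field (stored as a list of rows).

def coarse_view(field, stride):
    """Return a strided subsample of a 2-D field."""
    return [row[::stride] for row in field[::stride]]

nx, ny = 100, 100
pressure = [[float(i * ny + j) for j in range(ny)] for i in range(nx)]

view = coarse_view(pressure, 10)
assert len(view) == 10 and len(view[0]) == 10   # ~100x fewer points sent
assert view[0][1] == pressure[0][10]            # every tenth column kept
```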

Now you're ready to try again, so you start up the simulation and connect up with CUMULVS. Your wing looks much better now, so you disconnect your viewer and let the simulation run while you go out to lunch. When you get back, you connect up again to see how things are going. The simulation is working, getting closer to an answer, but you can see that the performance of the wing will not be as good as was hoped. While watching through your viewer, you tell CUMULVS to adjust one of the wing model parameters to see if you can improve the design. After a moment you see the changes to the wing appear in your viewer, but it takes several more iterations before the effects of the new information begin to be seen in the simulation results.

After tediously tweaking your model over the next few hours, you decide that your simulation program is just too darn slow. You need to split your simulation program into smaller independent pieces that can run simultaneously, or "in parallel," so you can get the job done faster. With several different computers all working together on the problem your simulation program might run in minutes instead of hours. You use a system like MPI or PVM \footnote{PVM was developed jointly at Emory University, Oak Ridge National Laboratory, and the University of Tennessee, by a team led by Al Geist at ORNL and Jack Dongarra at ORNL and the University of Tennessee. Geist and Dongarra have also been instrumental in the development of the MPI message-passing standard.} that allows you to write a "parallel program." You parallelize the CFD algorithm by breaking the original calculation down into "cells," and then you assign sets of these cells to a collection of parallel "tasks" that will cooperate to solve your problem. After each task finishes its iteration of work, the tasks will talk to each other, sending messages among themselves to share intermediate results on the way to a solution.
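The parallelization idea sketched above, splitting the cells into contiguous blocks and exchanging boundary values between neighboring tasks, can be illustrated in miniature. Here the "messages" are just Python values standing in for PVM or MPI sends; the names are invented for this example.

```python
# Sketch of cell-based parallelization: a 1-D domain of cells is split
# into contiguous blocks, one per task, and after each iteration the
# tasks exchange their edge cells with their neighbors (ghost cells).

def partition(cells, ntasks):
    """Assign contiguous blocks of cells to tasks."""
    n = len(cells)
    size = -(-n // ntasks)                 # ceiling division
    return [cells[i * size:(i + 1) * size] for i in range(ntasks)]

def exchange_boundaries(blocks):
    """Each task 'sends' its edge cell to each neighbor and
    'receives' the neighbor's edge cell in return."""
    ghosts = []
    for r, block in enumerate(blocks):
        left = blocks[r - 1][-1] if r > 0 else None
        right = blocks[r + 1][0] if r < len(blocks) - 1 else None
        ghosts.append((left, right))
    return ghosts

blocks = partition(list(range(8)), 4)      # 8 cells across 4 tasks
assert blocks == [[0, 1], [2, 3], [4, 5], [6, 7]]
assert exchange_boundaries(blocks)[1] == (1, 4)  # task 1 sees both neighbors
```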

You start up a run of your new parallel program on a few workstations and, sure enough, you get the answer back in a fraction of the time. But things are way off again, and it's worse than before. There's a huge region of turbulence off the tip of the wing. Something has gone wrong with the way you reorganized your simulation program. "Now What?!" It's time for CUMULVS again.

Parallel programming is notoriously difficult because there isn't a single thread of control as you would have in a conventional serial program. In addition, the data used in your parallel computation are likely spread across a number of distributed computer systems. This is done to capitalize on {\em locality} by leveraging faster local data accesses against more costly remote data accesses. Often, the data "decompositions" that make your parallel program the fastest are the ones that are the most complicated. CUMULVS helps a great deal with these complex data decompositions because it "un-jumbles" the data and presents it to the scientist in its original form, as if all of the data were present on a single computer.
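The "un-jumbling" step can be made concrete with a small sketch. Given the declared decomposition, the per-task sub-blocks of a 2-D field are stitched back into the single global array a viewer sees. A simple 2x2 block decomposition is assumed here; this is an illustration of the idea, not CUMULVS's internal algorithm.

```python
# Sketch of reassembling a block-decomposed 2-D field: each task holds
# one sub-block, and the viewer-side library stitches them back into
# one global array in the original row/column order.

def assemble(sub_blocks, block_rows, block_cols):
    """Stitch a grid of sub-blocks back into one global 2-D field."""
    global_field = []
    for br in range(block_rows):
        nrows = len(sub_blocks[br][0])     # rows in this block-row
        for i in range(nrows):
            row = []
            for bc in range(block_cols):
                row.extend(sub_blocks[br][bc][i])
            global_field.append(row)
    return global_field

# Four tasks each hold one 2x2 quadrant of a 4x4 field.
quadrants = [[[[0, 1], [4, 5]], [[2, 3], [6, 7]]],
             [[[8, 9], [12, 13]], [[10, 11], [14, 15]]]]
field = assemble(quadrants, 2, 2)
assert field == [[0, 1, 2, 3], [4, 5, 6, 7],
                 [8, 9, 10, 11], [12, 13, 14, 15]]
```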

To help CUMULVS collect the parallel data in your wing simulation, you need to enhance the special declarations that you made for the serial version. This involves defining the way each data field has been broken apart and distributed among the parallel tasks. After adding a few extra lines of decomposition declarations for CUMULVS, you are ready to go.

So you start up a fresh parallel wing simulation and then attach your CUMULVS viewer to see what is happening. Sure enough, almost immediately you begin to see turbulence form off the wing tip. But it's still not clear to you why it's there. Something must be wrong with the parallel algorithm. You go back into your simulation code and add some CUMULVS declarations for the "residuals" array. This array doesn't represent the physical model, but instead describes the error associated with your mathematical computation. Viewing the residuals data field with CUMULVS should indicate whether there is an algorithmic problem.

This time when you connect your viewer, you request the residuals data field. "A-Ha!" It looks like the residual error in one corner of your computation space is "stuck." One column of the space is not being updated from iteration to iteration, causing a fixed boundary condition that appears as turbulence in the model. You must have figured your array bounds wrong for that pesky parallel data decomposition....

With the bug fixed, you are now up to full speed and back to the task at hand -- designing that airplane wing. The parallel simulation runs much faster than the serial version, and you can quickly try a wide range of variations on your design. After a few days of successful experimentation, suddenly everything clicks and you arrive at a solid arrangement of the wing parameters. You excitedly call up one of your colleagues and ask her to take a look at your design. She hooks up her CUMULVS viewer to the parallel simulation running on your workstation cluster, and checks out your new design. "This looks great!" She decides to try some minor enhancements, so she takes the last CUMULVS checkpoint saved for your simulation and reconfigures it to run on her big 512-node multiprocessor system.

She picks up where you left off on your workstation cluster and begins to explore some slight adjustments to your model. She steers one wing parameter, just a little bit, to smooth off a final rough edge, and then calls you up to look at the simulation. You connect another viewer to her simulation and immediately agree that the wing is even better. The two of you contact the rest of the team, and everybody brings up their CUMULVS viewers to take a look at the new wing design. It is unanimous, everyone agrees that it is time to build a physical prototype of the wing. Success!

This type of CUMULVS functionality was recently put to the test as part of the High Performance Computing Challenge competition of the Supercomputing '96 conference held in November 1996 in Pittsburgh. CUMULVS won a Silver Medal for Innovation for its contribution to high performance scientific computing. The ORNL team, consisting of James Kohl, Philip Papadopoulos, Dave Semeraro and Al Geist, demonstrated a collaborative visualization of three-dimensional air flow over an aircraft wing (much like the "hypothetical" scenario described above). Figure 1 illustrates a CUMULVS view of the running CFD air flow simulation over a wing. Figure 2 shows a view of the residuals field that helped Dave Semeraro debug the simulation.

Figure 1: CUMULVS view for a Computational Fluid Dynamics (CFD) simulation of the air flow over a jet airplane wing. This demonstration won ORNL a Silver Medal for Innovation at Supercomputing '96.

Figure 2: CUMULVS view of the residuals field for the CFD simulation in Figure 1. The spike in the corner reveals an omitted computation cell in the parallel decomposition.