CCM/MP-2D Performance Studies

Performance Studies using CCM/MP-2D


CCM/MP-2D Performance Evaluations

Fairness is a difficult issue in comparing computer systems, especially parallel systems. If a given production code needs to be run without change, running it as is on different platforms and comparing the results is a fair measure of how well each of these systems will run the particular code. However, this is unlikely to be a fair measure of the systems in any other context.

CCM/MP-2D was designed to be tunable in certain aspects of its operation. A large number of parallel algorithm options are supported, as well as a choice of communication library. In the analysis below, we use the best performance seen on the given platforms using the available tuning options, but making no changes to the source code other than linking in math libraries. The optimal tuning parameters were identified in an extensive algorithm evaluation study.

One of the CCM/MP-2D compile-time options is to link in the MPICL instrumentation library. CCM/MP-2D has been hand-instrumented using the MPICL traceevent command to capture timings for important phases in the computation. MPICL automatically measures the time spent in MPI calls. At runtime, the user can disable instrumentation, collect summary data (profiling), or generate event traces. Event tracing is the most intrusive option and generates very large trace files even for short runs. Profiling, however, has relatively low overhead and generates manageable output files.

In this study, we analyze the achieved performance of CCM/MP-2D for each platform. CCM/MP-2D was run with profiling enabled. Timings were also compared with uninstrumented experiments to verify that the instrumentation was not perturbing the performance significantly. The instrumentation data is used to estimate the time spent in five different categories:

Serial Work:
time spent in computation found in serial implementation
Comm:
time spent in interprocessor communication
Imbal:
time spent idle due to load imbalance
Copy:
time spent copying data in support of interprocessor communication
Dupl:
time spent in redundant spectral calculations
where the time is summed over all processes. Serial Work should be constant as a function of the number of processors. If it increases as the number of processors increases, then this indicates that the computational rate is decreasing. If Serial Work decreases, then the rate is increasing. We used the Serial Work estimate for the experiment using the smallest number of processors to define a baseline, and refer to the relative change from this baseline as the
Rate:
change in time due to a change in computation rate
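The Rate definition can be illustrated with a minimal sketch. The function name and the Serial Work totals below are hypothetical values chosen for illustration, not measurements from this study.

```python
def rate_change(serial_work_by_procs):
    """Return {processor count: Serial Work minus the baseline Serial Work}.

    The baseline is the Serial Work measured on the smallest processor count.
    A positive value means the computational rate has decreased.
    """
    baseline_procs = min(serial_work_by_procs)
    baseline = serial_work_by_procs[baseline_procs]
    return {p: w - baseline for p, w in serial_work_by_procs.items()}


# Hypothetical Serial Work totals (seconds, summed over all processes).
serial_work = {1: 1000.0, 4: 1040.0, 16: 1100.0}
rate = rate_change(serial_work)
print(rate)
```

In this made-up data, Serial Work grows with the processor count, so Rate is positive and counts as an overhead.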

For each platform and problem size, we present scaling results for each of these categories. First, we present runtime and execution rate plots (described below). Next, we present an estimate for the

Efficiency:
(Serial Work for the smallest number of processors)/((Parallel Runtime)*(Number of Processors))
expressed as a percentage. We also refer to (Parallel Runtime)*(Number of Processors) as the Parallel Work.
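As a small worked example of the Efficiency and Parallel Work definitions, the following sketch uses hypothetical function names and illustrative numbers (not data from this study).

```python
def parallel_work(runtime, procs):
    """Parallel Work = (Parallel Runtime) * (Number of Processors)."""
    return runtime * procs


def efficiency_percent(baseline_serial_work, runtime, procs):
    """Efficiency as a percentage: baseline Serial Work / Parallel Work."""
    return 100.0 * baseline_serial_work / parallel_work(runtime, procs)


# Hypothetical values: 1000 s of baseline Serial Work, and a 16-processor
# run with a Parallel Runtime of 80 s.
pw = parallel_work(80.0, 16)
eff = efficiency_percent(1000.0, 80.0, 16)
print(pw, eff)
```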

On some platforms we cannot run the problem size on a single processor. In this case we use the runtime on the smallest number of processors for which we do have data and subtract out the measurable overhead to estimate the baseline Serial Work. Parallel Runtime is the length of the interval between the time the first processor starts computing and the time the last processor finishes. We do not start timing until sometime after the first timestep, as the initialization phase and first timestep are not typical and will not influence the performance of a (long) production run of the code.

For each of the parallel overhead categories, we graph

(Overhead Time)/(Parallel Work)
expressed as a percentage. For a 100% efficient parallel algorithm, Parallel Work would be equal to Serial Work, and thus any discrepancy represents the parallel overhead. Note that the sum of these overhead percentages is equal to the
Loss of Efficiency:
100% - Efficiency
Finally, we graph the cumulative overhead statistics:
  1. Comm
  2. Comm + Imbal
  3. Comm + Imbal + Copy
  4. Comm + Imbal + Copy + Dupl
  5. Comm + Imbal + Copy + Dupl + Rate
to make it easier to compare the contributions from the different categories.
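The overhead percentages and their cumulative sums can be sketched as follows. The category times are hypothetical, and the sketch assumes that Parallel Work is exactly the baseline Serial Work plus the five overhead categories, so that the final cumulative value equals the Loss of Efficiency.

```python
from itertools import accumulate

# Order in which the cumulative statistics are stacked.
CATEGORIES = ["Comm", "Imbal", "Copy", "Dupl", "Rate"]


def overhead_percentages(overhead_times, parallel_work):
    """(Overhead Time)/(Parallel Work) as a percentage, one per category."""
    return [100.0 * overhead_times[c] / parallel_work for c in CATEGORIES]


def cumulative(percentages):
    """Running sums: Comm, Comm + Imbal, ..., sum of all five categories."""
    return list(accumulate(percentages))


# Hypothetical times in seconds, summed over all processes.
baseline_serial_work = 1000.0
overheads = {"Comm": 128.0, "Imbal": 64.0, "Copy": 32.0,
             "Dupl": 16.0, "Rate": 40.0}

# Assumption: every second of Parallel Work is attributed to a category.
pw = baseline_serial_work + sum(overheads.values())
pct = overhead_percentages(overheads, pw)
cum = cumulative(pct)
efficiency = 100.0 * baseline_serial_work / pw

print(pct)
print(cum)
print(100.0 - efficiency)  # Loss of Efficiency
```

In measured data the attribution is an estimate, so the last cumulative value may only approximately match 100% - Efficiency.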

The current results analyze the performance using the vendor-supplied MPI communication library. (While other libraries, for example SHMEM, may provide higher performance, the modifications to CCM/MP-2D needed to use these higher-performance libraries have not been completed.) Note that these experiments do not include significant I/O. While I/O is an important part of the production use of the code, I/O is strongly dependent on the type of experiment being run, and would have made it more difficult to compare the parallel systems. Thus, these experiments represent the maximum performance that can be achieved, and a production run may experience lower performance.

Results are described for two problem sizes, T42L18 and T170L18. For T42L18, a 10-day simulation was timed, while for T170L18, a 2-day simulation was timed. A barrier was executed following the initialization (and possibly the first few timesteps), at which point timing was started. The same start time was used for all processor counts for a given problem size and platform, but not necessarily between problem sizes and platforms. Given the length of the simulations, these variations will not affect any cross-platform comparisons. However, the focus of these experiments is on understanding where and why performance is lost on a given platform. Interplatform comparisons are described here.

Note that the parallel algorithm almost always varies as the number of processors changes. The scaling of the overhead corresponds to the minimum runtimes for this application, not for any given parallel algorithm. While this is somewhat different from the typical analysis, it is what is interesting when evaluating a platform's ability to run CCM/MP-2D.

For the runtime results, we present the runtime per model day as a function of the number of processors. The timing data were processed to identify, for all simulation days after the first, the maximum time spent in a single day of simulation (time of last processor exiting simulation day - time of first processor entering simulation day) and the minimum time spent in a single day of simulation (time of first processor exiting simulation day - time of last processor entering simulation day). The maximum is what is shown in the results; however, the minimum differed from the maximum by less than one percent in these results.
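The maximum and minimum per-day times can be computed from per-processor entry and exit timestamps, as in this minimal sketch. The function name and the timestamps are hypothetical; the actual post-processing used for the study is not shown in this document.

```python
def day_time_bounds(enter_times, exit_times):
    """Return (maximum, minimum) time spent in one simulation day.

    enter_times[i] and exit_times[i] are the times at which processor i
    entered and exited the simulation day.
    """
    # Last processor exiting minus first processor entering.
    max_time = max(exit_times) - min(enter_times)
    # First processor exiting minus last processor entering.
    min_time = min(exit_times) - max(enter_times)
    return max_time, min_time


# Hypothetical timestamps (seconds) for three processors in one model day.
enter = [100.0, 100.5, 100.25]
leave = [160.0, 160.5, 160.25]
max_t, min_t = day_time_bounds(enter, leave)
print(max_t, min_t)
```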

For the execution rate results, we plot MFlops/second/processor as a function of the number of processors. In a perfectly scalable code, this second plot would be flat. Unlike traditional parallel speed-up or efficiency metrics, however, the actual performance is also visible in this plot, allowing the different platforms to be compared in a meaningful way, as here. The MFlop counts used for the rate plot are described here.
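The per-processor execution rate can be illustrated with a short sketch. The MFlop count and runtimes below are hypothetical, chosen so that the run scales perfectly and the rate stays flat.

```python
def mflops_per_second_per_processor(mflop_count, runtime_seconds, procs):
    """Execution rate: MFlop count / runtime / number of processors."""
    return mflop_count / runtime_seconds / procs


# Hypothetical fixed-size problem requiring 64000 MFlop in total.
# Perfect scaling: doubling the processors halves the runtime.
rate_16 = mflops_per_second_per_processor(64000.0, 10.0, 16)
rate_32 = mflops_per_second_per_processor(64000.0, 5.0, 32)
print(rate_16, rate_32)
```

Because the rate is an absolute quantity and not a ratio to a platform-specific baseline, curves from different platforms can be placed on the same axes.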

RESULTS:

Compaq AlphaServerSC-500
T42L18
T170L18
Compaq AlphaServerSC-667
T42L18
T170L18
Cray Research T3E-900
T42L18
T170L18
IBM SP3-200 (Winterhawk I)
T42L18
T170L18
IBM SP3-375 (Winterhawk II)
T42L18
T170L18
SGI Origin2000-250
T42L18
T170L18



Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 09:58:46 EDT.