CCM/MP-2D Performance Studies

Performance Studies using CCM/MP-2D


Parallel Systems Comparisons

Fairness is a difficult issue in comparing computer systems, especially parallel systems. If a given production code needs to be run without change, running it as is on different platforms and comparing the results is a fair measure of how well each of these systems will run the particular code. However, this is unlikely to be a fair measure of the systems in any other context.

CCM/MP-2D was designed to be tunable in certain aspects of its operation. A large number of parallel algorithm options are supported, as well as a choice of communication library. In the results below, we describe the best performance seen on the given platforms using the available tuning options, but making no changes to the source code other than linking in math libraries. The optimal tuning parameters were identified in an extensive algorithm evaluation study.

The current results measure performance using the vendor-supplied MPI communication library. (Other libraries, such as SHMEM, may provide higher performance, but the modifications to CCM/MP-2D needed to use them have not been completed.) Note that these experiments do not include significant I/O. While I/O is an important part of the production use of the code, it depends strongly on the type of experiment being run, and including it would have made it more difficult to compare the parallel systems. Thus, these experiments represent an upper bound on achievable performance, and a production run may experience lower performance.

Results are described for two problem sizes, T42L18 and T170L18. For T42L18, a 10-day simulation was timed, while for T170L18, a 2-day simulation was timed. The timing data were processed to identify, for all simulation days after the first, the maximum time spent in a single day of simulation (time of last processor exiting the simulation day minus time of first processor entering it) and the minimum time spent in a single day of simulation (time of first processor exiting the simulation day minus time of last processor entering it). The maximum is what is shown in the results below; the minimum differed from the maximum by less than one percent in these results.
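The maximum and minimum per-day times defined above can be sketched in a few lines of Python. This is only an illustration of the definitions, not the script used to process the data: the function names and the input layout (one list of entry times and one list of exit times per simulation day, indexed by processor) are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def day_time_bounds(enter: Sequence[float],
                    exit_: Sequence[float]) -> Tuple[float, float]:
    """Return (max_time, min_time) for one simulation day.

    enter[p] and exit_[p] are the wall-clock times at which processor p
    entered and exited the simulation day (hypothetical input layout).
    """
    if len(enter) != len(exit_) or len(enter) == 0:
        raise ValueError("need one entry and one exit time per processor")
    # Maximum: last processor out minus first processor in.
    max_time = max(exit_) - min(enter)
    # Minimum: first processor out minus last processor in.
    min_time = min(exit_) - max(enter)
    return max_time, min_time


def bounds_after_first_day(
    days: Sequence[Tuple[Sequence[float], Sequence[float]]]
) -> List[Tuple[float, float]]:
    """Compute the bounds for every simulation day except the first."""
    return [day_time_bounds(enter, exit_) for enter, exit_ in days[1:]]


if __name__ == "__main__":
    # Two processors, two simulation days; the first day is skipped.
    days = [
        ([0.0, 0.25], [12.0, 12.5]),    # day 1 (ignored)
        ([12.5, 13.0], [22.5, 22.75]),  # day 2
    ]
    for max_time, min_time in bounds_after_first_day(days):
        print(f"max = {max_time:.2f} s, min = {min_time:.2f} s")
```

A small gap between the two values, as reported here, indicates that the processors enter and leave each simulation day at nearly the same time.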

The results are described in two forms. The first is a traditional plot of runtime versus the number of processors. The second is a plot of MFlop/second/processor as a function of the number of processors. In a perfectly scalable code, this second plot would be flat. Unlike traditional parallel speed-up or efficiency metrics, the actual performance is also visible in this plot, allowing the different platforms to be compared in a meaningful way. The MFlop counts used for the second plot are documented separately. These counts are conservative in that expensive floating point operations such as sqrt are each counted as a single operation, which leads to understated computational rates. However, because the same counts are applied to every platform, the rates are still fair within the context of this evaluation.
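As a rough illustration of the second metric, the sketch below converts a fixed operation count and a measured runtime into MFlop/second/processor. The function name and the sample numbers are hypothetical and are not taken from the measurements reported here.

```python
def mflops_per_processor(flop_count: float,
                         runtime_seconds: float,
                         num_processors: int) -> float:
    """Return the sustained rate in MFlop/second/processor.

    flop_count is the total number of floating point operations for the
    timed interval. It is fixed for a given problem size, so it does not
    change with the processor count.
    """
    if runtime_seconds <= 0 or num_processors <= 0:
        raise ValueError("runtime and processor count must be positive")
    return flop_count / 1.0e6 / runtime_seconds / num_processors


if __name__ == "__main__":
    flops = 8.0e9  # hypothetical operation count for one simulation day
    # Hypothetical (processors, runtime in seconds) pairs.
    for procs, runtime in [(4, 10.0), (8, 5.0), (16, 3.2)]:
        rate = mflops_per_processor(flops, runtime, procs)
        print(f"{procs:3d} processors: {rate:7.2f} MFlop/s/processor")
```

With these made-up numbers, doubling the processors from 4 to 8 halves the runtime, so the per-processor rate is unchanged and the curve stays flat. At 16 processors the runtime falls by less than half, so the rate drops, and that drop is how a loss of scalability appears in this kind of plot.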

RESULTS:

Systems

Compaq AlphaServer SC at Oak Ridge National Laboratory (since upgraded):

Compaq AlphaServer SC at Oak Ridge National Laboratory:

SGI Origin2000 at Los Alamos National Laboratory:

IBM SP (200MHz) at Oak Ridge National Laboratory and at the National Energy Research Scientific Computing Center:

IBM SP (375MHz) at Oak Ridge National Laboratory:

SGI/Cray Research T3E-900 at National Energy Research Scientific Computing Center:

T42L18

T170L18



Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 09:58:46 EDT.