Fairness is a difficult issue in comparing computer systems, especially parallel systems. If a given production code needs to be run without change, running it as is on different platforms and comparing the results is a fair measure of how well each of these systems will run the particular code. However, this is unlikely to be a fair measure of the systems in any other context.
CCM/MP-2D was designed to be tunable in certain aspects of its operation. A large number of parallel algorithm options are supported, as well as a choice of communication library. In the analysis below, we use the best performance seen on the given platforms using the available tuning options, but making no changes to the source code other than linking in math libraries. The optimal tuning parameters were identified in an extensive algorithm evaluation study.
One of the CCM/MP-2D compile-time options is to link in the MPICL instrumentation library. CCM/MP-2D has been hand-instrumented using the MPICL traceevent command to capture timings for important phases of the computation, and MPICL automatically measures the time spent in MPI calls. At runtime, the user can disable instrumentation, collect summary data (profiling), or generate event traces. Event tracing is the most intrusive option and generates very large trace files even for short runs. Profiling, however, has relatively low overhead and generates manageable output files.
In this study, we analyze the achieved performance of CCM/MP-2D for each platform. CCM/MP-2D was run with profiling enabled. Timings were also compared with uninstrumented experiments to verify that the instrumentation was not perturbing the performance significantly. The instrumentation data is used to estimate the time spent in five different categories:
For each platform and problem size, we present scaling results for each of these categories. First, we present runtime and execution rate plots (described below). Next, we present an estimate for the
On some platforms we cannot run the problem size on a single processor. In this case we use the runtime on the smallest number of processors for which we do have data and subtract out the measurable overhead to estimate the baseline Serial Work. Parallel Runtime is the length of the interval between the time the first processor starts computing and the time the last processor finishes. We do not start timing until sometime after the first timestep, because the initialization phase and first timestep are not typical and will not influence the performance of a (long) production run of the code.
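The baseline estimate described above can be sketched as follows. This is only an illustration under a stated assumption: that the serial work is approximated by the non-overhead time per processor multiplied by the processor count. The function name, its parameters, and the sample numbers are hypothetical and do not come from the study.

```python
def estimate_serial_work(runtime, overhead, num_procs):
    """Estimate baseline serial work (seconds) from the smallest available run.

    runtime   -- parallel runtime on num_procs processors (seconds)
    overhead  -- measured parallel overhead per processor (seconds)
    num_procs -- smallest processor count for which data exists

    Assumption (not from the source): the useful work is divided evenly,
    so serial work ~ num_procs * (runtime - overhead).
    """
    if overhead > runtime:
        raise ValueError("overhead cannot exceed runtime")
    return num_procs * (runtime - overhead)


# Hypothetical numbers: 4 processors, 30 s runtime, 3 s of measured overhead.
serial_work = estimate_serial_work(runtime=30.0, overhead=3.0, num_procs=4)
```

Any overhead that the instrumentation does not capture stays inside this estimate, so the baseline is at best an upper bound on the true serial work.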
For each of the parallel overhead categories, we graph
The current results analyze the performance using the vendor-supplied MPI communication library. (While other libraries, for example SHMEM, may provide higher performance, the modifications to CCM/MP-2D that would allow use of these higher-performance libraries have not been completed.) Note that these experiments do not include significant I/O. While I/O is an important part of the production use of the code, I/O is strongly dependent on the type of experiment being run, and would have made it more difficult to compare the parallel systems. Thus, these experiments represent the maximum performance that can be achieved, and a production run may experience lower performance.
Results are described for two problem sizes, T42L18 and T170L18. For T42L18, a 10-day simulation was timed, while for T170L18, a 2-day simulation was timed. A barrier was executed following the initialization (and possibly the first few timesteps), at which point timing was started. The same start time was used for all processor counts for a given problem size and platform, but not necessarily between problem sizes and platforms. Given the length of the simulations, these variations will not affect any cross-platform comparisons. However, the focus of these experiments is on understanding where and why performance is lost on a given platform. Interplatform comparisons are described here.
Note that the parallel algorithm almost always varies as the number of processors changes. The scaling of the overhead corresponds to the minimum runtimes for this application, not for any given parallel algorithm. While this is somewhat different from the typical analysis, it is what is interesting when evaluating a platform's ability to run CCM/MP-2D.
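To make the selection concrete, the following minimal sketch picks, for each processor count, the tuning option with the smallest runtime. The algorithm labels and timings are hypothetical placeholders, not measurements from this study.

```python
# Hypothetical runtimes (seconds per model day), keyed by
# (algorithm option, processor count). Labels are placeholders.
runtimes = {
    ("alg_a", 16): 12.0,
    ("alg_b", 16): 10.5,
    ("alg_a", 32): 6.8,
    ("alg_b", 32): 7.1,
}


def best_per_processor_count(runtimes):
    """Return {procs: (algorithm, runtime)} holding the minimum runtime per count."""
    best = {}
    for (alg, procs), t in runtimes.items():
        if procs not in best or t < best[procs][1]:
            best[procs] = (alg, t)
    return best


best = best_per_processor_count(runtimes)
```

In this made-up data the winning option changes between 16 and 32 processors, which is the situation the paragraph above describes: the scaling curve follows the best runtime, not one fixed algorithm.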
For the runtime results, we present the runtime per model day as a function of the number of processors. The data were processed to identify, for all simulation days after the first, the maximum time spent in a single day of simulation (time of last processor exiting simulation day - time of first processor entering simulation day) and the minimum time spent in a single day of simulation (time of first processor exiting simulation day - time of last processor entering simulation day). The maximum is what is shown in the results; the minimum differed from the maximum by less than one percent in these results.
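The two per-day measures can be written down directly from their definitions. The sketch below assumes per-processor entry and exit timestamps are available for each simulation day, and that the reported values are the largest maximum and smallest minimum over the days after the first. Both the data layout and that aggregation are assumptions, and the timestamps are invented.

```python
def day_bounds(enter, leave):
    """Max and min time spent in one simulation day.

    enter[p], leave[p] -- times at which processor p enters / exits the day.
    max: last processor exiting  - first processor entering
    min: first processor exiting - last processor entering
    """
    return max(leave) - min(enter), min(leave) - max(enter)


def runtime_per_day(days):
    """days: list of (enter, leave) pairs, one per simulation day.

    The first day is skipped, as in the text. Taking the largest max and
    the smallest min across the remaining days is an assumption.
    """
    bounds = [day_bounds(enter, leave) for enter, leave in days[1:]]
    return max(b[0] for b in bounds), min(b[1] for b in bounds)


# Invented timestamps for two processors over three simulation days.
days = [
    ([0.0, 0.1], [5.0, 5.2]),    # first day, ignored
    ([5.0, 5.2], [7.5, 7.6]),
    ([7.5, 7.6], [10.0, 10.1]),
]
max_time, min_time = runtime_per_day(days)
```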
For the execution rate results, we plot MFlop/second/processor as a function of the number of processors. In a perfectly scalable code, this second plot would be flat. Unlike traditional parallel speed-up or efficiency metrics, however, the actual performance is also visible in this plot, allowing the different platforms to be compared in a meaningful way, as here. The MFlop counts used for the rate plot are described here.
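As a minimal sketch of the rate metric, assume a fixed floating-point operation count for the timed interval and divide by runtime and processor count. The function name and the numbers are hypothetical, chosen only to show why a perfectly scalable code gives a flat plot.

```python
def mflops_per_processor(flop_count, runtime_seconds, num_procs):
    """Execution rate in millions of floating-point operations
    per second per processor."""
    return flop_count / runtime_seconds / num_procs / 1.0e6


# Hypothetical: the same 3.2e9 operations take 10 s on 4 processors
# and 5 s on 8 processors (perfect scaling).
rate_4 = mflops_per_processor(3.2e9, 10.0, 4)
rate_8 = mflops_per_processor(3.2e9, 5.0, 8)
```

Because the operation count is fixed, doubling the processors while halving the runtime leaves the per-processor rate unchanged. Any drop in the curve is therefore lost efficiency, while the height of the curve shows the absolute performance of the platform.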
CCM/MP-2D Performance Studies Page
Patrick H. Worley (worleyph@ornl.gov)