PSTSWM Platform Comparision I

Performance Studies using

PSTSWM


Parallel System Comparison I

Fairness is a difficult issue in comparing computer systems, especially parallel systems. If a given production code needs to be run without change, running it as is on different platforms and comparing the results is a fair measure of how well each of these systems will run the particular code. However, this is unlikely to be a fair measure of the systems in any other context. A somewhat better approach to use when the evaluation needs to be applicable to more than a single code or fixed workload is to run a series of tests, including low level benchmarks that examine the performance in individual subsystems, kernel codes reflecting common and/or important tasks, and a selection of representative compact or full application codes. From this data, hopefully, a subset can be drawn that will reflect any given workload.

Even this tiered approach to benchmarking does not address the question of tuning the individual tests for each given platform. Benchmark suites that do consider this typically have "run" rules that define 3 levels of tuning: no tuning, moderate tuning, and anything goes (as long as the same problem is solved).

PSTSWM is used for benchmarking in two different settings. First, PSTSWM is a "compact application" in the Parallel Kernels and Benchmarks Suite (ParkBench). Second, it is part of the tuning and evaluation suite used for porting the message-passing parallel implementation of the NCAR Community Climate Model that uses a two-dimensional data decomposition (CCM/MP-2D). For CCM/MP-2D, there are a number of parallel algorithm and algorithm implementation options that can be used to tune the performance of the code on different machines. Changes outside of this set are unlikely to be implemented unless a given platform already shows promise (and is purchased by a user of CCM/MP-2D). The options supported by CCM/MP-2D are a subset of those supported by PSTSWM, and so PSTSWM is a useful tool for evaluating different platforms for use with CCM/MP-2D.

Subsets of the parallel algorithms supported in PSTSWM are "identical": changing from one to another in the subset changes only the communication cost, not the computational complexity or the load balance. In this study we compare the performance of PSTSWM across platforms for each subset. When timing on a given platform, the optimal performer within a subset for that platform is used, essentially allowing us to use "tuned" parallel algorithms for each platform, and to be somewhat fairer in the evaluation. However, these are still all message-passing algorithms, so platforms supporting a shared-memory programming paradigm will only be evaluated as a message-passing platform.

In the spectral transform method used in PSTSWM, fields are transformed at each timestep between the physical (longitude-latitude-vertical) and the Fourier (wavenumber-latitude-vertical) domains using Fourier transforms in the longitude direction, and between the Fourier and spectral (spectral coefficients - vertical) domains using a Legendre transform in the latitude direction. All parallel algorithms in PSTSWM are based on decomposing the different computational domains onto a logical two-dimensional grid of processors, PXx PY. In each domain two of the domain dimensions are decomposed across the processor grid, for example, assigning longitude-latitude patches of the physical domain to individual processors, but leaving one domain dimension undecomposed.

Two general types of parallel algorithms are used in PSTSWM: transpose and distributed. In a transpose algorithm, the decomposition is ``rotated'' before a transform begins, to ensure that all data needed to compute a particular transform is local to a single processor. In a distributed algorithm the original decomposition of the domain is retained, and communication is performed to allow the processors to cooperate in the calculation of a transform.

Three "identical" transpose algorithms are available, each of which is functionally equivalent to MPI_ALLTOALLV:

srtrans: sends P-1 messages using SENDRECV to transpose across P processors;
swtrans: sends P-1 messages using SWAP to transpose across P processors;
logtrans: sends O(log(P)) messages using SWAP to transpose across P processors.
srtrans, swtrans, and logtrans all use different orderings of interprocessor communication between processors, and logtrans sends more data than the other two. Each of these are options for both the parallel Fourier and parallel Legendre transform algorithms.

Three "identical" distributed Legendre transform algorithms are available, each of which is functionally equivalent to MPI_ALLREDUCE::

exchsum: an exchange-based algorithm for distributed vector sum, implemented using SWAP;
halfsum: a recursive halving-based algorithm for distributed vector sum, implemented using SWAP;
ringsum: a ring-based algorithm for distributed vector sum, implemented using SENDRECV.

There is one additional distributed Legendre transform algorithm:

ringpipe: a pipeline-based algorithm for distributed vector sum, implemented using SENDRECV.
ringpipe employs a different decomposition of the spectral domain than the other Legendre transform algorithms, resulting in a smaller parallel complexity for computation in the spectral domain. But the complexity of computation in the spectral domain is dominated by that in the Fourier and physical domains, and the performance differences between ringpipe and the other distributed Legendre transform algorithms are primarily due to differences in communication cost.

Finally, a distributed Fast Fourier transform is available:

dfft: sends O(log (P)) messages using SWAP to calculate Fourier transform distributed across P processors.
Note that dfft is not equivalent to the transpose-based Fourier transform algorithms due to the differing effect on the distribution of spectral coefficients and load balancing, and so is treated as a separate subset.

Each of these algorithms can be implemented using two general classes of message-passing protocols: unordered and ordered. Two different types of implementations are also supported: overlap, in which we attempt to hide latency and overlap computation and communication, and nonoverlap. See here for a more complete description of these options. In this study we use the optimal protocols and implementation options (identified empirically) for each parallel algorithm when identifying the "optimal" parallel algorithm from a given subset for a given platform.

To examine the performance difference between the different parallel platforms for a given aplgorithm subset, we run one or more of the following experiments.

A: We use one-dimensional parallelizations of the form 8x1 or 1x8, 16x1 or 1x16, and 32x1 or 1x32, where the first decomposition in each pair is for examining parallel Fourier transform algorithms, and the second is for examining parallel Lengendre transform algorithms. The problem sizes are based on T42L16 (64x128x16 physical domain grid) and T85L32 (128x256x32 physical domain grid) as they would appear using a two-dimensional parallelization of the form 8x8, 16x32, or 32x16. This is accomplished by modifying the problem size to achieve the desired granularity (problem size per processor). Experiment A allows us to examine the performance of individual parallel algorithms in isolation for problem granularities that are typical of what would be seen in production.

In actual practice, each parallel algorithm will run on subsets of processors in a two-dimensional virtual processor grid. Thus Experiment A does not take into account the performance effect of a realistic assignment of processes to processors when the virtual processor grid does not match the physical interconnection topology, nor the contention caused by multiple subsets running the same parallel algorithm. This experiment also overestimates the importance of the performance variation due to the choice of communication protocol. The measured runtimes will be smaller than those for the corresponding "real" two-dimensional runs since they do not include the communication cost for the parallel algorithm in the other direction.

B: We use two-dimensional parallelizations of the forms 8x8, 16x4 or 4x16, and 32x2 or 2x32. The "larger" dimension is used to examine the parallel Fourier or Legendre transform options, while a fixed parallel algorithm is used for the "smaller" dimension. The parallel Fourier transform routines are paired with ringsum using a nonoverlap implementation option. The parallel Legendre transforms are paired with either dfft using (1,6) or swtrans using a nonoverlap implementation. A row-major processor ordering is used, with processor row 1 assigned processors {0,...,PX-1}, processor row 2 assigned processors {PX,...,2*PX-1}, etc. As in experiment A, problem sizes are based on T42L16 and T85L32 as they would appear on a two dimensional grid of size 8x8, 16x32, or 32x16.

Experiment B allows us to examine one example of the contention for resources when running multiple copies of the parallel algorithm. It is not the same amount of contention as would be seen in a real run for the 16 and 32 processor cases, but it should provide us with some information. This approach also has serious problems in that the "other" parallel algorithm is not optimized for the given platform, and if it coincidentally runs better on one platform than another, this will bias the results. However, the choice of the fixed parallel algorithm minimizes this somewhat, as they do not attempt to exploit any performance-sensitive features of the platforms.

Note also that this experiment underestimates the performance variation due to the communication protocol in that the execution in a realistic run will be smaller. The default parallel algorithm for the other dimension uses a communication protocol that is generally one of the worst performers. It was chosen because it is the only option that is supported across all of the message passing libraries that PSTSWM supports.

C: We use the same two-dimensional decompositions, default parallel algorithms, and problem sizes as in experiment B, but now use a column-major processor ordering. That is, processor column 1 is assigned processors {0,...,PY-1}, processor column 2 is assigned processors {PY,...,2*PY-1}, etc. This generates a different process placement and potentially different contention.

Results are presented as scatterplots for each experiment and platform. (The datapoints for each platform are connected, to make it simpler to identify the platforms. However, there is no meaning to the value of the lines between the x-axis tick marks.) The x-axis is the (problem size,number of processors) pair, and the y-axis is the relative performance (time/min) for 12 timesteps of PSTSWM using the standard benchmark problem case (global steady state nonlinear zonal geostrophic flow). Here time is the minimum time seen on the given platform for the given parallel algorithm subset. Statistics summarizing the data are presented below each plot.

Two plots are generated for each parallel algorithm subset:

1) Each x-axis tick mark will have results for the optimal MPI implementation for each platform. Below the plot is a table indicating the parallel platform with the smallest timing results.
2) Same as (1), except that the optimum is taken over both MPI and any "native" message-passsing libraries for each platform.

Results are presented for the distributed Fourier transform algorithms, for the distributed Legendre transform algorithms, and for three different experiments using the transpose algorithms. The second transpose experiment uses a Px1 decomposition, representing the parallel Fourier transform. The other two use a 1xP decomposition, representing a parallel Legendre transform, but where the first assumes a distributed parallel Fourier transform while the other assumes a transpose parallel Fourier transform. The distributed Legendre transform results are also presented in two different ways, the first including ringpipe, and the second excluding it.

Note that timings are for the entire code, and that differences in the computational rate of the "serial" code may well dominate the measured performance differences. These results in no way indicate the efficiency of the parallel algorithms on the different machines.

RESULTS:

A:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
B:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
C:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II

PSTSWM Performance Page


Patrick H. Worley / ( worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 09:59:44 EDT.
82916 accesses since 1/2/96.