PSTSWM Algorithm Comparision I

Performance Studies using

PSTSWM


Parallel Algorithm Comparison I

A variety of parallel algorithms are supported in PSTSWM. Subsets of these are "identical": changing from one to another in the subset changes only the communication cost, not the computational complexity or the load balance. In this study we compare the performance of the different parallel algorithms within a subset for each platform and message-passing library, determining both which parallel algorithms are optimal (and when), and the sensitivity of performance to the choice of the parallel algorithm. We also compare the performance of the individual parallel algorithms as implemented using different message-passing libraries.

In the spectral transform method used in PSTSWM, fields are transformed at each timestep between the physical (longitude-latitude-vertical) and the Fourier (wavenumber-latitude-vertical) domains using Fourier transforms in the longitude direction, and between the Fourier and spectral (spectral coefficients - vertical) domains using a Legendre transform in the latitude direction. All parallel algorithms in PSTSWM are based on decomposing the different computational domains onto a logical two-dimensional grid of processors, PXx PY. In each domain two of the domain dimensions are decomposed across the processor grid, for example, assigning longitude-latitude patches of the physical domain to individual processors, but leaving one domain dimension undecomposed.

Two general types of parallel algorithms are used in PSTSWM: transpose and distributed. In a transpose algorithm, the decomposition is ``rotated'' before a transform begins, to ensure that all data needed to compute a particular transform is local to a single processor. In a distributed algorithm the original decomposition of the domain is retained, and communication is performed to allow the processors to cooperate in the calculation of a transform.

Three "identical" transpose algorithms are available, each of which is functionally equivalent to MPI_ALLTOALLV:

srtrans: sends P-1 messages using SENDRECV to transpose across P processors;
swtrans: sends P-1 messages using SWAP to transpose across P processors;
logtrans: sends O(log(P)) messages using SWAP to transpose across P processors.
srtrans, swtrans, and logtrans all use different orderings of interprocessor communication between processors, and logtrans sends more data than the other two. Each of these are options for both the parallel Fourier and parallel Legendre transform algorithms.

Three "identical" distributed Legendre transform algorithms are available, each of which is functionally equivalent to MPI_ALLREDUCE:

exchsum: an exchange-based algorithm for distributed vector sum, implemented using SWAP;
halfsum: a recursive halving-based algorithm for distributed vector sum, implemented using SWAP;
ringsum: a ring-based algorithm for distributed vector sum, implemented using SENDRECV.

There is one additional distributed Legendre transform algorithm:

ringpipe: a pipeline-based algorithm for distributed vector sum, implemented using SENDRECV.
ringpipe employs a different decomposition of the spectral domain than the other Legendre transform algorithms, resulting in a smaller parallel complexity for computation in the spectral domain. But the complexity of computation in the spectral domain is dominated by that in the Fourier and physical domains, and the performance differences between ringpipe and the other distributed Legendre transform algorithms are primarily due to differences in communication cost.

Finally, a distributed Fast Fourier transform is available:

dfft: sends O(log (P)) messages using SWAP to calculate Fourier transform distributed across P processors.
dfft can not be compared to the transpose-based Fourier transform algorithms using the experiments described below, but comparisons will be made between the performance using different communication libraries. Comparisons between distributed and transpose algorithms can be found here.

Each of these algorithms can be implemented using two general classes of message-passing protocols: unordered and ordered. Two different types of implementations are also supported: overlap, in which we attempt to hide latency and overlap computation and communication, and nonoverlap. See here for a more complete description of these options. In this study we use the optimal protocols and implementation options (identified empirically) for each parallel algorithm when making comparisons.

To examine the performance variation between the different parallel algorithms, we run one or more of the following experiments.

A: We use one-dimensional parallelizations of the form 8x1 or 1x8, 16x1 or 1x16, and 32x1 or 1x32, where the first decomposition in each pair is for examining parallel Fourier transform algorithms, and the second is for examining parallel Lengendre transform algorithms. The problem sizes are based on T42L16 (64x128x16 physical domain grid) and T85L32 (128x256x32 physical domain grid) as they would appear using a two-dimensional parallelization of the form 8x8, 16x32, or 32x16. This is accomplished by modifying the problem size to achieve the desired granularity (problem size per processor). Experiment A allows us to examine the performance of individual parallel algorithms in isolation for problem granularities that are typical of what would be seen in production.

In actual practice, each parallel algorithm will run on subsets of processors in a two-dimensional virtual processor grid. Thus Experiment A does not take into account the performance effect of a realistic assignment of processes to processors when the virtual processor grid does not match the physical interconnection topology, nor the contention caused by multiple subsets running the same parallel algorithm. This experiment also overestimates the importance of the performance variation due to the choice of communication protocol. The measured runtimes will be smaller than those for the corresponding "real" two-dimensional runs since they do not include the communication cost for the parallel algorithm in the other direction.

B: We use two-dimensional parallelizations of the forms 8x8, 16x4 or 4x16, and 32x2 or 2x32. The "larger" dimension is used to examine the parallel Fourier or Legendre transform options, while a fixed parallel algorithm is used for the "smaller" dimension. The parallel Fourier transform routines are paired with ringsum using a nonoverlap implementation option. The parallel Legendre transforms are paired with either dfft using (1,6) or swtrans using a nonoverlap implementation. A row-major processor ordering is used, with processor row 1 assigned processors {0,...,PX-1}, processor row 2 assigned processors {PX,...,2*PX-1}, etc. As in experiment A, problem sizes are based on T42L16 and T85L32 as they would appear on a two dimensional grid of size 8x8, 16x32, or 32x16.

Experiment B allows us to examine one example of the contention for resources when running multiple copies of the parallel algorithm. It is not the same amount of contention as would be seen in a real run for the 16 and 32 processor cases, but it should provide us with some information. Note that this experiment underestimates the performance variation due to the communication protocol in that the execution in a realistic run will be smaller. The default parallel algorithm for the other dimension uses a communication protocol that is generally one of the worst performers. It was chosen because it is the only option that is supported across all of the message passing libraries that PSTSWM supports.

C: We use the same two-dimensional decompositions, default parallel algorithms, and problem sizes as in experiment B, but now use a column-major processor ordering. That is, processor column 1 is assigned processors {0,...,PY-1}, processor column 2 is assigned processors {PY,...,2*PY-1}, etc. This generates a different process placement and potentially different contention.

Results are presented as scatterplots for each parallel platform, experiment, and parallel algorithm subset. The x-axis is the (problem size,number of processors) pair, and the y-axis is the relative performance degradation ((time - min)/min) for 12 timesteps of PSTSWM using the standard benchmark problem case (global steady state nonlinear zonal geostrophic flow). Statistics summarizing the data are presented below each plot.

Up to three plots are generated for each parallel algorithm subset:

1) Each x-axis tick mark will have results for the optimal MPI implementation of each parallel algorithm, as well as a generic, nonoverlap MPI implementation of one of the parallel algorithms. For the transpose subsets, we use srtrans. For the distributed Legendre transpose, we use ringsum. Below the plot is a table indicating the best parallel algorithm and the performance degradation from using
2) Same as (1), except using a different message-passsing library for the platform.
3) Each x-axis tick mark will have results for the optimal MPI parallel algorithm, the optimal non-MPI parallel algorithm, and the MPI collective communication routine implementation. The table below the plot indicates the best parallel algorithm and the performance degradation from using the optimal MPI parallel algorithm (instead of the true optimal) and the performance degradation from using the MPI collective communication routine.

Results are presented for the distributed Fourier transform algorithms, for the distributed Legendre transform algorithms, and for three different experiments using the transpose algorithms. The second transpose experiment uses a Px1 decomposition, representing the parallel Fourier transform. The other two use a 1xP decomposition, representing a parallel Legendre transform, but where the first assumes a distributed parallel Fourier transform while the other assumes a transpose parallel Fourier transform. The distributed Legendre transform results are also presented in two different ways, the first including ringpipe, and the second excluding it.

Note that timings are for the entire code, and that percentage differences in communication costs will generally be much higher than the percentage differences in total runtime reported for comparable tests cases.

RESULTS:

Compaq AlphaServer SC
A:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
Cray Research T3D
A:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
Cray Research T3E-900
A:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
B:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
C:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
IBM SP2-66
A1:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
A2:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
IBM SP3-200 (Winterhawk I)
(without shared memory implementation for intranode communication)
A1:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
A2:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
A3:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
B1:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
B2:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
C:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
IBM SP3-200 (Winterhawk I)
(with shared memory implementation for intranode communication)
A:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
IBM SP3-375 (Winterhawk II)
A:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
B:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
C:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
Intel Paragon
OSF/MPI and OSF/NX
A1:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
OSF/NX and SUNMOS
A2:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
SGI Origin2000-195
A:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
B:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
C:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
SGI Origin2000-250
A:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
B:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II
C:     Distributed Fourier Transform Distributed Legendre Tranform I
Distributed Legendre Transform II Transpose Fourier Transform
Transpose Legendre Transform I Transpose Legendre Transform II

PSTSWM Performance Page


Patrick H. Worley / ( worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 09:58:30 EDT.
5275 accesses since 1/2/96.