|
|
A variety of parallel algorithms are supported in PSTSWM. Subsets of these are "identical": changing from one to another in the subset changes only the communication cost, not the computational complexity or the load balance. In this study we compare the performance of the different parallel algorithms within a subset for each platform and message-passing library, determining both which parallel algorithms are optimal (and when), and the sensitivity of performance to the choice of the parallel algorithm. We also compare the performance of the individual parallel algorithms as implemented using different message-passing libraries.
In the spectral transform method used in PSTSWM, fields are transformed at each timestep between the physical (longitude-latitude-vertical) and the Fourier (wavenumber-latitude-vertical) domains using Fourier transforms in the longitude direction, and between the Fourier and spectral (spectral coefficients - vertical) domains using a Legendre transform in the latitude direction. All parallel algorithms in PSTSWM are based on decomposing the different computational domains onto a logical two-dimensional grid of processors, PXx PY. In each domain two of the domain dimensions are decomposed across the processor grid, for example, assigning longitude-latitude patches of the physical domain to individual processors, but leaving one domain dimension undecomposed.
Two general types of parallel algorithms are used in PSTSWM: transpose and distributed. In a transpose algorithm, the decomposition is ``rotated'' before a transform begins, to ensure that all data needed to compute a particular transform is local to a single processor. In a distributed algorithm the original decomposition of the domain is retained, and communication is performed to allow the processors to cooperate in the calculation of a transform.
Three "identical" transpose algorithms are available, each of which is functionally equivalent to MPI_ALLTOALLV:
Three "identical" distributed Legendre transform algorithms are available, each of which is functionally equivalent to MPI_ALLREDUCE:
There is one additional distributed Legendre transform algorithm:
Finally, a distributed Fast Fourier transform is available:
Each of these algorithms can be implemented using two general classes of message-passing protocols: unordered and ordered. Two different types of implementations are also supported: overlap, in which we attempt to hide latency and overlap computation and communication, and nonoverlap. See here for a more complete description of these options. In this study we use the optimal protocols and implementation options (identified empirically) for each parallel algorithm when making comparisons.
To examine the performance variation between the different parallel algorithms, we run one or more of the following experiments.
In actual practice, each parallel algorithm will run on subsets of processors in a two-dimensional virtual processor grid. Thus Experiment A does not take into account the performance effect of a realistic assignment of processes to processors when the virtual processor grid does not match the physical interconnection topology, nor the contention caused by multiple subsets running the same parallel algorithm. This experiment also overestimates the importance of the performance variation due to the choice of communication protocol. The measured runtimes will be smaller than those for the corresponding "real" two-dimensional runs since they do not include the communication cost for the parallel algorithm in the other direction.
Experiment B allows us to examine one example of the contention for resources when running multiple copies of the parallel algorithm. It is not the same amount of contention as would be seen in a real run for the 16 and 32 processor cases, but it should provide us with some information. Note that this experiment underestimates the performance variation due to the communication protocol in that the execution in a realistic run will be smaller. The default parallel algorithm for the other dimension uses a communication protocol that is generally one of the worst performers. It was chosen because it is the only option that is supported across all of the message passing libraries that PSTSWM supports.
Results are presented as scatterplots for each parallel platform, experiment, and parallel algorithm subset. The x-axis is the (problem size,number of processors) pair, and the y-axis is the relative performance degradation ((time - min)/min) for 12 timesteps of PSTSWM using the standard benchmark problem case (global steady state nonlinear zonal geostrophic flow). Statistics summarizing the data are presented below each plot.
Up to three plots are generated for each parallel algorithm subset:
Results are presented for the distributed Fourier transform algorithms, for the distributed Legendre transform algorithms, and for three different experiments using the transpose algorithms. The second transpose experiment uses a Px1 decomposition, representing the parallel Fourier transform. The other two use a 1xP decomposition, representing a parallel Legendre transform, but where the first assumes a distributed parallel Fourier transform while the other assumes a transpose parallel Fourier transform. The distributed Legendre transform results are also presented in two different ways, the first including ringpipe, and the second excluding it.
Note that timings are for the entire code, and that percentage differences in communication costs will generally be much higher than the percentage differences in total runtime reported for comparable tests cases.
Patrick H. Worley / (
worleyph@ornl.gov)