PSTSWM Algorithm Comparision II

Performance Studies using

PSTSWM


Parallel Algorithm Comparison II

UNDER CONTRUCTION - COME BACK LATER

A very large number of parallel algorithms is supported in PSTSWM, and it is time consuming to examine all possibilities in order to determine the optimal algorithms. While we have done this in the past, we currently use PSTSWM for more limited studies. We save the more exhaustive studies for the global atmospheric circulation model CCM/MP-2D.

In these studies we examine two classes of parallel algorithms, looking for optimal algorithms within each class, and comparing performance between the two classes. The classes were chosen for relevance to the parallel algorithms supported in CCM/MP-2D. They also represent extremes in the algorithm space, as will be described in more detail below, and have been useful studying the performance sensitivities of message passing on different parallel systems.

In the spectral transform method used in PSTSWM, fields are transformed at each timestep between the physical (longitude-latitude-vertical) and the Fourier (wavenumber-latitude-vertical) domains using Fourier transforms in the longitude direction, and between the Fourier and spectral (spectral coefficients - vertical) domains using a Legendre transform in the latitude direction. All parallel algorithms in PSTSWM are based on decomposing the different computational domains onto a logical two-dimensional grid of processors, PXx PY. In each domain two of the domain dimensions are decomposed across the processor grid, for example, assigning longitude-latitude patches of the physical domain to individual processors, but leaving one domain dimension undecomposed.

Two general types of parallel algorithms are used in PSTSWM: transpose and distributed. In a transpose algorithm, the decomposition is ``rotated'' before a transform begins, to ensure that all data needed to compute a particular transform is local to a single processor. In a distributed algorithm the original decomposition of the domain is retained, and communication is performed to allow the processors to cooperate in the calculation of a transform.

Three "identical" transpose algorithms are available, each of which is functionally equivalent to MPI_ALLTOALLV:

srtrans: sends P-1 messages using SENDRECV to transpose across P processors;
swtrans: sends P-1 messages using SWAP to transpose across P processors;
logtrans: sends O(log(P)) messages using SWAP to transpose across P processors.
srtrans, swtrans, and logtrans all use different orderings of interprocessor communication between processors, and logtrans sends more data than the other two. Each of these are options for both the parallel Fourier and parallel Legendre transform algorithms.

Three "identical" distributed Legendre transform algorithms are available, each of which is functionally equivalent to MPI_ALLREDUCE:

exchsum: an exchange-based algorithm for distributed vector sum, implemented using SWAP;
halfsum: a recursive halving-based algorithm for distributed vector sum, implemented using SWAP;
ringsum: a ring-based algorithm for distributed vector sum, implemented using SENDRECV.

There is one additional distributed Legendre transform algorithm:

ringpipe: a pipeline-based algorithm for distributed vector sum, implemented using SENDRECV.
ringpipe employs a different decomposition of the spectral domain than the other Legendre transform algorithms, resulting in a smaller parallel complexity for computation in the spectral domain. But the complexity of computation in the spectral domain is dominated by that in the Fourier and physical domains, and the performance differences between ringpipe and the other distributed Legendre transform algorithms are primarily due to differences in communication cost.

Finally, a distributed Fast Fourier transform is available:

dfft: sends O(log (P)) messages using SWAP to calculate Fourier transform distributed across P processors.

Each of these algorithms can be implemented using two general classes of message-passing protocols: unordered and ordered. Two different types of implementations are also supported: overlap, in which we attempt to hide latency and overlap computation and communication, and nonoverlap. See here for a more complete description of these options.

A parallel algorithm for PSTSWM consists of pairing a parallel algorithm for the Fourier transform with a parallel algorithm for the Legendre transform. There are five classes of parallel algorithms supported in PSTSWM:

  1. distributed FFT / distributed LT
  2. distributed FFT / transpose LT
  3. transpose FFT / transpose LT
  4. transpose FFT / distributed LT
  5. double transpose FFT / distributed LT
Classes 2-4 decompose the vertical domain in spectral space. In the three dimensional models that PSTSWM was designed to emulate, the computation in spectral space involves a linear system solution over the vertical dimension for some of the fields. To model this, PSTSWM requires that the vertical dimension not be decomposed in spectral space for these fields, and an additional pair of transposes is added to these parallel algorithms to effect this.

Algorithm class 5 uses a transpose FFT that decomposes over both vertical level and number of fields, followed by another transpose that undecomposes the vertical and field "dimensions" and decomposes instead over the wavenumber dimension. This requires somewhat more data movement than the other transpose algorithms, but it has the minimum load imbalance and does not require the extra tranposes in spectral space. It is also the only transpose algorithm currently supported in CCM/MP-2D.

In our original studies, we examined and compared all members of algorithm classes 1-4 for all aspect ratios and many different problems sizes. This was a very time-consuming exercise, and we have since narrowed our focus to experiments that address specific performance questions. More recently we have concentrated on examining two particular algorithm subclasses:

on square or nearly-square domain decompositions. These two subclasses represent algorithmic extremes, with the first having the maximum opportunity for overlapping communication and computation, while the interprocessor communication for the second can be implemented using only collective communication routines. The earlier exhaustive studies indicated that squarish domains were typically near optimal, so this restriction is not of great consequence. Note, however, that square domains are not typically optimal for CCM/MP-2D, due to the effects of the other parallel algorithms found in the full model.

In the following studies, the results from the one-dimensional tuning studies are used to first identify a set of good parallel algorithms and algorithm implementations.

PSTSWM Performance Page


Patrick H. Worley / ( worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 10:36:15 EDT.
5210 accesses since 1/2/96.