|
|
In these studies we examine two classes of parallel algorithms, looking for optimal algorithms within each class, and comparing performance between the two classes. The classes were chosen for relevance to the parallel algorithms supported in CCM/MP-2D. They also represent extremes in the algorithm space, as will be described in more detail below, and have been useful studying the performance sensitivities of message passing on different parallel systems.
In the spectral transform method used in PSTSWM, fields are transformed at each timestep between the physical (longitude-latitude-vertical) and the Fourier (wavenumber-latitude-vertical) domains using Fourier transforms in the longitude direction, and between the Fourier and spectral (spectral coefficients - vertical) domains using a Legendre transform in the latitude direction. All parallel algorithms in PSTSWM are based on decomposing the different computational domains onto a logical two-dimensional grid of processors, PXx PY. In each domain two of the domain dimensions are decomposed across the processor grid, for example, assigning longitude-latitude patches of the physical domain to individual processors, but leaving one domain dimension undecomposed.
Two general types of parallel algorithms are used in PSTSWM: transpose and distributed. In a transpose algorithm, the decomposition is ``rotated'' before a transform begins, to ensure that all data needed to compute a particular transform is local to a single processor. In a distributed algorithm the original decomposition of the domain is retained, and communication is performed to allow the processors to cooperate in the calculation of a transform.
Three "identical" transpose algorithms are available, each of which is functionally equivalent to MPI_ALLTOALLV:
Three "identical" distributed Legendre transform algorithms are available, each of which is functionally equivalent to MPI_ALLREDUCE:
There is one additional distributed Legendre transform algorithm:
Finally, a distributed Fast Fourier transform is available:
Each of these algorithms can be implemented using two general classes of message-passing protocols: unordered and ordered. Two different types of implementations are also supported: overlap, in which we attempt to hide latency and overlap computation and communication, and nonoverlap. See here for a more complete description of these options.
A parallel algorithm for PSTSWM consists of pairing a parallel algorithm for the Fourier transform with a parallel algorithm for the Legendre transform. There are five classes of parallel algorithms supported in PSTSWM:
Algorithm class 5 uses a transpose FFT that decomposes over both vertical level and number of fields, followed by another transpose that undecomposes the vertical and field "dimensions" and decomposes instead over the wavenumber dimension. This requires somewhat more data movement than the other transpose algorithms, but it has the minimum load imbalance and does not require the extra tranposes in spectral space. It is also the only transpose algorithm currently supported in CCM/MP-2D.
In our original studies, we examined and compared all members of algorithm classes 1-4 for all aspect ratios and many different problems sizes. This was a very time-consuming exercise, and we have since narrowed our focus to experiments that address specific performance questions. More recently we have concentrated on examining two particular algorithm subclasses:
In the following studies, the results from the one-dimensional tuning studies are used to first identify a set of good parallel algorithms and algorithm implementations.
Patrick H. Worley / (
worleyph@ornl.gov)