PSTSWM Communication Characterization

Performance Studies using

PSTSWM


Message-Passing Protocol Sensitivity

Communication overhead for parallel runs of PSTSWM on a given platform is a function of the parallel algorithm, the parallel algorithm implementation, and the problem granularity (problem size and number of processors). One aspect of the parallel implementation that we have found to be important on some platforms is the message-passing protocol. Some indication of the variation that can occur from the choice of message-passing protcol can be seen from the basic point-to-point communication performance results described here. In these studies, we examine the impact of the choice of protocol in more detail, looking at the effect on the performance of specific parallel algorithm options in PSTSWM.

In the spectral transform method used in PSTSWM, fields are transformed at each timestep between the physical (longitude-latitude-vertical) and the Fourier (wavenumber-latitude-vertical) domains using Fourier transforms in the longitude direction, and between the Fourier and spectral (spectral coefficients - vertical) domains using a Legendre transform in the latitude direction. All parallel algorithms in PSTSWM are based on decomposing the different computational domains onto a logical two-dimensional grid of processors, PXx PY. In each domain two of the domain dimensions are decomposed across the processor grid, for example, assigning longitude-latitude patches of the physical domain to individual processors, but leaving one domain dimension undecomposed.

Two general types of parallel algorithms are used in PSTSWM: transpose and distributed. In a transpose algorithm, the decomposition is ``rotated'' before a transform begins, to ensure that all data needed to compute a particular transform is local to a single processor. In a distributed algorithm the original decomposition of the domain is retained, and communication is performed to allow the processors to cooperate in the calculation of a transform.

Three transpose algorithms are examined, each of which is functionally equivalent to MPI_ALLTOALLV:

srtrans: sends P-1 messages using SENDRECV to transpose across P processors;
swtrans: sends P-1 messages using SWAP to transpose across P processors;
logtrans: sends O(log(P)) messages using SWAP to transpose across P processors.
srtrans, swtrans, and logtrans all use different orderings of interprocessor communication between processors, and logtrans sends more data than the other two. Each of these are options for both the parallel Fourier and parallel Legendre transform algorithms.

Four distributed Legendre transform algorithms are examined, the first three of which are functionally equivalent to MPI_ALLREDUCE:

exchsum: an exchange-based algorithm for distributed vector sum, implemented using SWAP;
halfsum: a recursive halving-based algorithm for distributed vector sum, implemented using SWAP;
ringsum: a ring-based algorithm for distributed vector sum, implemented using SENDRECV;
ringpipe: a pipeline-based algorithm for distributed vector sum, implemented using SENDRECV.
Finally, one distributed Fast Fourier transform is examined:
dfft: sends O(log (P)) messages using SWAP to calculate Fourier transform distributed across P processors.

Each of these algorithms can be implemented using two general classes of protocols: unordered and ordered. While not all protocols are available for all message-passing transport layers, they are drawn from those described below. Examples are given for the SWAP command using MPI commands. Similar definitions hold for the SENDRECV command. Note that the examples have been simplified (to save room) and do not accurately represent the MPI implementations. In particular, handshaking messages required for correct use of the ready send command have been omitted.

Unordered

 

Ordered

(0,0): simple
Processor 1 and 2
MPI_BSEND
MPI_RECV
(1,0): simple
Processor 1 Processor 2
MPI_SEND MPI_RECV
MPI_RECV MPI_SEND
(0,1): nonblocking send
Processor 1 and 2
MPI_ISEND
MPI_RECV
(1,1): nonblocking send
Processor 1 Processor 2
MPI_ISEND MPI_RECV
MPI_RECV MPI_SEND
(0,2): nonblocking receive
Processor 1 and 2
MPI_IRECV
MPI_SEND
(1,2): nonblocking receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_SEND MPI_SEND
(0,3): nonblocking send and receive
Processor 1 and 2
MPI_IRECV
MPI_ISEND
(1,3): nonblocking send and receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_ISEND MPI_SEND
(0,4): ready send
Processor 1 and 2
MPI_IRECV
MPI_RSEND
(1,4): ready send
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_RSEND MPI_RSEND
(0,5): nonblocking ready send
Processor 1 and 2
MPI_IRECV
MPI_IRSEND
(1,5): nonblocking ready send
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_IRSEND MPI_RSEND
(0,6): native sendrecv
Processor 1 and 2
MPI_SENDRECV
 
 
 
(1,6): synchronous
Processor 1 Processor 2
- MPI_RECV
MPI_SEND -
- MPI_SEND
MPI_RECV -
Two different types of implementations are also supported. The first uses the basic SWAP and SENDRECV command to exchange the data. The second reorders the elements of the SWAP or SENDRECV protocol to prepost the nonblocking receives and delay waiting for the nonblocking commands to complete, in an attempt to overlap communication with computation and hide communication latency. These algorithms and protocols are described in more detail in

available from here.

To examine the performance issues in these different implementation options, we run one or more of the following experiments.

A: We use one-dimensional parallelizations of the form 8x1 or 1x8, 16x1 or 1x16, and 32x1 or 1x32, where the first decomposition in each pair is for examining parallel Fourier transform algorithms, and the second is for examining parallel Lengendre transform algorithms. The problem sizes are based on T42L16 (64x128x16 physical domain grid) and T85L32 (128x256x32 physical domain grid) as they would appear using a two-dimensional parallelization of the form 8x8, 16x32, or 32x16. This is accomplished by modifying the problem size to achieve the desired granularity (problem size per processor). Experiment A allows us to examine the performance of individual parallel algorithms in isolation for problem granularities that are typical of what would be seen in production.

In actual practice, each parallel algorithm will run on subsets of processors in a two-dimensional virtual processor grid. Thus Experiment A does not take into account the performance effect of a realistic assignment of processes to processors when the virtual processor grid does not match the physical interconnection topology, nor the contention caused by multiple subsets running the same parallel algorithm. This experiment also overestimates the importance of the performance variation due to the choice of communication protocol. The measured runtimes will be smaller than those for the corresponding "real" two-dimensional runs since they do not include the communication cost for the parallel algorithm in the other direction.

B: We use two-dimensional parallelizations of the forms 8x8, 16x4 or 4x16, and 32x2 or 2x32. The "larger" dimension is used to examine the parallel Fourier or Legendre transform options, while a fixed parallel algorithm is used for the "smaller" dimension. The parallel Fourier transform routines are paired with ringsum using the (1,6) communication protocol. The parallel Legendre transforms are paired with either dfft using (1,6) or swtrans using (1,6). A row-major processor ordering is used, with processor row 1 assigned processors {0,...,PX-1}, processor row 2 assigned processors {PX,...,2*PX-1}, etc. As in experiment A, problem sizes are based on T42L16 and T85L32 as they would appear on a two dimensional grid of size 8x8, 16x32, or 32x16.

Experiment B allows us to examine one example of the contention for resources when running multiple copies of the parallel algorithm. It is not the same amount of contention as would be seen in a real run for the 16 and 32 processor cases, but it should provide us with some information. Note that this experiment underestimates the performance variation due to the communication protocol for the 8x8 case in that the execution in a realistic run will be smaller. The default parallel algorithm for the other dimension uses a communication protocol that is generally one of the worst performers. It was chosen because it is the only option that is supported across all of the message passing libraries that PSTSWM supports.

C: We use the same two-dimensional decompositions, default parallel algorithms, and problem sizes as in experiment B, but now use a column-major processor ordering. That is, processor column 1 is assigned processors {0,...,PY-1}, processor column 2 is assigned processors {PY,...,2*PY-1}, etc. This generates a different process placement and potentially different contention.

Results are presented as scatterplots for each parallel platform, set of experiments, parallel algorithm, problem size, and number of processors. The x-axis is the protocol option (0-6) and the y-axis is the execution time for 12 timesteps of PSTSWM using the standard benchmark problem case (global steady state nonlinear zonal geostrophic flow). Each x-axis tick mark will have results for unordered, unordered/overlap; ordered, and ordered/overlap protocols. Some of the algorithms have results presented for different degrees of overlap, with overlap indicating a modest degree of overlap and OVERLAP indicating the maximum possible. The ringpipe algorithm also has additional send-ahead, send-ahead/overlap, and send-ahead/OVERLAP options, in which the message send and message receive are separated by significant computation. Three separate experimental results are presented for each transpose algorithm. The second uses a Px1 decomposition, representing the parallel Fourier transform. The other two use a 1xP decomposition, representing a parallel Legendre transform, but where the first assumes a distributed parallel Fourier transform while the other assumes a transpose parallel Fourier transform. Note that these three transpose experiments differ primarily in the size of the messages being transmitted.

Statistics summarizing the data are presented below each plot:

A summary of the results is also provided for each platform and message-passing library combination. For each parallel algorithm, we indicate

If contention and process placement do not have significant impact on performance, than the Experiment B and C results will show the same optimal protocols but less relative variation than Experiment A. (Experiments B and C will have increased runtime compared to A, due to the communication costs arising from the fixed parallel algorithm). If contention or process placement are important, then the performance variablity may increase for Experiments B and C, or the optimal protocols may differ between Experiments A, B, and C.

If results for more than one message-passing library are provided for a given platform, then performance differences between the libraries are discussed. In particular, we compare the optimal protocols, the sensitivity to choice of protocol, and the sensitivity to contention and process placement.

Note that timings are for the entire code, and that percentage differences in communication costs will generally be much higher than the percentage differences in total runtime reported for comparable tests cases. Changing the communication protocols does not change the complexity of the runs and does not significantly change the total amount of data moved. What is affected is the number and order of handshaking messages, and whether bidirectional bandwidth or communication/computation overlap are attempted. The one case where the computation may be affected is the distributed FFT dfft. In order to exploit overlap, the block FFT is divided into two blocks. On some machines and with some problem granularities, transforming two small blocks results in a higher computational rate than transforming a single large block. This should be taken into account when comparing dfft protocols.

Also note that when comparing between platforms, the sensitivity is a strong function of the communication/computation ratio, and a platform with a slow processor may show significantly less performance sensitivity than one with a fast processor. Thus high performance sensitivity to the communication protocol is more of a feature than a bug, but it is a feature that should be recognized if goos performance is to be achieved.

RESULTS:

Convex SPP-1200
MPI
Compaq AlphaServer SC
MPI
 
Summary of MPI Results
 
A:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
Cray Research T3D
SHMEM
 
Summary of SHMEM Results
 
A:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
Cray Research T3E-900
MPI
 
Summary of MPI Results
 
A:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
B:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
C:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
SHMEM
 
Summary of SHMEM Results
 
A:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
B:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
C:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
 
   Comparison of MPI and SHMEM Results
 
IBM SP2-66
MPI
 
Summary of MPI Results
 
A1:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
A2:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
MPL
 
Summary of MPL Results
 
A1:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
A2:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
 
   Comparison of MPI and MPL Results
 
IBM SP3-200 Winterhawk I
MPI
(without shared memory implementation for intranode communication)
 
Summary of MPI Results
 
A1:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
A2:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
A3:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
B1:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
B2:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
C:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
 
IBM SP3-200 Winterhawk I
MPI
(with shared memory implementation for intranode communication)
Summary of MPI Results
 
A:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
IBM SP3-375 Winterhawk II
MPI
 
Summary of MPI Results
 
A:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
B:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
C:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
 
Intel Paragon
OSF/MPI
 
Summary of MPI Results
 
A1:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
OSF/NX
 
Summary of NX Results
 
A1:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
A2:    DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
 
   Comparison of OSF/MPI and OSF/NX Results
 
SUNMOS
 
Summary of SUNMOS Results
 
A2:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
 
   Comparison of OSF/NX and SUNMOS Results
 
SGI Origin2000-195
MPI
 
Summary of MPI Results
 
A:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
B:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
C:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
SHMEM
 
Summary of SHMEM Results
 
A:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
B:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
C:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
 
   Comparison of MPI and SHMEM Results
 
SGI Origin2000-250
MPI
 
Summary of MPI Results
 
A:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
B:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
C:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
SHMEM
 
Summary of SHMEM Results
 
A:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
B:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
C:     DFFT EXCHSUM HALFSUM
RINGPIPE RINGSUM
LOGTRANS (1) SRTRANS (1) SWTRANS (1)
LOGTRANS (2) SRTRANS (2) SWTRANS (2)
LOGTRANS (3) SRTRANS (3) SWTRANS (3)
 
   Comparison of MPI and SHMEM Results
 

PSTSWM Performance Page


Patrick H. Worley / ( worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 09:59:46 EDT.
5522 accesses since 1/2/96.