Communication overhead for parallel runs of PSTSWM on a given platform is a
function of the parallel algorithm, the parallel
algorithm implementation, and the problem granularity (problem size and
number of processors). One aspect of the parallel implementation
that we have found to be important on some platforms is the message-passing
protocol. Some indication of the variation that can occur from the choice of
message-passing protcol can be seen from the basic point-to-point
communication performance results described here.
In these studies, we examine the impact of the choice of protocol in more
detail, looking at the effect on the performance of specific parallel
algorithm options in PSTSWM.
In the spectral transform method used in PSTSWM, fields are transformed at
each timestep between the physical (longitude-latitude-vertical)
and the Fourier (wavenumber-latitude-vertical) domains using Fourier
transforms in the longitude direction, and between the Fourier and spectral
(spectral coefficients - vertical) domains using a Legendre transform
in the latitude direction. All parallel algorithms in
PSTSWM are based on decomposing the different computational domains onto a
logical two-dimensional grid of processors, PXx PY.
In each domain two of the domain dimensions are
decomposed across the processor grid, for example, assigning
longitude-latitude patches of the physical domain to individual processors,
but leaving one domain dimension undecomposed.
Two general types of parallel algorithms are used in PSTSWM:
transpose and distributed.
In a transpose algorithm, the decomposition is
``rotated'' before a transform begins, to ensure that all data needed
to compute a particular transform is local to a single processor.
In a distributed algorithm the original decomposition
of the domain is retained, and communication is performed to allow the
processors to cooperate in the calculation of a transform.
Three transpose algorithms are examined, each of which is functionally
equivalent to MPI_ALLTOALLV:
srtrans: sends P-1 messages using SENDRECV to transpose across P
processors;
swtrans: sends P-1 messages using SWAP to transpose across P
processors;
logtrans: sends O(log(P)) messages using SWAP to transpose across
P processors.
srtrans, swtrans, and logtrans all use different
orderings of interprocessor communication between processors, and
logtrans sends more data than the other two.
Each of these are options for both the parallel Fourier and parallel Legendre
transform algorithms.
Four distributed Legendre transform algorithms are examined,
the first three of which are functionally equivalent to MPI_ALLREDUCE:
exchsum: an exchange-based algorithm for distributed
vector sum, implemented using SWAP;
halfsum: a recursive halving-based algorithm for distributed
vector sum, implemented using SWAP;
ringsum: a ring-based algorithm for distributed
vector sum, implemented using SENDRECV;
ringpipe: a pipeline-based algorithm for distributed
vector sum, implemented using SENDRECV.
Finally, one distributed Fast Fourier transform is examined:
dfft: sends O(log (P)) messages using SWAP to
calculate Fourier transform distributed across P processors.
Each of these algorithms can be implemented using
two general classes of protocols: unordered and
ordered. While not all protocols are available for all
message-passing transport layers, they are drawn from those described
below. Examples are given for the SWAP command using MPI commands.
Similar definitions hold for the SENDRECV command. Note that
the examples have been simplified (to save room) and do not accurately
represent the MPI implementations. In particular, handshaking messages
required for correct use of the ready send command have been omitted.
Unordered
 
Ordered
(0,0): simple
Processor 1 and 2
MPI_BSEND
MPI_RECV
(1,0): simple
Processor 1
 
Processor 2
MPI_SEND
 
MPI_RECV
MPI_RECV
 
MPI_SEND
(0,1): nonblocking send
Processor 1 and 2
MPI_ISEND
MPI_RECV
(1,1): nonblocking send
Processor 1
 
Processor 2
MPI_ISEND
 
MPI_RECV
MPI_RECV
 
MPI_SEND
(0,2): nonblocking receive
Processor 1 and 2
MPI_IRECV
MPI_SEND
(1,2): nonblocking receive
Processor 1
 
Processor 2
MPI_IRECV
 
MPI_RECV
MPI_SEND
 
MPI_SEND
(0,3): nonblocking send and receive
Processor 1 and 2
MPI_IRECV
MPI_ISEND
(1,3): nonblocking send and receive
Processor 1
 
Processor 2
MPI_IRECV
 
MPI_RECV
MPI_ISEND
 
MPI_SEND
(0,4): ready send
Processor 1 and 2
MPI_IRECV
MPI_RSEND
(1,4): ready send
Processor 1
 
Processor 2
MPI_IRECV
 
MPI_RECV
MPI_RSEND
 
MPI_RSEND
(0,5): nonblocking ready send
Processor 1 and 2
MPI_IRECV
MPI_IRSEND
(1,5): nonblocking ready send
Processor 1
 
Processor 2
MPI_IRECV
 
MPI_RECV
MPI_IRSEND
 
MPI_RSEND
(0,6): native sendrecv
Processor 1 and 2
MPI_SENDRECV
 
 
 
(1,6): synchronous
Processor 1
 
Processor 2
-
 
MPI_RECV
MPI_SEND
 
-
-
 
MPI_SEND
MPI_RECV
 
-
Two different types of implementations are also supported.
The first uses the basic SWAP and SENDRECV command to exchange the data. The
second reorders the elements of the SWAP or SENDRECV protocol to prepost the
nonblocking receives and delay waiting for the nonblocking commands to
complete, in an attempt to overlap communication with computation and
hide communication latency.
These algorithms and protocols are described in more detail in
P. H. Worley and B. Toonen,
A users' guide to PSTSWM, ORNL Technical Report
ORNL/TM-12779, July 1995.
To examine the performance issues in these different implementation options,
we run one or more of the following experiments.
A: We use one-dimensional parallelizations of
the form 8x1 or 1x8, 16x1 or 1x16, and 32x1 or 1x32, where the first
decomposition in each pair is for examining parallel Fourier transform
algorithms, and the second is for examining parallel Lengendre transform
algorithms. The problem sizes are based on T42L16 (64x128x16 physical domain
grid) and T85L32 (128x256x32 physical domain grid) as they would appear using
a two-dimensional parallelization
of the form 8x8, 16x32, or 32x16. This is accomplished by
modifying the problem size to achieve the desired granularity
(problem size per processor).
Experiment A allows us
to examine the performance of individual parallel algorithms
in isolation
for problem granularities that are typical of what would be seen in production.
In actual practice, each parallel algorithm will run on subsets of
processors in a two-dimensional virtual processor grid.
Thus Experiment A does not take into account the performance effect of
a realistic assignment of processes to processors when the
virtual processor grid does not match the physical interconnection
topology, nor the contention caused by multiple subsets running the same
parallel algorithm.
This experiment also overestimates the importance of the performance
variation due to the choice of communication protocol.
The measured runtimes will be smaller than those for the corresponding "real"
two-dimensional runs since they do not include the communication cost for the
parallel algorithm in the other direction.
B: We use two-dimensional parallelizations of the forms
8x8, 16x4 or 4x16, and 32x2 or 2x32. The "larger" dimension is used
to examine the parallel Fourier or Legendre transform options, while a
fixed parallel algorithm is used for the "smaller" dimension.
The parallel Fourier transform routines are paired with ringsum
using the (1,6) communication protocol.
The parallel Legendre transforms are paired with either dfft using
(1,6) or swtrans using (1,6). A row-major processor ordering is
used, with processor row 1 assigned processors {0,...,PX-1},
processor row 2 assigned processors {PX,...,2*PX-1},
etc. As in experiment A, problem sizes are based on T42L16
and T85L32 as they would appear on a two dimensional grid of size 8x8, 16x32,
or 32x16.
Experiment B allows us to examine one example of the contention
for resources when running multiple copies of the parallel algorithm.
It is not the same amount of contention as would be seen in a real run
for the 16 and 32 processor cases, but it should provide us with some
information.
Note that this experiment underestimates the performance variation due to the
communication protocol for the 8x8 case in that the execution in a realistic
run will be smaller. The default parallel algorithm for the other dimension
uses a communication protocol that is generally one of the worst
performers. It was chosen because it is the only option that is supported
across all of the message passing libraries that PSTSWM supports.
C: We use the same two-dimensional decompositions, default
parallel algorithms, and problem sizes as in experiment B, but
now use a column-major processor ordering. That is, processor column 1 is
assigned processors {0,...,PY-1},
processor column 2 is assigned processors
{PY,...,2*PY-1}, etc.
This generates a different process placement and potentially different
contention.
Results are presented as scatterplots for each parallel platform, set of
experiments, parallel algorithm, problem size, and number of processors.
The x-axis is the protocol option (0-6) and the y-axis is the execution time
for 12 timesteps of PSTSWM using the standard benchmark problem case
(global steady state nonlinear zonal geostrophic flow).
Each x-axis tick mark will have results for unordered, unordered/overlap;
ordered, and ordered/overlap protocols. Some of the algorithms have results
presented for different degrees of overlap, with overlap indicating a
modest degree of overlap and OVERLAP indicating the maximum possible.
The ringpipe algorithm also has additional send-ahead,
send-ahead/overlap, and send-ahead/OVERLAP options, in which the message send
and message receive are separated by significant computation.
Three separate experimental results are presented
for each transpose algorithm. The second uses a Px1 decomposition,
representing the parallel Fourier transform. The other two use a 1xP
decomposition, representing a parallel Legendre transform, but where the
first assumes a distributed parallel Fourier transform while the other
assumes a transpose parallel Fourier transform.
Note that these three transpose experiments differ primarily
in the size of the messages being transmitted.
Statistics summarizing the data are presented below each plot:
minimum observed runtime
three protocols with minimum runtimes (in order)
measures of "spread" of runtimes:
relative difference between mean and minimum
relative difference between median and minimum
relative difference between maximum and minimum
measures of sensitivity to choice of protocol:
number of protocols whose runtimes are within 1% of minimum
number of protocols whose runtimes are within 5% of minimum
number of protocols whose runtimes are within 25% of minimum
A summary of the results is also provided for each platform and
message-passing library combination. For each parallel algorithm, we indicate
the best protocols,
the importance of chosing good protocols
(looking at the performance variation caused by varying the
communication protocols and at the number of protocols with runtimes
near the optimal values), and
the differences between the results in Experiments A, B, and C.
If contention and process placement do not have significant impact on
performance, than the Experiment B and C results will show the same optimal
protocols but less relative variation than Experiment A. (Experiments B and C
will have increased runtime compared to A, due to the communication costs
arising from the fixed parallel algorithm). If contention or process
placement are important, then the performance variablity may increase
for Experiments B and C, or the optimal protocols may differ between
Experiments A, B, and C.
If results for more than one message-passing library are provided for a
given platform, then performance differences between the libraries are
discussed. In particular, we compare the optimal protocols, the
sensitivity to choice of protocol, and the sensitivity to contention and
process placement.
Note that timings are for the entire code, and that percentage differences in
communication costs will generally be much higher than the percentage
differences in total runtime reported for comparable tests cases.
Changing the communication protocols does not change the complexity of the
runs and does not significantly change the total amount of data moved.
What is affected is the number and order of handshaking messages, and whether
bidirectional bandwidth or communication/computation overlap are attempted.
The one case where the computation may be affected is the distributed FFT
dfft. In order to exploit overlap, the block FFT is divided into two
blocks. On some machines and with some problem granularities, transforming
two small blocks results in a higher computational rate than transforming a
single large block. This should be taken into account when comparing
dfft protocols.
Also note that when comparing between platforms, the sensitivity is a strong
function of the communication/computation ratio, and a platform
with a slow processor may show significantly less performance sensitivity
than one with a fast processor. Thus high performance sensitivity to the
communication protocol is more of a feature than a bug, but it is a feature
that should be recognized if goos performance is to be achieved.