IBM SP3-200 Winterhawk I Parallel Algorithms for CCM/MP-2D
(Results based on June, 1999 PSTSWM Experiments)
The IBM SP3 is a distributed-memory parallel
architecture utilizing high-end workstation-class processors
interconnected by a bidirectional multistage switch.
The SP3 used in these experiments was the first stage of
a larger machine being sited at Oak Ridge National Laboratory.
We used the results of our June, 1999 studies of
PSTSWM performance
to identify the following parallel algorithms for CCM/MP-2D
on the SP3-200.
When referring to the parallel algorithms and their implementations, we
use the following shorthand:
The names of the individual parallel algorithms, e.g., srtrans,
are described on the PSTSWM protocol web pages.
"overlap" refers to a communication protocol that posts
one or more send or receive requests as early as possible, in an attempt to
overlap communication with computation or to hide latency. The default
case is "no overlap".
"ordered" refers to a communication protocol that does not
attempt to exploit bidirectional bandwidth in a swap or send-receive operation,
instead having one processor send and the other receive, followed by the
reverse when the first send is complete. The default is "unordered", i.e.,
logically sending both directions simultaneously.
"ready send" refers to a communication protocol that uses
MPI_RSEND or MPI_IRSEND together with additional handshaking messages
to control precisely when messages are sent and received.
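The difference between the "ordered" and "unordered" swap can be illustrated with a rough cost estimate. The sketch below is not taken from PSTSWM. It assumes a simple model in which one message costs a fixed latency plus its size divided by the bandwidth, and in which the switch carries both directions at full rate at the same time. The function name swap_time and the sample figures are hypothetical.

```python
def swap_time(nbytes, latency, bandwidth, ordered=False):
    """Estimate the time for two processors to exchange nbytes each.

    Assumed model (not from the PSTSWM documentation): one message costs
    latency + nbytes / bandwidth, and the link carries both directions
    at full rate simultaneously.
    """
    one_way = latency + nbytes / bandwidth
    # Ordered: one processor sends, then the reverse send starts only
    # after the first send is complete, so the two costs add.
    # Unordered: both directions are logically in flight together.
    return 2 * one_way if ordered else one_way


if __name__ == "__main__":
    # Hypothetical figures chosen only for illustration.
    for ordered in (False, True):
        t = swap_time(nbytes=1_000_000, latency=20e-6,
                      bandwidth=100e6, ordered=ordered)
        print("ordered" if ordered else "unordered", t)
```

Under this model the ordered swap takes twice as long as the unordered one. On a real machine the gap depends on how much bidirectional bandwidth the hardware and the MPI implementation actually deliver, which is the kind of question the experiments address.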
A parallel algorithm for CCM/MP-2D is specified as a vector
consisting of the codes for the individual parallel algorithms, in the
following order:
The parallel algorithms chosen for the SP3-200 experiments
are listed below.
Ten different algorithms were examined. Four are distributed
FFT/distributed Legendre transform algorithms:
d0: (df0, dl0, ds0, lb0)
d1: (df1, dl1, ds1, lb1)
d2: (df2, dl1, ds0, lb0)
d3: (df1, dl1, ds2, lb2)
and six transpose FFT/distributed Legendre transform algorithms:
t0: (tf0, dl0, ds0, lb0)
t1: (tf1, dl1, ds0, lb0)
t2: (tf2, dl1, ds0, lb0)
t4: (tf0, dl1, ds0, lb0)
t5: (tf1, dl1, ds1, lb1)
t6: (tf2, dl1, ds2, lb2)
where the codes for the individual parallel algorithms are as follows:
Distributed FFT
df0 - MPI_SENDRECV-based communication protocol: (0,6)