Compaq AlphaServer SC Parallel Algorithms for CCM/MP-2D
(Results based on November, 1999 PSTSWM Experiments)
The AlphaServer SC is a distributed-memory parallel
architecture utilizing high-end workstation-class processors
interconnected by a Quadrics (fat-tree) switch.
The SC used in these experiments was a Compaq development
machine with eight 4-way SMP nodes.
We used the results of our November, 1999 studies of
PSTSWM performance
to identify the following parallel algorithms for CCM/MP-2D
on the AlphaServer SC.
When referring to the parallel algorithms and their implementations, we
use the following shorthand:
The names of the individual parallel algorithms, e.g., srtrans,
are described on the PSTSWM protocol web pages.
"overlap" refers to a communication protocol that posts
one or more send or receive requests as early as possible, in an attempt to
overlap communication with computation or to hide latency. The default
case is "no overlap".
"ordered" refers to a communication protocol that does not
attempt to exploit bidirectional bandwidth in a swap or send-receive operation,
instead having one processor send and the other receive, followed by the
reverse when the first send is complete. The default is "unordered", i.e.,
logically sending both directions simultaneously.
"ready send" refers to a communication protocol that uses
MPI_RSEND or MPI_IRSEND, together with additional handshaking messages,
to control precisely when messages are sent and received.
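The difference between "ordered" and "unordered" can be illustrated with a rough cost estimate. The sketch below is not taken from the PSTSWM pages. It assumes a simple model in which one message costs latency + nbytes / bandwidth and the link is fully bidirectional; the function name swap_time and its parameters are hypothetical.

```python
def swap_time(nbytes, latency, bandwidth, ordered):
    """Estimated time for two processes to exchange nbytes each.

    Assumed model (hypothetical, for illustration only): a single
    message costs latency + nbytes / bandwidth, and the link can carry
    traffic in both directions at full speed simultaneously.
    """
    one_way = latency + nbytes / bandwidth
    if ordered:
        # "ordered": one side sends while the other receives, and the
        # reverse transfer starts only after the first send completes.
        return 2 * one_way
    # "unordered": both directions are logically in flight at once.
    return one_way


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    print(swap_time(1000, 1.0, 1000.0, ordered=True))
    print(swap_time(1000, 1.0, 1000.0, ordered=False))
```

Under these assumptions the unordered swap is at most twice as fast. Measured behaviour depends on the interconnect and the MPI implementation, so this is an upper bound on the benefit rather than a prediction.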
A parallel algorithm for CCM/MP-2D is specified as a vector
consisting of the codes for the individual parallel algorithms, in the
following order:
The parallel algorithms chosen for the AlphaServer SC experiments
are listed below.
Seven different algorithms were examined, three distributed
FFT/distributed Legendre transform algorithms:
d0: (df0 , dl0 , ds0 , lb0)
d1: (df1 , dl1 , ds1 , lb1)
d2: (df2 , dl2 , ds2 , lb2)
and four transpose FFT/distributed Legendre transform algorithms:
t0: (tf0 , dl0 , ds0 , lb0)
t1: (tf0 , dl1 , ds0 , lb0)
t2: (tf1 , dl1 , ds1 , lb1)
t3: (tf2 , dl2 , ds2 , lb2)
where the codes for the individual parallel algorithms are as follows:
lb2 - MPI_SENDRECV-based communication protocol: (0,6)
d0, t0, and t1 use the protocols that the PSTSWM
experiments indicate are best for "small" granularity
problems, while d1 and t2 use the protocols that are best for
"large" granularity problems. d2 and t3 use the standard MPI
protocols that one would choose without tuning.