Communication overhead is best measured in the context of the full
code, but we have found it useful to establish a performance baseline by
determining the "peak achieveable" point-to-point interprocessor
communication performance.
The program COMMTEST is used to determine this
baseline, as described below.
Performance-critical interprocessor communication in
PSTSWM and
CCM/MP-2D is implemented
using two basic types of commands: SWAP and SENDRECV. The message-passing
transport layer used to implement these commands is specified at compile
time, while the protocol used is specified at runtime.
To characterize the basic communication capabilities of the different
platforms and message-passing layers in terms relevant to PSTSWM and CCM/MP-2D,
we use the PSTSWM SWAP commands.
Using the SWAP commands adds an extra one or two subroutine
calls to the overhead of calling the underlying transport layer, and multiple
"native" commands may be required to implement the SWAP semantics. Thus the
measurements will not necessarily agree with the "ping pong" or "ping ping"
measurements reported by other researchers, but they should be
comparable. More importantly, our measurements correspond exactly to the
basic interprocessor communication primitives in PSTSWM and CCM/MP-2D, and
should be consistent and fair across the different platforms.
PSTSWM and CCM/MP-2D performance are more sensitive to bandwidth than to
latency, and the primary focus of these experiments is on determining the
achievable bandwidth
when exchanging moderate to large size messages. To achieve this, we measure
the time required to exchange 262144 64-bit floating point numbers between two
neighboring processors. The experiments vary the packet size/number of
messages used to exchange the information and the protocol used for the
exchange. The smallest message sent is 2KBytes (256 REAL*8 values), and the
largest is 2MBytes. In the results that follow, we refer to these as the
2MB experiments. For completeness sake, we also measure the
time to swap 1024 and 16384 64-bit values, using message sizes ranging from 8 bytes
to 8KBytes and from 128 bytes to 128KBytes, respectively. We refer to these
as the 8KB and 128KB experiments.
By varying the number of messages and message size, but leaving the total
volume fixed, we can easily observe both the constant overhead in sending and
receiving messages (looking at differences between measurements) and the
achieved bandwidth. If communication
performance satisfies a "latency, bandwidth" linear model, these parameters
would be determined exactly, but the effects of changes in protocol, contention for shared
resources, and the memory hierarchy make the observed parameters functions of
the message packet size (and the state of the system). Spot estimates of
the parameters are still of interest, and the minimum observed execution time
can be used to calculate the maximum achieved bandwidth for a given experiment.
These experiments also have the practical advantage of
not requiring the measurement of either very small or very large execution
times. For the larger message counts, the experiments are similar to the
typical measurements of communication cost using repeated sends and receives.
Note however that they move linearly through memory, with each send and
receive accessing unique memory locations. For the smaller message counts,
the larger overhead of the "first" call (instruction cache miss) becomes more
noticeable, but the smaller message counts involve the largest messages, so
the data movement itself should dominate the cost. (To confirm this, we run
separate experiments in which each send and receive is repeated some number
of times, as described later.) There is also some intrinsic interest in
determining whether it is better to explicitly send a large message in
packets, or whether the message-passing transport layer takes care of such
optimization issues automatically.
Most of the results below correspond to measuring the SWAP performance
between either logical processors 0 and 1 in a two processor partition
or logical processors 0 and 2 in a four processor partition, using
the one that maximizes measured performance.
For newer platforms, additional results are presented. For example,
the IBM SP3 system using Winterhawk I nodes has two processor SMP nodes
interconnected by a switch. Different message passing protocols are used
between processors in a node and between processors in different nodes.
Also, both processors in a node share access to the external network.
To better understand message-passing performance on this architecture
we test both the 0-1 (intranode) and 0-2 (internode) SWAP
performance. We also test simultaneous 0-2, 1-3 SWAP performance,
to measure the effect of contention for the interface to the network.
Two general classes of protocols are used: unordered (ping-ping) and
ordered (ping-pong). While not all protocols are available for all
message-passing transport layers, they are drawn from those described
below. Examples are given using MPI commands. Note that the examples have
been simplified (to save room) and do not accurately represent the MPI
implementations. In particular, handshaking messages required for correct use
of the ready send command have been omitted.
Unordered
 
Ordered
(0,0): simple
Processor 1 and 2
MPI_BSEND
MPI_RECV
(1,0): simple
Processor 1
 
Processor 2
MPI_SEND
 
MPI_RECV
MPI_RECV
 
MPI_SEND
(0,1): nonblocking send
Processor 1 and 2
MPI_ISEND
MPI_RECV
(1,1): nonblocking send
Processor 1
 
Processor 2
MPI_ISEND
 
MPI_RECV
MPI_RECV
 
MPI_SEND
(0,2): nonblocking receive
Processor 1 and 2
MPI_IRECV
MPI_SEND
(1,2): nonblocking receive
Processor 1
 
Processor 2
MPI_IRECV
 
MPI_RECV
MPI_SEND
 
MPI_SEND
(0,3): nonblocking send and receive
Processor 1 and 2
MPI_IRECV
MPI_ISEND
(1,3): nonblocking send and receive
Processor 1
 
Processor 2
MPI_IRECV
 
MPI_RECV
MPI_ISEND
 
MPI_SEND
(0,4): ready send
Processor 1 and 2
MPI_IRECV
MPI_RSEND
(1,4): ready send
Processor 1
 
Processor 2
MPI_IRECV
 
MPI_RECV
MPI_RSEND
 
MPI_RSEND
(0,5): nonblocking ready send
Processor 1 and 2
MPI_IRECV
MPI_IRSEND
(1,5): nonblocking ready send
Processor 1
 
Processor 2
MPI_IRECV
 
MPI_RECV
MPI_IRSEND
 
MPI_RSEND
(0,6): native sendrecv
Processor 1 and 2
MPI_SENDRECV
 
 
 
(1,6): synchronous
Processor 1
 
Processor 2
-
 
MPI_RECV
MPI_SEND
 
-
-
 
MPI_SEND
MPI_RECV
 
-
(0,7): nonblocking sync. send
Processor 1 and 2
MPI_ISSEND
MPI_RECV
(1,7): nonblocking sync. send
Processor 1
 
Processor 2
MPI_ISSEND
 
MPI_RECV
MPI_RECV
 
MPI_SSEND
(0,8): sync. send and nonblocking receive
Processor 1 and 2
MPI_IRECV
MPI_SSEND
(1,8): sync. send and nonblocking receive
Processor 1
 
Processor 2
MPI_IRECV
 
MPI_RECV
MPI_SSEND
 
MPI_SSEND
(0,9): nonblocking sync. send and receive
Processor 1 and 2
MPI_IRECV
MPI_ISSEND
(1,9): nonblocking sync. send and receive
Processor 1
 
Processor 2
MPI_IRECV
 
MPI_RECV
MPI_ISSEND
 
MPI_SSEND
 
 
 
 
(1,10): sync. send
Processor 1
 
Processor 2
MPI_SSEND
 
MPI_RECV
MPI_RECV
 
MPI_SSEND
These protocols are described in more detail in
P. H. Worley and B. Toonen,
A users' guide to PSTSWM, ORNL Technical Report
ORNL/TM-12779, July 1995.
Two different sets of experiments are performed. The first uses the basic
SWAP command to exchange the data. The second reorders the elements of the
SWAP protocol to prepost the nonblocking receives and delay waiting for the
nonblocking commands to complete, in an attempt to overlap some of the
start-up costs of one swap with the actual transmission time of another. The
reorganization may also prevent some contention for shared resources when
simultaneously sending and receiving. The logic of the reorganized test is as
follows:
Prepost one receive request.
Synchronize.
Begin timing.
for i = 1, number of messages
Prepost receive request for next message.
Begin send and complete receive for current message.
Complete send for previous message.
Stop timing.
Both sets of experiments are also conducted in up to
three different ways:
Before timing begins, the L2 cache is invalidated, the message buffer
is initialized, and a barrier is called. Each message is sent once.
Before timing begins, the message buffer
is initialized and a barrier is called, but the L2 cache is not
invalidated first. Each message is sent once.
Before timing begins, the message buffer
is initialized and a barrier is called, but the L2 cache is
not invalidated first.
Each message is sent multiple times, and the time is divided by the
number of iterations. The
iteration loops are inner loops. For example, step (4.1) would be repeated
10 times, followed by (4.2), etc. Also, the iteration is over the same
memory and data, so data cache misses will be reduced when iterating
if the message size is small.
In all cases, the maximum execution time observed by a single processor
is reported. When a global clock is available, the elapsed time
between the first processor beginning the experiment and the last
processor completing the experiment is recorded, and significant
discrepancies from the single processor timing are noted.
The following results contain the graphs of total time versus number of
messages and MBytes/second versus the message size for each protocol and
experiment, as well as some simple statistics:
b  =  (min TN)/(total number of bytes sent)
and
a  =  (TN2-TN1)/(N2-N1).
All counts are for a single procssor, and
TN is the time measured when exchanging the data using
N SWAPs. Typically we use N2 = 1024 and N1 = 512 for computing
a, but different choices are made if
T1024 or T512 is not representative of
the rest of the data for a given platform.
The statistics are reported differently for the ordered and unordered
protocols:
For the unordered protocols, messages between the two processors are
(logically) sent simultaneously. We denote a to be the "latency",
representing an estimate of the constant overhead associated with sending and
receiving a message. (When the smallest message is 2KBytes, this latency
estimate has little relationship to the time required to send a zero or 1
byte message, as latency is often defined.)
We denote 2/b to be the bidirectional "swap" bandwidth, representing
the total capacity for moving data in an unordered SWAP.
The unidirectional "busy" bandwidth is half this, representing
the capacity for sending data while simultaneously receiving data.
For the ordered protocols, messages are not overlapped,
and the number of sends and data volume need to be doubled in the
analysis. Therefore, a/2 is denoted the "latency". We denote
2/b to be the unidirectional "idle" bandwidth, representing
the capacity for sending data while nothing else is happening.
Note that 2/b is also the "swap" bandwidth, representing
the total capacity for moving data in an ordered SWAP.
We also present the error in using our latency and bandwidth
statistics in the standard linear model for communication cost
a*(number of messages) + b*(total number of bytes sent)
to approximate the data. Note, however, that
two parameter linear models are inaccurate for message-passing libraries
that switch the underlying protocols based on the message size,
simple linear models cannot capture the complex memory hierarchy
interactions in current architectures,
simple linear models cannot capture the contention that can occur
in simultaneous sends and receives, and
this model does not represent a standard approximation
to the data.
The model is applied purely for curiosity's sakes, and the error in the
model does not reflect on the accuracy of the individual statistics.
When it is clear that there is a change of protocol involved, the
model error may be calculated on a subset of the data corresponding
to the dominant protocol. When this occurs, it will be noted on the
results page. Otherwise, the model error is computed over all the data.