COMMTEST Communication Characterization

Performance Studies using

COMMTEST


Point-to-Point Message-Passing Performance

Communication overhead is best measured in the context of the full code, but we have found it useful to establish a performance baseline by determining the "peak achieveable" point-to-point interprocessor communication performance. The program COMMTEST is used to determine this baseline, as described below.

Performance-critical interprocessor communication in PSTSWM and CCM/MP-2D is implemented using two basic types of commands: SWAP and SENDRECV. The message-passing transport layer used to implement these commands is specified at compile time, while the protocol used is specified at runtime.

To characterize the basic communication capabilities of the different platforms and message-passing layers in terms relevant to PSTSWM and CCM/MP-2D, we use the PSTSWM SWAP commands. Using the SWAP commands adds an extra one or two subroutine calls to the overhead of calling the underlying transport layer, and multiple "native" commands may be required to implement the SWAP semantics. Thus the measurements will not necessarily agree with the "ping pong" or "ping ping" measurements reported by other researchers, but they should be comparable. More importantly, our measurements correspond exactly to the basic interprocessor communication primitives in PSTSWM and CCM/MP-2D, and should be consistent and fair across the different platforms.

PSTSWM and CCM/MP-2D performance are more sensitive to bandwidth than to latency, and the primary focus of these experiments is on determining the achievable bandwidth when exchanging moderate to large size messages. To achieve this, we measure the time required to exchange 262144 64-bit floating point numbers between two neighboring processors. The experiments vary the packet size/number of messages used to exchange the information and the protocol used for the exchange. The smallest message sent is 2KBytes (256 REAL*8 values), and the largest is 2MBytes. In the results that follow, we refer to these as the 2MB experiments. For completeness sake, we also measure the time to swap 1024 and 16384 64-bit values, using message sizes ranging from 8 bytes to 8KBytes and from 128 bytes to 128KBytes, respectively. We refer to these as the 8KB and 128KB experiments.

By varying the number of messages and message size, but leaving the total volume fixed, we can easily observe both the constant overhead in sending and receiving messages (looking at differences between measurements) and the achieved bandwidth. If communication performance satisfies a "latency, bandwidth" linear model, these parameters would be determined exactly, but the effects of changes in protocol, contention for shared resources, and the memory hierarchy make the observed parameters functions of the message packet size (and the state of the system). Spot estimates of the parameters are still of interest, and the minimum observed execution time can be used to calculate the maximum achieved bandwidth for a given experiment.

These experiments also have the practical advantage of not requiring the measurement of either very small or very large execution times. For the larger message counts, the experiments are similar to the typical measurements of communication cost using repeated sends and receives. Note however that they move linearly through memory, with each send and receive accessing unique memory locations. For the smaller message counts, the larger overhead of the "first" call (instruction cache miss) becomes more noticeable, but the smaller message counts involve the largest messages, so the data movement itself should dominate the cost. (To confirm this, we run separate experiments in which each send and receive is repeated some number of times, as described later.) There is also some intrinsic interest in determining whether it is better to explicitly send a large message in packets, or whether the message-passing transport layer takes care of such optimization issues automatically.

Most of the results below correspond to measuring the SWAP performance between either logical processors 0 and 1 in a two processor partition or logical processors 0 and 2 in a four processor partition, using the one that maximizes measured performance. For newer platforms, additional results are presented. For example, the IBM SP3 system using Winterhawk I nodes has two processor SMP nodes interconnected by a switch. Different message passing protocols are used between processors in a node and between processors in different nodes. Also, both processors in a node share access to the external network. To better understand message-passing performance on this architecture we test both the 0-1 (intranode) and 0-2 (internode) SWAP performance. We also test simultaneous 0-2, 1-3 SWAP performance, to measure the effect of contention for the interface to the network.

Two general classes of protocols are used: unordered (ping-ping) and ordered (ping-pong). While not all protocols are available for all message-passing transport layers, they are drawn from those described below. Examples are given using MPI commands. Note that the examples have been simplified (to save room) and do not accurately represent the MPI implementations. In particular, handshaking messages required for correct use of the ready send command have been omitted.

Unordered

 

Ordered

(0,0): simple
Processor 1 and 2
MPI_BSEND
MPI_RECV
(1,0): simple
Processor 1 Processor 2
MPI_SEND MPI_RECV
MPI_RECV MPI_SEND
(0,1): nonblocking send
Processor 1 and 2
MPI_ISEND
MPI_RECV
(1,1): nonblocking send
Processor 1 Processor 2
MPI_ISEND MPI_RECV
MPI_RECV MPI_SEND
(0,2): nonblocking receive
Processor 1 and 2
MPI_IRECV
MPI_SEND
(1,2): nonblocking receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_SEND MPI_SEND
(0,3): nonblocking send and receive
Processor 1 and 2
MPI_IRECV
MPI_ISEND
(1,3): nonblocking send and receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_ISEND MPI_SEND
(0,4): ready send
Processor 1 and 2
MPI_IRECV
MPI_RSEND
(1,4): ready send
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_RSEND MPI_RSEND
(0,5): nonblocking ready send
Processor 1 and 2
MPI_IRECV
MPI_IRSEND
(1,5): nonblocking ready send
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_IRSEND MPI_RSEND
(0,6): native sendrecv
Processor 1 and 2
MPI_SENDRECV
 
 
 
(1,6): synchronous
Processor 1 Processor 2
- MPI_RECV
MPI_SEND -
- MPI_SEND
MPI_RECV -
(0,7): nonblocking sync. send
Processor 1 and 2
MPI_ISSEND
MPI_RECV
(1,7): nonblocking sync. send
Processor 1 Processor 2
MPI_ISSEND MPI_RECV
MPI_RECV MPI_SSEND
(0,8): sync. send and nonblocking receive
Processor 1 and 2
MPI_IRECV
MPI_SSEND
(1,8): sync. send and nonblocking receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_SSEND MPI_SSEND
(0,9): nonblocking sync. send and receive
Processor 1 and 2
MPI_IRECV
MPI_ISSEND
(1,9): nonblocking sync. send and receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_ISSEND MPI_SSEND
 
 
 
 
(1,10): sync. send
Processor 1 Processor 2
MPI_SSEND MPI_RECV
MPI_RECV MPI_SSEND
These protocols are described in more detail in

available from here.

Two different sets of experiments are performed. The first uses the basic SWAP command to exchange the data. The second reorders the elements of the SWAP protocol to prepost the nonblocking receives and delay waiting for the nonblocking commands to complete, in an attempt to overlap some of the start-up costs of one swap with the actual transmission time of another. The reorganization may also prevent some contention for shared resources when simultaneously sending and receiving. The logic of the reorganized test is as follows:

  1. Prepost one receive request.
  2. Synchronize.
  3. Begin timing.
  4. for i = 1, number of messages
    1. Prepost receive request for next message.
    2. Begin send and complete receive for current message.
    3. Complete send for previous message.
  5. Stop timing.

Both sets of experiments are also conducted in up to three different ways:

In all cases, the maximum execution time observed by a single processor is reported. When a global clock is available, the elapsed time between the first processor beginning the experiment and the last processor completing the experiment is recorded, and significant discrepancies from the single processor timing are noted.

The following results contain the graphs of total time versus number of messages and MBytes/second versus the message size for each protocol and experiment, as well as some simple statistics:

b  =  (min TN)/(total number of bytes sent)
and
a  =  (TN2-TN1)/(N2-N1).
All counts are for a single procssor, and TN is the time measured when exchanging the data using N SWAPs. Typically we use N2 = 1024 and N1 = 512 for computing a, but different choices are made if T1024 or T512 is not representative of the rest of the data for a given platform.

The statistics are reported differently for the ordered and unordered protocols:

We also present the error in using our latency and bandwidth statistics in the standard linear model for communication cost

a*(number of messages) + b*(total number of bytes sent)
to approximate the data. Note, however, that The model is applied purely for curiosity's sakes, and the error in the model does not reflect on the accuracy of the individual statistics. When it is clear that there is a change of protocol involved, the model error may be calculated on a subset of the data corresponding to the dominant protocol. When this occurs, it will be noted on the results page. Otherwise, the model error is computed over all the data.

RESULTS:

Platform Comparisons

(current systems)

IBM p690 (32 processor Turbo SMP node)
MPI (intranode: processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-2)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-16)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-1, 2-3, ..., 30-31 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-16, 1-17, ..., 15-31 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processor 0-1-2-3-4-5-6-7-...-30-31-0 ring)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-1 in 8-processor LPAR)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-2 in 8-processor LPAR)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-1, 2-3, ..., 6-7 simultaneously in 8-processor LPAR)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-1 in four 8-processor LPARs (same p690) simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-2 in four 8-processor LPARs (same p690) simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
Summary
Compaq AlphaServer SC (1000 MHz)
MPI (intranode: processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (internode: processors 0-4)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-1, 2-3 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (internode: processors 0-4, 1-5, 2-6, 3-7 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intra/internode: processor 0-1-2-3-4-5-6-7-0 ring)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
Summary
Compaq AlphaServer SC (667 MHz)
MPI (intranode: processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (internode: processors 0-4)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-1, 2-3 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (internode: processors 0-4, 1-5, 2-6, 3-7 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intra/internode: processor 0-1-2-3-4-5-6-7-0 ring)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
SHMEM (intranode: processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
SHMEM (internode: processors 0-4)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
SHMEM (intranode: processors 0-1, 2-3 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
SHMEM (internode: processors 0-4, 1-5, 2-6, 3-7 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
SHMEM (intra/internode: processor 0-1-2-3-4-5-6-7-0 ring)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
Summary
Cray Research T3E-900
MPI (processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
MPI (processors 0-2, 1-3 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
SHMEM (processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
SHMEM (processors 0-2, 1-3 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
Summary
IBM SP3-200 Winterhawk I
MPI (intranode: processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (internode: processors 0-2)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (internode: processors 0-2, 1-3 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
Summary
IBM SP3-375 Winterhawk II
MPI (intranode: processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (internode: processors 0-4)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-1, 2-3 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (internode: processors 0-4, 1-5, 2-6, 3-7 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intra/internode: processor 0-1-2-3-4-5-6-7-0 ring)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
Summary
SGI Origin2000-250
MPI (processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
MPI (processors 0-2)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
MPI (processors 0-2, 1-3 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
SHMEM (processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
SHMEM (processors 0-2)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
SHMEM (processors 0-2, 1-3 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
Summary

(old systems)

Compaq AlphaServer SC (500 MHz)
MPI (intranode: processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (internode: processors 0-4)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intranode: processors 0-1, 2-3 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (internode: processors 0-4, 1-5, 2-6, 3-7 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
MPI (intra/internode: processor 0-1-2-3-4-5-6-7-0 ring)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
SHMEM (intranode: processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
SHMEM (internode: processors 0-4)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
SHMEM (intranode: processors 0-1, 2-3 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
SHMEM (internode: processors 0-4, 1-5, 2-6, 3-7 simultaneously)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
SHMEM (intra/internode: processor 0-1-2-3-4-5-6-7-0 ring)
2MB - Unordered   128KB - Unordered   8KB - Unordered   Unordered Summary  
2MB - Ordered   128KB - Ordered   8KB - Ordered   Ordered Summary  
Summary
Convex SPP-1200
MPI (processors 0-1)
2MB - Unordered
2MB - Ordered
Cray Research T3D
SHMEM (processors 0-2)
2MB - Unordered
2MB - Ordered
HP/Convex SPP-2000
MPI (processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
IBM SP2-66
MPI (processors 0-1)
2MB - Unordered  
2MB - Ordered  
Intel Paragon XPS/150 MP
OSF/MPI (processors 0-1)
2MB - Unordered  128KB - Unordered  8KB - Unordered 
2MB - Ordered   128KB - Ordered   8KB - Ordered  
OSF/NX (processors 0-1)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
Intel Paragon XPS/140 GP
SUNMOS (processors 0-1)
1MB - Unordered  
1MB - Ordered  
SGI Origin2000-195
MPI (processors 0-2)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  
SHMEM (processors 0-2)
2MB - Unordered   128KB - Unordered   8KB - Unordered  
2MB - Ordered   128KB - Ordered   8KB - Ordered  

COMMTEST Performance Page


Patrick H. Worley / ( worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 09:59:57 EDT.
8303 accesses since 1/2/96.