COMMTEST Performance Studies

Performance Studies using

COMMTEST


COMMTEST is a message-passing benchmark code that measures the time to send messages between one or more processors using the same communication primitives used by PSTSWM and CCM/MP-2D.

Performance-critical interprocessor communication in PSTSWM and CCM/MP-2D is implemented using two basic types of commands: SWAP and SENDRECV. The message-passing transport layer used to implement these commands is specified at compile time, while the protocol used is specified at runtime.

COMMTEST measures the performance of the SWAP and SENDRECV commands, varying the number of communication requests for a fixed data volume. For example, if a 2MB message is being exchanged, this can be sent as a single message of size 2MB or as 1024 "packets", each of size 2KB. COMMTEST takes as input both the total message size and the minimum packet length. It then runs multiple experiments, sending the message as one packet, as two packets, etc., until it sends messages using the minimum packet length.

By varying the number of messages and message size, but leaving the total volume fixed, we can easily observe both the constant overhead in sending and receiving messages (looking at differences between measurements) and the achieved bandwidth. If communication performance satisfies a "latency, bandwidth" linear model, these parameters would be determined exactly, but the effects of changes in protocol, contention for shared resources, and the memory hierarchy make the observed parameters functions of the message packet size (and the state of the system). Spot estimates of the parameters are still of interest, and the minimum observed execution time can be used to calculate the maximum achieved bandwidth for a given experiment.

This approach also has the practical advantage of not requiring the measurement of either very small or very large execution times. For the larger message counts, the experiments are similar to the typical measurements of communication cost using repeated sends and receives. Note however that they move linearly through memory, with each send and receive accessing unique memory locations. For the smaller message counts, the larger overhead of the "first" call (instruction cache miss) becomes more noticeable, but the smaller message counts involve the largest messages, so the data movement itself should dominate the cost. There is also some intrinsic interest in determining whether it is better to explicitly send a large message in packets, or whether the message-passing transport layer takes care of such optimization issues automatically.

Two general classes of protocols are used: unordered (ping-ping) and ordered (ping-pong). While not all protocols are available for all message-passing transport layers, they are drawn from those described below. Examples are given using MPI commands for SWAP. Note that the examples have been simplified (to save room) and do not accurately represent the MPI implementations. In particular, handshaking messages required for correct use of the ready send command have been omitted.

Unordered

 

Ordered

(0,0): simple
Processor 1 and 2
MPI_BSEND
MPI_RECV
(1,0): simple
Processor 1 Processor 2
MPI_SEND MPI_RECV
MPI_RECV MPI_SEND
(0,1): nonblocking send
Processor 1 and 2
MPI_ISEND
MPI_RECV
(1,1): nonblocking send
Processor 1 Processor 2
MPI_ISEND MPI_RECV
MPI_RECV MPI_SEND
(0,2): nonblocking receive
Processor 1 and 2
MPI_IRECV
MPI_SEND
(1,2): nonblocking receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_SEND MPI_SEND
(0,3): nonblocking send and receive
Processor 1 and 2
MPI_IRECV
MPI_ISEND
(1,3): nonblocking send and receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_ISEND MPI_SEND
(0,4): ready send
Processor 1 and 2
MPI_IRECV
MPI_RSEND
(1,4): ready send
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_RSEND MPI_RSEND
(0,5): nonblocking ready send
Processor 1 and 2
MPI_IRECV
MPI_IRSEND
(1,5): nonblocking ready send
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_IRSEND MPI_RSEND
(0,6): native sendrecv
Processor 1 and 2
MPI_SENDRECV
 
 
 
(1,6): synchronous
Processor 1 Processor 2
- MPI_RECV
MPI_SEND -
- MPI_SEND
MPI_RECV -
(0,7): nonblocking sync. send
Processor 1 and 2
MPI_ISSEND
MPI_RECV
(1,7): nonblocking sync. send
Processor 1 Processor 2
MPI_ISSEND MPI_RECV
MPI_RECV MPI_SSEND
(0,8): sync. send and nonblocking receive
Processor 1 and 2
MPI_IRECV
MPI_SSEND
(1,8): sync. send and nonblocking receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_SSEND MPI_SSEND
(0,9): nonblocking sync. send and receive
Processor 1 and 2
MPI_IRECV
MPI_ISSEND
(1,9): nonblocking sync. send and receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_ISSEND MPI_SSEND
 
 
 
 
(1,10): sync. send
Processor 1 Processor 2
MPI_SSEND MPI_RECV
MPI_RECV MPI_SSEND
These protocols are described in more detail in

available from here.

COMMTEST is actually two separate programs. The first uses the basic SWAP or SENDRECV command to exchange the data. The second reorders the elements of the SWAP or SENDRECV protocol to prepost the nonblocking receives and delay waiting for the nonblocking commands to complete, in an attempt to overlap some of the start-up costs of one swap with the actual transmission time of another. The reorganization may also prevent some contention for shared resources when simultaneously sending and receiving. The logic of the reorganized test is as follows:

  1. Prepost one receive request.
  2. Synchronize.
  3. Begin timing.
  4. for i = 1, number of messages
    1. Prepost receive request for next message.
    2. Begin send and complete receive for current message.
    3. Complete send for previous message.
  5. Stop timing.

Both programs can be run in three different ways:

In all cases, the maximum execution time observed by a single processor is reported. When a global clock is available, the elapsed time between the first processor beginning the experiment and the last processor completing the experiment is recorded, and significant discrepancies from the single processor timing are noted.

In addition to total message size, minimum packet length, number of iterations, communication protocols, and whether to use SWAP or SENDRECV, COMMTEST input parameters include the distance between communicating nodes and whether all nodes are communicating or just node zero and its partners. For example, two nodes can be used when exchanging messages in a SWAP test, or all pairs of nodes can be used. When using SENDRECV, the participating nodes must form a cycle, and all nodes must be in some cycle.

Studies that have used COMMTEST include

Worley's Performance Studies Page


Patrick H. Worley / ( worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 09:58:48 EDT.
7392 accesses since 1/2/96.