ALLREDUCE Performance Studies

Performance Studies using

ALLREDUCE


ALLREDUCE is a message-passing benchmark code that measures the time to perform the allreduce collective operator. In a reduction, each process in a group owns a vector whose elements are combined (using some commutative and associative operator) with the corresponding elements owned by other processes, producing a new vector of the same length. An allreduce is a reduction in which the result is replicated on all members of the original group of processes. The allreduce as used in scientific computing can be applied to both very short and very long vectors. A common short vector example is calculating the inner products found in conjugate gradient-like methods for solving linear systems. For short vectors, the dominant cost is typically the start-up cost of the messages required to implement the allreduce. A long vector example is the summation used to complete the Legendre transform in a distributed spectral transform algorithm, such as used in certain parallel algorithms in PSTSWM and CCM/MP-2D. ~\cite{Drake99,FosterWorley97}, which converts fields from a grid point representation to a spectral representation and back again during each timestep of a simulation. As three dimensional fields are being transformed, the vectors can be very long, and the interprocessor or network bandwidth is often the performance bottleneck. send messages between one or more processors using the same communication primitives used by PSTSWM and CCM/MP-2D.

Performance-critical interprocessor communication in PSTSWM and CCM/MP-2D is implemented using two basic types of commands: SWAP and SENDRECV. The message-passing transport layer used to implement these commands is specified at compile time, while the protocol used is specified at runtime.

COMMTEST measures the performance of the SWAP and SENDRECV commands, varying the number of communication requests for a fixed data volume. For example, if a 2MB message is being exchanged, this can be sent as a single message of size 2MB or as 1024 "packets", each of size 2KB. COMMTEST takes as input both the total message size and the minimum packet length. It then runs multiple experiments, sending the message as one packet, as two packets, etc., until it sends messages using the minimum packet length.

By varying the number of messages and message size, but leaving the total volume fixed, we can easily observe both the constant overhead in sending and receiving messages (looking at differences between measurements) and the achieved bandwidth. If communication performance satisfies a "latency, bandwidth" linear model, these parameters would be determined exactly, but the effects of changes in protocol, contention for shared resources, and the memory hierarchy make the observed parameters functions of the message packet size (and the state of the system). Spot estimates of the parameters are still of interest, and the minimum observed execution time can be used to calculate the maximum achieved bandwidth for a given experiment.

This approach also has the practical advantage of not requiring the measurement of either very small or very large execution times. For the larger message counts, the experiments are similar to the typical measurements of communication cost using repeated sends and receives. Note however that they move linearly through memory, with each send and receive accessing unique memory locations. For the smaller message counts, the larger overhead of the "first" call (instruction cache miss) becomes more noticeable, but the smaller message counts involve the largest messages, so the data movement itself should dominate the cost. There is also some intrinsic interest in determining whether it is better to explicitly send a large message in packets, or whether the message-passing transport layer takes care of such optimization issues automatically.

Two general classes of protocols are used: unordered (ping-ping) and ordered (ping-pong). While not all protocols are available for all message-passing transport layers, they are drawn from those described below. Examples are given using MPI commands for SWAP. Note that the examples have been simplified (to save room) and do not accurately represent the MPI implementations. In particular, handshaking messages required for correct use of the ready send command have been omitted.

Unordered

 

Ordered

(0,0): simple
Processor 1 and 2
MPI_BSEND
MPI_RECV
(1,0): simple
Processor 1 Processor 2
MPI_SEND MPI_RECV
MPI_RECV MPI_SEND
(0,1): nonblocking send
Processor 1 and 2
MPI_ISEND
MPI_RECV
(1,1): nonblocking send
Processor 1 Processor 2
MPI_ISEND MPI_RECV
MPI_RECV MPI_SEND
(0,2): nonblocking receive
Processor 1 and 2
MPI_IRECV
MPI_SEND
(1,2): nonblocking receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_SEND MPI_SEND
(0,3): nonblocking send and receive
Processor 1 and 2
MPI_IRECV
MPI_ISEND
(1,3): nonblocking send and receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_ISEND MPI_SEND
(0,4): ready send
Processor 1 and 2
MPI_IRECV
MPI_RSEND
(1,4): ready send
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_RSEND MPI_RSEND
(0,5): nonblocking ready send
Processor 1 and 2
MPI_IRECV
MPI_IRSEND
(1,5): nonblocking ready send
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_IRSEND MPI_RSEND
(0,6): native sendrecv
Processor 1 and 2
MPI_SENDRECV
 
 
 
(1,6): synchronous
Processor 1 Processor 2
- MPI_RECV
MPI_SEND -
- MPI_SEND
MPI_RECV -
(0,7): nonblocking sync. send
Processor 1 and 2
MPI_ISSEND
MPI_RECV
(1,7): nonblocking sync. send
Processor 1 Processor 2
MPI_ISSEND MPI_RECV
MPI_RECV MPI_SSEND
(0,8): sync. send and nonblocking receive
Processor 1 and 2
MPI_IRECV
MPI_SSEND
(1,8): sync. send and nonblocking receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_SSEND MPI_SSEND
(0,9): nonblocking sync. send and receive
Processor 1 and 2
MPI_IRECV
MPI_ISSEND
(1,9): nonblocking sync. send and receive
Processor 1 Processor 2
MPI_IRECV MPI_RECV
MPI_ISSEND MPI_SSEND
 
 
 
 
(1,10): sync. send
Processor 1 Processor 2
MPI_SSEND MPI_RECV
MPI_RECV MPI_SSEND
These protocols are described in more detail in

available from here.

COMMTEST is actually two separate programs. The first uses the basic SWAP or SENDRECV command to exchange the data. The second reorders the elements of the SWAP or SENDRECV protocol to prepost the nonblocking receives and delay waiting for the nonblocking commands to complete, in an attempt to overlap some of the start-up costs of one swap with the actual transmission time of another. The reorganization may also prevent some contention for shared resources when simultaneously sending and receiving. The logic of the reorganized test is as follows:

  1. Prepost one receive request.
  2. Synchronize.
  3. Begin timing.
  4. for i = 1, number of messages
    1. Prepost receive request for next message.
    2. Begin send and complete receive for current message.
    3. Complete send for previous message.
  5. Stop timing.

Both programs can be run in three different ways:

In all cases, the maximum execution time observed by a single processor is reported. When a global clock is available, the elapsed time between the first processor beginning the experiment and the last processor completing the experiment is recorded, and significant discrepancies from the single processor timing are noted.

In addition to total message size, minimum packet length, number of iterations, communication protocols, and whether to use SWAP or SENDRECV, COMMTEST input parameters include the distance between communicating nodes and whether all nodes are communicating or just node zero and its partners. For example, two nodes can be used when exchanging messages in a SWAP test, or all pairs of nodes can be used. When using SENDRECV, the participating nodes must form a cycle, and all nodes must be in some cycle.

Studies that have used COMMTEST include

Worley's Performance Studies Page


Patrick H. Worley / ( worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 09:58:37 EDT.
5987 accesses since 1/2/96.