Allreduce Performance Evaluation

Evaluation of Early Systems

Optimal Allreduce Node Performance

This experiment compares the optimal performance observed when using only one MPI process per node. This represents a lower bound on the performance when using multiple MPI processes per node or when computing the local part of the reduction using a shared memory algorithm. It also emphasizes differences in the performance of the interconnect for this operator, uncontaminated by contention for network access between MPI processes on the same SMP node or by communication between MPI processes on the same node. Results are presented as a function of node count for three different allreduce vector lengths. The maximum node count differs across the platforms (60, 64, and 128), so two graphs are presented for each vector length: one focusing on the common interval of 0-60 nodes and one displaying all of the data.
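
As a rough illustration of the operation being timed, the sketch below models allreduce (SUM) in plain Python and converts the three REAL*8 vector lengths used in this evaluation into byte counts. It is not the benchmark code. The names `allreduce_sum` and `vector_bytes` are hypothetical, and the sketch assumes that one REAL*8 element occupies 8 bytes.

```python
REAL8_BYTES = 8  # assumption: one REAL*8 element occupies 8 bytes


def vector_bytes(length):
    """Size in bytes of a REAL*8 vector with `length` elements."""
    return length * REAL8_BYTES


def allreduce_sum(per_process_vectors):
    """Model allreduce (SUM): every process receives the elementwise sum.

    `per_process_vectors` holds one equal-length vector per MPI process.
    """
    total = [sum(column) for column in zip(*per_process_vectors)]
    # Each process gets its own copy of the reduced vector.
    return [list(total) for _ in per_process_vectors]


if __name__ == "__main__":
    for length in (1, 1024, 262144):
        print(length, "elements =", vector_bytes(length), "bytes")
    print(allreduce_sum([[1.0, 2.0], [3.0, 4.0]]))
```

The three lengths correspond to 8 bytes, 8KB, and 2MB, which is how the vectors are referred to in the sections that follow.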

Allreduce (SUM) of a single REAL*8

Performance of allreduce on an 8 byte vector should be dominated by latency. By this measure, the Quadrics switch provides consistently lower latency than either of the IBM switches. The new Compaq software demonstrates a modest performance improvement, as well as more consistent behavior. The two IBM systems show similar performance for this vector length. After an initial performance drop-off, all systems show reasonable scalability. The "noisiness" in the IBM results is very high, however.
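
One way to see why an 8 byte allreduce is expected to be latency bound is a simple latency/bandwidth cost model, time = alpha + n * beta, for a single message of n bytes. The sketch below is illustrative only: the values of `ALPHA` and `BETA` are made-up placeholders, not measurements from any of the systems evaluated here, and the function names are hypothetical.

```python
ALPHA = 20e-6  # hypothetical per-message latency in seconds (placeholder)
BETA = 10e-9   # hypothetical transfer time per byte in seconds (placeholder)


def message_time(nbytes, alpha=ALPHA, beta=BETA):
    """Modeled time for one message: fixed latency plus per-byte cost."""
    return alpha + nbytes * beta


def latency_fraction(nbytes, alpha=ALPHA, beta=BETA):
    """Share of the modeled message time that is due to latency alone."""
    return alpha / message_time(nbytes, alpha, beta)


if __name__ == "__main__":
    # 8 bytes, 8KB, and 2MB: the three vector sizes in this evaluation.
    for nbytes in (8, 8192, 2097152):
        share = latency_fraction(nbytes)
        print(nbytes, "bytes: latency share =", round(share, 3))
```

Under these placeholder values, nearly all of the modeled time for the 8 byte case is latency, while the 2MB case is almost entirely transfer time. Real allreduce cost also depends on the number of communication steps, which this single-message model ignores.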

Allreduce (SUM) of a REAL*8 vector of length 1024

Performance of allreduce on an 8KB vector appears to be sensitive to both latency and bandwidth. The Compaq system continues to be the better performer, although by a smaller margin. Version 2.0 of the Compaq software continues to show better and more regular performance than version 1.0. The Nighthawk II system demonstrates consistently better performance than the Winterhawk II system. A strong preference for power-of-two node counts is clear from the IBM results. Scalability is good after the initial performance degradation. The "noisiness" in the IBM results is significantly less than for the 8 byte vector.
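
The evaluation does not say which allreduce algorithm the IBM MPI library uses. One common scheme that would favor power-of-two node counts is recursive doubling, in which other counts need an extra fold-in step before the exchange rounds and a copy-out step after them. The sketch below simulates that scheme on integers, purely as an assumption for illustration; `recursive_doubling_allreduce` is a hypothetical name.

```python
def recursive_doubling_allreduce(values):
    """Simulate a recursive-doubling allreduce (SUM), one value per process.

    Returns (final_values, rounds), where rounds counts communication steps.
    """
    p = len(values)
    if p == 0:
        raise ValueError("need at least one process")
    pof2 = 1 << (p.bit_length() - 1)  # largest power of two <= p
    extra = p - pof2                  # processes beyond that power of two
    vals = list(values)
    rounds = 0

    # Fold-in: each extra process sends its value to a partner in the core.
    if extra:
        for r in range(extra):
            vals[r] += vals[pof2 + r]
        rounds += 1

    # Core exchange: in each round, rank r pairs with rank r XOR dist.
    dist = 1
    while dist < pof2:
        nxt = vals[:]
        for r in range(pof2):
            nxt[r] = vals[r] + vals[r ^ dist]
        vals = nxt
        dist <<= 1
        rounds += 1

    # Copy-out: the extra processes receive the finished result.
    if extra:
        for r in range(extra):
            vals[pof2 + r] = vals[r]
        rounds += 1

    return vals, rounds


if __name__ == "__main__":
    for p in (32, 60, 64):
        result, rounds = recursive_doubling_allreduce(list(range(p)))
        print(p, "processes:", rounds, "rounds, result", result[0])
```

In this model 64 processes finish in 6 rounds while 60 processes need 7, so a smaller node count can cost more communication steps than the next power of two.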

Allreduce (SUM) of a REAL*8 vector of length 262144

Performance of allreduce on a 2MB vector appears to be dominated by bandwidth. The Colony switch and dual adapters of the Nighthawk II system show their advantage here, being as much as twice as fast as the AlphaServer SC. The AlphaServer SC is only slightly better than the Winterhawk II system for this problem size. The two versions of the Compaq software have similar performance, indicating no improvement in bandwidth from the implementation changes. The preference for power-of-two node counts on the IBM systems is even more strongly indicated. Scalability is excellent.


URL http://www.csm.ornl.gov/evaluation/ALLREDUCE/optimal_node_allreduce.html
Updated: Thursday, 23-Aug-2001 12:22:09 EDT