
Allreduce Performance Evaluation

Evaluation of Early Systems

Average MPI_ALLREDUCE Processor Performance

This experiment compares the average MPI_ALLREDUCE performance observed when assigning an MPI process to each processor in an SMP node. Because different algorithms are optimal for different node counts and vector lengths, it is difficult to assess average performance using the "optimal algorithms". MPI_ALLREDUCE is the routine that most users would call in practice, so understanding its average performance has intrinsic value. Moreover, the large oversampling (120 timings per vector length and node count) makes it possible to assess typical performance.
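As a rough illustration of how oversampled timings can be reduced to an average and a best-case figure, the following Python sketch summarizes one set of repeated measurements. It assumes that "optimal" means the fastest of the repeated timings, which this page does not state explicitly; the function name `summarize` and the sample values are hypothetical, not data from the experiment.

```python
from statistics import mean


def summarize(timings):
    """Return (average, best) elapsed time for one set of repeated timings.

    Assumption: "best" is the minimum observed time, used here as a
    stand-in for the optimal performance discussed in the text.
    """
    if not timings:
        raise ValueError("need at least one timing")
    return mean(timings), min(timings)


# Hypothetical timings in microseconds for a single vector length and
# processor count; the experiment itself used 120 timings per combination.
samples = [50.0, 52.0, 50.0, 200.0, 48.0]
average, best = summarize(samples)
print(f"average = {average:.1f} us, best = {best:.1f} us")
```

A single slow sample (200.0 here) pulls the average well above the best case, which is the kind of effect the averaged graphs are meant to expose.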

This experiment (using all processors in an SMP node) represents a standard way that the systems are used, and includes the effect of communicating between processors in the same SMP node. While a pure MPI implementation may not be the most efficient way of using a cluster of SMPs, this experiment does not address that particular question. The platforms have differing maximum numbers of processors (120, 240, 512, and 1024), so two graphs are presented for each vector length: one focusing on the common interval of 0-120 processors and one displaying all of the data.

Allreduce (SUM) of a single REAL*8

The effect of averaging is system dependent for the 8-byte vectors. Average and optimal MPI_ALLREDUCE processor performance on the AlphaServer SC using the new system software are very similar. In contrast, with the old system software, there were significant "events" that degraded performance by as much as a factor of 4, and the frequency of these events increases as the processor count increases. For the IBM systems, the average behavior is significantly worse than the optimal for all processor counts. The average performance on the Winterhawk II system "stabilizes" with a performance degradation of around 50%. The Nighthawk II performance variability only increases with high processor counts, resulting in both erratic behavior and poor scalability.
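The figures quoted above ("a factor of 4", "around 50%") can be related to each other with a small Python sketch. The page does not define how degradation is measured, so this assumes that performance is inversely proportional to elapsed time; the helper names `slowdown` and `degradation` are hypothetical.

```python
def slowdown(avg_time, best_time):
    """Factor by which the average time exceeds the best time."""
    return avg_time / best_time


def degradation(avg_time, best_time):
    """Fractional loss in performance, assuming performance ~ 1 / time."""
    return 1.0 - best_time / avg_time


# Under this assumption, a factor-of-4 slowdown is a 75% loss in
# performance, and a 50% loss corresponds to a factor-of-2 slowdown.
print(slowdown(4.0, 1.0), degradation(4.0, 1.0))
print(slowdown(2.0, 1.0), degradation(2.0, 1.0))
```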

Allreduce (SUM) of a REAL*8 vector of length 1024

In contrast to the 8-byte vector results, average and optimal MPI_ALLREDUCE processor performance for 8 KB vectors on the AlphaServer SC differ greatly (for both old and new software). This degradation drops the AlphaServer SC performance significantly behind that of the two IBM systems, and both the old and new system software demonstrate very poor scalability. The effect of averaging on the performance of the two IBM systems is similar to that for the 8-byte vector, but with somewhat decreased relative performance degradation and variation. The Winterhawk II system demonstrates the best scalability among all of the systems.

Allreduce (SUM) of a REAL*8 vector of length 262144

Graphs of average and optimal MPI_ALLREDUCE processor performance for 2 MB vectors are qualitatively identical for all systems, with excellent scalability for the AlphaServer SC (Version 2.0) and Nighthawk II systems. The average performance is somewhat less than the optimal, but typically by no more than 25% for any of the systems.
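The message sizes quoted in the three sections (8 bytes, 8 KB, and 2 MB) follow from the REAL*8 vector lengths. The Python sketch below checks that arithmetic, assuming 8 bytes per REAL*8 element and binary units (1 KB = 1024 bytes); the names `REAL8_BYTES` and `vector_bytes` are hypothetical.

```python
REAL8_BYTES = 8  # a REAL*8 value occupies 8 bytes


def vector_bytes(length):
    """Size in bytes of a REAL*8 vector with the given number of elements."""
    return length * REAL8_BYTES


# Vector lengths used in the three experiments on this page.
for length in (1, 1024, 262144):
    print(length, vector_bytes(length))
```

With these assumptions, 1024 elements give 8192 bytes (8 KB) and 262144 elements give 2097152 bytes (2 MB).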


URL http://www.csm.ornl.gov/evaluation/ALLREDUCE/avg013_processor_allreduce.html
Updated: Thursday, 23-Aug-2001 15:04:36 EDT