Allreduce Performance Evaluation

Evaluation of Early Systems

Optimal MPI_ALLREDUCE Processor Performance

This experiment compares the optimal MPI_ALLREDUCE performance observed when assigning an MPI process to each processor in an SMP node. This is a standard way of using these systems, and it includes the effect of communication between processors in the same SMP node. While a pure MPI implementation may not be the most efficient way of using a cluster of SMPs, this experiment does not address that particular question. The platforms were measured up to different maximum numbers of processors (120, 240, 512, and 1024). Two graphs are presented for each vector length, one focusing on the common interval of 0-120 processors and one displaying all of the data.
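As a reminder of what is being timed, the following is a minimal sketch of the allreduce (SUM) semantics: every process contributes a value and every process ends up with the total. It simulates the processes in a single Python program using a recursive-doubling exchange pattern. The function name `allreduce_sum` and the choice of recursive doubling are assumptions made for illustration only; nothing here describes how MPI_ALLREDUCE is implemented on any of the systems evaluated.

```python
def allreduce_sum(values):
    """Simulate allreduce(SUM) over len(values) processes by recursive doubling.

    values[r] is the contribution of simulated process r. For simplicity this
    sketch requires the process count to be a power of two.
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("process count must be a power of two")
    current = list(values)
    distance = 1
    while distance < p:
        # In each round, process r combines its partial sum with that of
        # its partner, process (r XOR distance).
        current = [current[r] + current[r ^ distance] for r in range(p)]
        distance *= 2
    return current


# Four simulated processes contributing 1.0, 2.0, 3.0 and 4.0.
result = allreduce_sum([1.0, 2.0, 3.0, 4.0])
print(result)  # every simulated process holds the same total, 10.0
```

With p processes this pattern takes log2(p) exchange rounds, which is one reason allreduce time is expected to grow with processor count.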

Allreduce (SUM) of a single REAL*8

The optimal MPI_ALLREDUCE processor results are almost identical to the optimal allreduce processor results for the 8 byte vector case, and MPI_ALLREDUCE appears to be a good choice for all systems. Version 2.0 of the AlphaServer SC software shows significant improvement over the original software, and the AlphaServer SC shows the best performance for this vector length. Even the AlphaServer SC with the original system software appears to have better scaling behavior than the IBM systems. The performance of the two IBM systems is similar, and the IBM systems show increased "noisiness" when compared with the (already noisy) optimal allreduce node performance results. The scalability looks worse than in the node-only experiment, but the timings for small processor counts reflect communication within a node. To be fair, scalability should be measured relative to the 8 processor timings for the AlphaServer SC and Winterhawk II systems and the 32 processor timings for the Nighthawk II system. By this metric, the performance degradation for the Nighthawk II system is less than a factor of 10, but the downward trend still appears to be steeper than for the node-only results.
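The baseline-relative scaling metric described above can be sketched as follows. The sketch assumes that performance is recorded as time per allreduce, so that degradation is the ratio of a timing to the baseline timing. The function name `degradation_factor` and the timing values are hypothetical and are not measurements from this evaluation.

```python
def degradation_factor(timings, baseline_procs):
    """Return {processor count: time / baseline time} for counts >= baseline.

    timings maps a processor count to a time per allreduce (any unit).
    """
    base = timings[baseline_procs]
    return {
        procs: time / base
        for procs, time in sorted(timings.items())
        if procs >= baseline_procs
    }


# Hypothetical timings in microseconds, NOT data from this study.
example_timings = {16: 20.0, 32: 40.0, 64: 80.0, 128: 200.0}

# Use the 32 processor timing as the baseline, as suggested for Nighthawk II.
factors = degradation_factor(example_timings, baseline_procs=32)
print(factors)
```

Processor counts below the baseline are dropped because, per the discussion above, they reflect communication within a node rather than between nodes.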

Allreduce (SUM) of a REAL*8 vector of length 1024

MPI_ALLREDUCE continues to be a good choice on the AlphaServer SC for 8KB vectors. In contrast, MPI_ALLREDUCE on the IBM systems does not perform well for this vector length. Consequently, the Compaq system is the better performer. The improvement due to the new AlphaServer SC software is still significant. The high bandwidth (?) within a 16-way Nighthawk II node gives it a boost for small processor counts, but it is never better than MPI_ALLREDUCE on the AlphaServer SC with the new software. The IBM results are less noisy than in the 8 byte vector case, and, measured against the 32 processor results, the Nighthawk II scaling seems somewhat improved over the 8 byte vector results. The Nighthawk II results are significantly better than the Winterhawk II results.

Allreduce (SUM) of a REAL*8 vector of length 262144

MPI_ALLREDUCE is not a good choice on any of the systems for 2MB vectors. However, it is a much poorer choice on the IBM systems, and MPI_ALLREDUCE on the AlphaServer SC with the new software demonstrates the best performance. In contrast, the AlphaServer SC with the old software is the slowest. Note that these results are significantly different from the optimal allreduce processor results. Despite its poor performance, the Nighthawk II system scales reasonably well, while the Winterhawk II system does not. The scaling of the AlphaServer SC system with the old software is similar to that in the optimal allreduce processor results: it is not very good, but it does appear to level off for large processor counts.
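The message sizes quoted in the three experiments (8 bytes, 8KB, and 2MB) follow directly from the vector lengths, since a REAL*8 occupies 8 bytes. A short check, with `REAL8_BYTES` and `message_bytes` as names chosen here for illustration:

```python
REAL8_BYTES = 8  # size of one Fortran REAL*8 element in bytes


def message_bytes(vector_length):
    """Return the size in bytes of a REAL*8 vector of the given length."""
    return vector_length * REAL8_BYTES


# The three vector lengths used in the experiments above.
for length in (1, 1024, 262144):
    print(length, message_bytes(length))
# 1      -> 8 bytes
# 1024   -> 8192 bytes    (8 KB)
# 262144 -> 2097152 bytes (2 MB)
```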

URL http://www.csm.ornl.gov/evaluation/ALLREDUCE/opt013_processor_allreduce.html
Updated: Thursday, 23-Aug-2001 14:32:11 EDT