Optimal MPI_ALLREDUCE Node Performance
This experiment compares the optimal MPI_ALLREDUCE performance observed when using only one MPI process per node. This represents a lower bound on the performance when using multiple MPI processes per node or when computing the local part of the reduction using a shared memory algorithm. It also emphasizes differences in the performance of the interconnect for this operator, uncontaminated by contention for network access between MPI processes on the same SMP node or by communication between processes on the same node. Results are presented as a function of node count for three different allreduce vector lengths. The maximum node count differs between platforms (60, 64, and 128). Two graphs are presented for each vector length: one focusing on the common interval of 0-60 nodes and one displaying all of the data.
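The three vector lengths used below correspond to the message sizes quoted in the discussion (8 bytes, 8KB, and 2MB). A minimal sketch of that arithmetic, assuming only that a REAL*8 element occupies 8 bytes; the names `REAL8_BYTES` and `message_bytes` are illustrative and not part of the benchmark code:

```python
REAL8_BYTES = 8  # assumption: a REAL*8 is an 8-byte double precision value


def message_bytes(length):
    """Payload size in bytes of a REAL*8 vector with `length` elements."""
    return length * REAL8_BYTES


# The three vector lengths reported in this section.
for length in (1, 1024, 262144):
    size = message_bytes(length)
    print(f"length {length:>6}: {size} bytes ({size / 1024:g} KB)")
```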
Allreduce (SUM) of a single REAL*8
![]()
![]()
The optimal MPI_ALLREDUCE node results are almost identical to the optimal allreduce node results for the 8-byte vector case, so MPI_ALLREDUCE appears to be a good choice on all systems for small vector lengths. The AlphaServer SC system is the best performer, and the new Compaq software demonstrates a modest performance improvement as well as more consistent behavior. The two IBM systems show similar performance for this vector length. After an initial performance drop-off, all systems show reasonable scalability.
Allreduce (SUM) of a REAL*8 vector of length 1024
![]()
![]()
Graphs of the optimal MPI_ALLREDUCE node results are qualitatively similar to those of the optimal allreduce node results for the 8KB vector case. However, MPI_ALLREDUCE is a good choice only for the AlphaServer using Version 2.0 software. The Compaq system continues to be the best performer, and Version 2.0 of the Compaq software again shows better performance than Version 1.0. The Nighthawk II system demonstrates consistently better performance than the Winterhawk II system. The IBM results show a strong preference for power-of-two node counts. Scalability appears to be good on the AlphaServer SC. Scalability of MPI_ALLREDUCE on the IBM systems is significantly worse than that indicated by the optimal algorithms, although performance appears to degrade as log P.
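One common way an allreduce can exhibit both log P scaling and a preference for power-of-two node counts is the textbook recursive-doubling scheme. The sketch below simulates that scheme in plain Python to count communication rounds. It is an illustration only: the source does not state which algorithm the IBM MPI library used, and the function name `recursive_doubling_allreduce` is hypothetical.

```python
def recursive_doubling_allreduce(values):
    """Simulate a sum-allreduce over len(values) ranks.

    Returns (results, rounds), where results[r] is the value held by rank r
    at the end and rounds is the number of communication rounds used.

    Assumption: textbook recursive doubling. Ranks beyond the largest power
    of two first fold their data into a partner, the power-of-two core then
    exchanges pairwise for log2(core) rounds, and the result is sent back.
    """
    p = len(values)
    data = list(values)

    # Largest power of two not exceeding p.
    core = 1
    while core * 2 <= p:
        core *= 2
    extra = p - core
    rounds = 0

    # Fold-in: rank core+i sends its value to rank i (one extra round).
    if extra:
        for i in range(extra):
            data[i] += data[core + i]
        rounds += 1

    # Pairwise exchange within the core: partner of rank r is r XOR dist.
    dist = 1
    while dist < core:
        data[:core] = [data[r] + data[r ^ dist] for r in range(core)]
        dist *= 2
        rounds += 1

    # Fold-out: core ranks return the final sum to the extra ranks.
    if extra:
        for i in range(extra):
            data[core + i] = data[i]
        rounds += 1

    return data, rounds


for p in (4, 5, 8):
    results, rounds = recursive_doubling_allreduce(list(range(1, p + 1)))
    print(f"P={p}: rounds={rounds}, every rank holds {results[0]}")
```

In this model, 5 ranks need more rounds than 8 ranks because of the fold-in and fold-out steps, which is one plausible source of the sawtooth seen at non-power-of-two node counts.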
Allreduce (SUM) of a REAL*8 vector of length 262144
![]()
![]()
Comparison with the 2MB vector optimal allreduce node results makes it clear that MPI_ALLREDUCE is not the optimal algorithm for most of the timings on any of the systems. This is especially true on the AlphaServer SC, where MPI_ALLREDUCE performance is worse than that of the Winterhawk II system (in contrast to the optimal allreduce node results). Performance of MPI_ALLREDUCE is dominated by bandwidth, and the Colony switch and dual adapters of the Nighthawk II system show their advantage here. Version 2.0 of the Compaq software continues to clearly outperform Version 1.0. The preference for power-of-two node counts on the IBM systems is no longer apparent, indicating a change in algorithm for large vector lengths. Note, however, that the optimal algorithms still show this preference. Scalability is worse for MPI_ALLREDUCE than for the optimal algorithms on both the AlphaServer SC and Winterhawk II systems. Nighthawk II scalability is excellent.
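The shift from latency-dominated to bandwidth-dominated behavior, and the reason a library might switch algorithms for long vectors, can be illustrated with a simple latency/bandwidth cost model. This is a sketch under stated assumptions: the two algorithms compared are standard textbook choices rather than the ones known to be used on these systems, reduction arithmetic is ignored, and the `ALPHA` and `BETA` values are hypothetical rather than measured.

```python
import math


def t_recursive_doubling(p, nbytes, alpha, beta):
    """Modeled time when every round sends the full vector."""
    rounds = math.ceil(math.log2(p))
    return rounds * (alpha + nbytes * beta)


def t_reduce_scatter_allgather(p, nbytes, alpha, beta):
    """Modeled time for a reduce-scatter followed by an allgather.

    Each of the two phases takes log2(p) rounds but moves only (p-1)/p of
    the vector in total, so the bandwidth term does not grow with log2(p).
    """
    rounds = math.ceil(math.log2(p))
    return 2 * rounds * alpha + 2 * (p - 1) / p * nbytes * beta


ALPHA = 20e-6     # hypothetical per-message latency in seconds
BETA = 1 / 100e6  # hypothetical per-byte transfer time (100 MB/s)

# Message sizes matching the three vector lengths in this section.
for nbytes in (8, 8 * 1024, 2 * 1024 * 1024):
    rd = t_recursive_doubling(64, nbytes, ALPHA, BETA)
    rsag = t_reduce_scatter_allgather(64, nbytes, ALPHA, BETA)
    winner = "recursive doubling" if rd <= rsag else "reduce-scatter + allgather"
    print(f"{nbytes:>8} bytes: {winner} is faster in this model")
```

With these illustrative parameters the latency term decides the 8-byte case, while the per-byte term decides the 2MB case, which is consistent with a different algorithm being preferable at large vector lengths.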