Average MPI_ALLREDUCE Node Performance
This experiment compares the average MPI_ALLREDUCE performance observed when using only one MPI process per node. Because different algorithms are optimal for different node counts and vector lengths, it is difficult to assess average performance using the "optimal algorithms". MPI_ALLREDUCE is the routine that most users would call in practice, so understanding its average performance has intrinsic value. Moreover, the large oversampling (120 timings per vector length and node count) makes it possible to assess typical performance.
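The distinction between average and best-case timings can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: it assumes that "optimal" refers to the fastest (minimum-time) sample for a given node count and vector length, the helper name `summarize_timings` is hypothetical, and the timing values are invented purely to show the arithmetic.

```python
from statistics import mean

SAMPLES_PER_POINT = 120  # timings per (vector length, node count) pair, as stated above


def summarize_timings(timings):
    """Return (average, best) time for one (node count, vector length) pair.

    'best' is the minimum observed time; treating it as the 'optimal'
    result is an assumption of this sketch.
    """
    if not timings:
        raise ValueError("need at least one timing")
    return mean(timings), min(timings)


# Hypothetical timings in microseconds (NOT measured data): mostly steady,
# with one unusually fast sample and one unusually slow sample.
samples = [52.0] * 118 + [40.0, 90.0]
assert len(samples) == SAMPLES_PER_POINT

avg, best = summarize_timings(samples)
print(f"average: {avg:.2f} us, best: {best:.2f} us")
```

With this many samples, a single unusually fast timing sets the best-case value outright but shifts the average only slightly, which is why the average is the better indicator of typical performance.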
This experiment represents a lower bound on the performance when using multiple MPI processes per node and when computing the local part of the reduction using a shared memory algorithm. It also emphasizes differences in the performance of the interconnect for this operator, uncontaminated by contention for network access between MPI processes on the same SMP node and by communication between processes on the same node. Results are presented as a function of node count for three different allreduce vector lengths. The platforms have differing maximum node counts (60, 64, and 128). Two graphs are therefore presented for each vector length, one focusing on the common interval of 0-60 nodes and one displaying all of the data.
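As a quick check on the sizes involved, the sketch below converts the three vector lengths reported in this section (1, 1024, and 262144 REAL*8 elements) into bytes and derives the upper end of the common node interval from the three maximum node counts. The function and variable names are illustrative only.

```python
REAL8_BYTES = 8  # a REAL*8 value occupies 8 bytes


def vector_bytes(length):
    """Size in bytes of a REAL*8 vector with the given number of elements."""
    return length * REAL8_BYTES


# Vector lengths used for the three sets of graphs in this section.
for length in (1, 1024, 262144):
    print(f"{length:>6} elements = {vector_bytes(length)} bytes")

# Maximum node counts of the platforms, as stated above; the common
# interval ends at the smallest of them.
max_nodes = (60, 64, 128)
common_max = min(max_nodes)
print(f"common interval: 0-{common_max} nodes")
```

The three lengths correspond to 8 bytes, 8KB (8192 bytes), and 2MB (2097152 bytes), matching the labels used in the discussion of each graph.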
Allreduce (SUM) of a single REAL*8
![]()
![]()
The average MPI_ALLREDUCE node results are very similar to the optimal MPI_Allreduce node results for the 8 byte vector case. The noisiness seen in the optimal results for the IBM systems is significantly reduced in the average results, primarily due to the elimination of high-performance "spikes". However, the overall performance and performance trends are remarkably unaffected.
Allreduce (SUM) of a REAL*8 vector of length 1024
![]()
![]()
The average MPI_ALLREDUCE node results are also very similar to the optimal MPI_Allreduce node results for the 8KB vector case. The average results are noisier than the optimal MPI_Allreduce node results for the Compaq systems, but noise is not significant in the IBM results. Scalability is still good on the AlphaServer SC.
Allreduce (SUM) of a REAL*8 vector of length 262144
![]()
![]()
The average MPI_ALLREDUCE node results are essentially identical to the optimal MPI_Allreduce node results for the 2MB vector case.