Optimal Allreduce Processor Performance
This experiment compares the optimal performance observed when assigning an MPI process to each processor in an SMP node. This is a standard way of using these systems, and it includes the effect of communicating between processors in the same SMP node. While a pure MPI implementation may not be the most efficient way of using a cluster of SMPs, this experiment does not address that particular question. The platforms differ in their maximum number of processors (120, 240, 512, and 1024). Two graphs are presented for each vector length: one focusing on the common interval of 0-120 processors and one displaying all of the data.
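For readers less familiar with the operation being timed, the following is a minimal sketch of what an allreduce (SUM) computes: every process contributes a vector, and every process receives the elementwise sum. It is a plain Python simulation of the semantics only, not the MPI implementation or the benchmark code used for these results; the function name `allreduce_sum` and the sample data are hypothetical.

```python
def allreduce_sum(vectors):
    """Simulate allreduce (SUM): each rank ends up with the elementwise sum.

    `vectors` holds one equal-length list per simulated rank (hypothetical
    stand-ins for the REAL*8 vectors contributed by each MPI process).
    """
    if not vectors:
        return []
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all ranks must contribute vectors of equal length")
    # Reduce: sum the i-th element across all ranks.
    total = [sum(v[i] for v in vectors) for i in range(length)]
    # Broadcast: every rank receives its own copy of the reduced vector.
    return [list(total) for _ in vectors]


# Four simulated ranks, each contributing a vector of length 2 (made-up data).
result = allreduce_sum([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
print(result[0])  # [16.0, 20.0]
```

In a real run the reduction and broadcast involve communication both within and between SMP nodes, which is what the timings in this experiment capture.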
Allreduce (SUM) of a single REAL*8
![]()
![]()
Version 2.0 of the AlphaServer SC software shows significant improvement over the original software, and the AlphaServer SC shows the best performance for this vector length. Even the AlphaServer SC with the original system software appears to have better scaling behavior than the IBM systems. The performance of the two IBM systems is similar, and the IBM systems show increased "noisiness" when compared with the (already noisy) optimal allreduce node performance results. The scalability looks worse than in the node-only experiment, but the small processor counts reflect communication within a node. To be fair, scalability should be examined relative to the 8-processor timings for the AlphaServer SC and Winterhawk II systems and the 32-processor timings for the Nighthawk II system. By this metric, the performance degradation for the Nighthawk II system is less than a factor of 10, but the downward trend still appears to be steeper than for the node-only results.
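The baseline-relative comparison described above can be sketched as follows. This is an illustrative helper only: it assumes results are recorded as time per allreduce keyed by processor count, and the function name `degradation_factors` and all timing values are hypothetical placeholders, not measured data from this experiment.

```python
def degradation_factors(timings, baseline_procs):
    """Return time(p) / time(baseline_procs) for every p >= baseline_procs.

    `timings` maps processor count -> time per allreduce (any consistent unit).
    Counts below the baseline are skipped, since they reflect communication
    within a node rather than between nodes.
    """
    base = timings[baseline_procs]
    return {p: t / base for p, t in sorted(timings.items()) if p >= baseline_procs}


# Made-up timings (microseconds) for a system whose baseline is 32 processors.
timings = {16: 20.0, 32: 50.0, 64: 100.0, 128: 200.0, 256: 400.0}
factors = degradation_factors(timings, baseline_procs=32)
print(factors)  # {32: 1.0, 64: 2.0, 128: 4.0, 256: 8.0}
```

Choosing the baseline per system (8 or 32 processors here) keeps the intra-node timings from exaggerating the apparent loss of scalability.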
Allreduce (SUM) of a REAL*8 vector of length 1024
![]()
![]()
The improvement due to the new AlphaServer SC software is still significant for an 8KB vector, but is not as large. The (presumably) high bandwidth within a 16-way Nighthawk II node gives it a boost for small processor counts, but the AlphaServer SC continues to have better performance for most processor counts. The IBM results are less noisy than in the 8-byte vector case, and, measured against the 32-processor results, the Nighthawk II scaling seems somewhat improved over the 8-byte vector results. The Nighthawk II results are consistently better than the Winterhawk II results.
Allreduce (SUM) of a REAL*8 vector of length 262144
![]()
![]()
The performance of allreduce on a 2MB vector is dominated by bandwidth. The Nighthawk II system is the best performer, with excellent scaling properties out to 1024 processors. The Winterhawk II and AlphaServer SC systems do not scale as well, but the AlphaServer SC performance does appear to level off for large processor counts.
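The message sizes quoted in the discussion (8 bytes, 8KB, and 2MB) follow directly from the three vector lengths at 8 bytes per REAL*8 element. A small sketch of that arithmetic, with hypothetical helper names `vector_bytes` and `human_size`:

```python
REAL8_BYTES = 8  # a REAL*8 value occupies 8 bytes


def vector_bytes(length):
    """Size in bytes of a REAL*8 vector with `length` elements."""
    return length * REAL8_BYTES


def human_size(nbytes):
    """Format an exact multiple of 1024 or 1024**2 bytes as KB or MB."""
    for unit, scale in (("MB", 1024 ** 2), ("KB", 1024)):
        if nbytes >= scale and nbytes % scale == 0:
            return f"{nbytes // scale} {unit}"
    return f"{nbytes} bytes"


# The three vector lengths used in this experiment.
for length in (1, 1024, 262144):
    print(length, human_size(vector_bytes(length)))
# 1 8 bytes
# 1024 8 KB
# 262144 2 MB
```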