ALLREDUCE
The following performance data were collected by Patrick H. Worley on the AlphaServer SC and IBM SP systems at Oak Ridge National Laboratory and on the IBM SP Phase II system at the National Energy Research Scientific Computing Center during the summer of 2001 using a code designed to measure the performance of the allreduce collective communication operator. The ORNL SP system uses the "SP Switch" and one SPSMX2 switch adapter per 4-way Winterhawk II SMP node, while the NERSC SP system uses the "SP Switch2" (Colony switch) and two TB3PCI switch adapters per 16-way Nighthawk II SMP node. The AlphaServer SC uses the Quadrics interconnect and one switch adapter per 4-way ES40 SMP node. For the Compaq system (the AlphaServer SC), results are presented for both version 1.0 and version 2.0 of the system software. Version 2.0 includes additional optimizations of the MPI collective communication operators.

The test code measures the performance of the native MPI_Allreduce and a number of alternative implementations under a variety of conditions. For the first two experiments, performance is reported for the fastest time observed when executing a single allreduce. Many experiments were run, and the optimal timings appear to be relatively free from effects due to unrelated system interrupts. Results are presented in terms of the number of allreduces per second, that is, the inverse of the measured times.

Experiments 3 and 4 examine the optimal observed performance for MPI_ALLREDUCE. Each run of the test code times MPI_ALLREDUCE 30 times for each processor or node count and for each vector length, and the test code is run three times for each experiment, giving 90 timings per configuration. This "oversampling" should provide a large enough sample to determine optimal MPI_ALLREDUCE performance.

Experiments 5 and 6 examine the average observed performance for MPI_ALLREDUCE. The oversampling should also provide a large enough sample to gauge typical performance.
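
The reduction from raw timings to the reported rates can be illustrated with a short sketch. It is not the test code itself: the function name rates_from_timings and the sample timings are hypothetical, and it assumes that the "average" rate is the inverse of the mean measured time, which the description above does not spell out.

```python
from statistics import mean


def rates_from_timings(runs):
    """Reduce per-run timing lists (seconds) to allreduce rates (per second).

    `runs` holds one list of measured times per run of the test code
    (in the experiments described above: 3 runs of 30 timings each).
    Returns (optimal_rate, average_rate).
    """
    # Pool the samples from every run for this configuration.
    samples = [t for run in runs for t in run]
    if not samples or min(samples) <= 0:
        raise ValueError("need positive timing samples")
    # Optimal performance: invert the fastest observed time.
    optimal_rate = 1.0 / min(samples)
    # Average performance. Assumption: invert the mean time, rather than
    # averaging the per-sample rates.
    average_rate = 1.0 / mean(samples)
    return optimal_rate, average_rate


# Hypothetical timings for one processor count and vector length.
# The 0.004 s entry stands in for a sample disturbed by a system interrupt.
example_runs = [
    [0.002, 0.001, 0.004],
    [0.001, 0.002, 0.002],
]
best, typical = rates_from_timings(example_runs)
print(f"optimal: {best:.1f} allreduces/s, average: {typical:.1f} allreduces/s")
```

With these made-up numbers the optimal rate is about 1000 allreduces per second and the average rate about 500. Taking the minimum discards the disturbed sample entirely, while the mean retains it, which is why the two statistics are reported in separate experiments.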