The IBM SP2 is a distributed-memory parallel
architecture utilizing high-end workstation-class processors
interconnected by an omega switch.
The SP2 used in these experiments was a production machine sited
at NASA Ames Research Center.
While processors were not shared, the processor configuration and the
effect of other users on interprocessor communication performance
were not directly controllable. To minimize these effects, runs were repeated
multiple times and only timing runs not showing significant perturbations
were used in this analysis. Perturbations affecting individual protocol
timings were not eliminated, however, which adds to the "maximum" observed
performance variation.
In these experiments, we examine the protocol sensitivity of the MPL
communication library. MPL was the original native communication library for
the SP2. It has since been "replaced" by MPI, but is still interesting for
the differences in performance between the MPI and MPL libraries, given
their outward similarities.
The results described here come from legacy data collected over two years
ago. The particular SP2 used has been dismantled, and these results do not
reflect the most recent versions of the SP2 architecture.
At the time of these experiments, we were running only Experiment A.
We have two sets of Experiment A data, one collected in March 1996
and one collected in July 1996. These differ in that the March 1996 runs
do not use the ESSL routines for the Fourier transforms, while
the July 1996 runs do. Including the math library
routines decreases the "granularity" of some of the experiments, but should
not change the optimal protocols. Neither data set includes 32-processor
runs, and the July data set covers only 16-processor runs.
The most important results from the MPL protocol experiments are as follows.
The choice of communication protocol is important for optimizing
performance. In particular, finding the optimal protocol may be necessary to
achieve good performance for some of the parallel algorithms.
The choice of protocol is generally more important in the small
granularity cases, but is useful for all cases and all algorithms.
The optimal protocol is a function of the problem size,
number of processors, and parallel algorithm, but
most optimal protocols use either nonblocking send or native
sendrecv.
Overlap techniques are important for srtrans, swtrans,
and ringpipe, with send-ahead being optimal
when sending small messages and for all cases when using ringpipe.
Overlap techniques are not useful for exchsum, halfsum, and
logtrans.
Below, we summarize the parallel algorithm specific results.
To indicate the variation in performance over the set of MPL communication
protocols, we give two statistics for each of the March 22nd Experiment A
problem cases: the relative difference between the median and minimum
timings, (med-min)/min, and the relative difference between the maximum
and minimum timings, (max-min)/min.
The data is presented in a table for each parallel algorithm.
The cases are not labelled in the table, but are listed in the following order:
T42 (P=16, 8); T85 (P=16, 8).
For brevity, we also describe the performance sensitivity to be low,
moderate, or high if the median-based statistic is <= 5%, between 5% and 15%,
or >= 15%, respectively.
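The two statistics and the low/moderate/high labels can be computed as in the following minimal sketch. The function names and the timing values are hypothetical and are not taken from the experiment data; only the formulas and the 5% and 15% thresholds come from the text above.

```python
from statistics import median


def variation_stats(timings):
    """Return ((med-min)/min, (max-min)/min) for one problem case.

    `timings` holds one execution time per communication protocol.
    """
    fastest = min(timings)
    med_stat = (median(timings) - fastest) / fastest
    max_stat = (max(timings) - fastest) / fastest
    return med_stat, max_stat


def sensitivity(med_stat):
    """Label a median-based statistic using the thresholds in the text."""
    if med_stat <= 0.05:
        return "low"
    if med_stat >= 0.15:
        return "high"
    return "moderate"


# Hypothetical timings in seconds, one per protocol (not measured data).
timings = [1.00, 1.04, 1.10, 1.25, 1.80]
med_stat, max_stat = variation_stats(timings)
print(f"(med-min)/min = {med_stat:.2f}, (max-min)/min = {max_stat:.2f}, "
      f"sensitivity = {sensitivity(med_stat)}")
```

With these made-up timings, the single slow protocol dominates the maximum-based statistic while leaving the median-based one nearly untouched, which is why the labels are defined on the median.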
The following observations apply to all of the algorithm results
and are listed here to cut down on the repetition:
The worst case differences can be misleading because of the
timing perturbations that were common on the SP2. The median statistics
are much more reliable.
Even using the median statistics, the performance variation
for the July data set is somewhat to much higher than that of the
March data set. It appears that more of the July data was contaminated by
other activity in the system. In comparison, the March data looks very
"clean", with respect to independent indicators of outside influence.
Many of the protocols perform significantly
worse than the optimum, and the optima are easily identified.
DFFT
The performance variation is moderate to high:
  (med-min)/min   0.24   0.09   0.09   0.09
  (max-min)/min   0.87   0.23   0.24   0.29
The March and July data indicate the same set of good protocols.
a6 is a good protocol for the small granularity case.
a1 is a good protocol for the medium granularity cases.
c1 is a good protocol for the large granularity case.
EXCHSUM
The performance variation is moderate to high:
  (med-min)/min   0.29   0.16   0.20   0.12
  (max-min)/min   0.65   0.41   0.52   0.32
The March and July data indicate the same set of good protocols.
a1 and a6 are good protocols for all cases.
HALFSUM
The performance variation is moderate to high:
  (med-min)/min   0.21   0.12   0.14   0.08
  (max-min)/min   0.44   0.62   0.32   0.20
The March and July data indicate the same set of good protocols.
a1 and a6 are good protocols for all cases.
RINGPIPE
The performance variation is high:
  (med-min)/min   0.27   0.23   0.27   0.16
  (max-min)/min   1.22   1.66   0.92   0.46
The March and July data indicate somewhat different sets of good
protocols.
c2, c3, or i1 are good protocols for all cases
(for both the March and July data).
RINGSUM
The performance variation is moderate to high:
  (med-min)/min   0.66   0.18   0.29   0.09
  (max-min)/min   1.08   0.31   1.17   0.25
The March and July data indicate different optima, but drawn from the
same set of good protocols.
a1 and c1 are good protocols for all cases.
LOGTRANS (1)
The performance variation is moderate to high:
  (med-min)/min   0.11   0.14   0.16   0.12
  (max-min)/min   0.34   0.45   0.78   0.18
The March and July data indicate the same set of good protocols.
a1 or a6 are good protocols for all cases.
LOGTRANS (2)
The performance variation is moderate to high:
  (med-min)/min   0.14   0.15   0.16   0.14
  (max-min)/min   0.45   0.27   0.39   0.26
The March and July data indicate the same set of good protocols.
a1 or a6 are good protocols for all cases.
LOGTRANS (3)
The performance variation is moderate to high:
  (med-min)/min   0.41   0.14   0.13   0.08
  (max-min)/min   2.11   0.52   0.90   0.43
The March and July data indicate the same set of good protocols.
a1 or a6 are good protocols for all cases.
SRTRANS (1)
The performance variation is moderate to high:
  (med-min)/min   0.54   0.17   0.35   0.09
  (max-min)/min   1.39   2.16   1.03   0.31
The March and July data indicate the same set of good protocols.
a6 or e1 are good protocols for all cases.
SRTRANS (2)
The performance variation is moderate to high:
  (med-min)/min   0.44   0.22   0.27   0.10
  (max-min)/min   0.76   0.78   0.89   0.34
The March and July data indicate the same set of good protocols.
e1 or e3 are good protocols for all cases.
SRTRANS (3)
The performance variation is moderate to very high:
  (med-min)/min   0.86   0.20   0.19   0.06
  (max-min)/min   1.91   0.43   0.39   0.22
The March and July data indicate the same set of good protocols.
e1 or e3 are good protocols for all cases.
SWTRANS (1)
The performance variation is moderate to high:
  (med-min)/min   0.42   0.19   0.26   0.09
  (max-min)/min   0.70   0.36   0.38   0.23
The March and July data indicate the same set of good protocols.
e1, e3 or a1 are good protocols for all cases.
SWTRANS (2)
The performance variation is moderate to high:
  (med-min)/min   0.38   0.13   0.23   0.11
  (max-min)/min   1.38   0.34   0.93   0.26
The March and July data indicate the same set of good protocols.
e1, e3 or a3 are good protocols for all cases.
SWTRANS (3)
The performance variation is moderate to very high:
  (med-min)/min   1.13   0.19   0.20   0.06
  (max-min)/min   3.40   0.38   0.41   0.17
The March and July data indicate the same set of good protocols.
e1, e3 or a3 are good protocols for all cases.
In addition to those mentioned earlier,
some general rules of thumb appear to apply.
The all-to-all algorithms srtrans and swtrans have the
same sensitivities and similar optimal protocols.
The binary tree algorithms exchsum, halfsum and
logtrans have the same set of optimal protocols (a1 and
a6), and similar sensitivities.
The sensitivities and protocols for dfft are also similar to those of
exchsum, etc., except for the inclusion of the overlap protocol
c1 for the largest granularity case.
Unlike many of the other platforms/libraries, the ring algorithms
ringpipe and ringsum do not have similar optimal protocols
except in a very rough sense.
The (median) variation is "mildly" proportional to the
inverse of the problem granularity for all except the logtrans
and ringpipe algorithms. The particulars of the relationship vary
between the algorithms.
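As a cross-check on the per-algorithm sensitivity labels, the median-based statistics can be classified with the 5% and 15% thresholds defined earlier. The sketch below is illustrative only: the dictionary and helper names are assumptions, the values are transcribed from the tables in this section, and only three algorithms are included for brevity.

```python
# Median-based statistics, (med-min)/min, transcribed from the tables above.
# Case order: T42 (P=16, 8); T85 (P=16, 8).
MEDIAN_STATS = {
    "dfft": [0.24, 0.09, 0.09, 0.09],
    "ringpipe": [0.27, 0.23, 0.27, 0.16],
    "swtrans (3)": [1.13, 0.19, 0.20, 0.06],
}

ORDER = ["low", "moderate", "high"]


def sensitivity(med_stat):
    """Label one median-based statistic: <= 5% low, >= 15% high."""
    if med_stat <= 0.05:
        return "low"
    if med_stat >= 0.15:
        return "high"
    return "moderate"


def sensitivity_range(values):
    """Summarize the labels over all cases, e.g. 'moderate to high'."""
    labels = {sensitivity(v) for v in values}
    present = [name for name in ORDER if name in labels]
    if len(present) == 1:
        return present[0]
    return f"{present[0]} to {present[-1]}"


for algorithm, values in MEDIAN_STATS.items():
    print(f"{algorithm}: {sensitivity_range(values)}")
```

The thresholds define no "very high" category, so this sketch reports swtrans (3) as "moderate to high" even though its largest median-based statistic (1.13) is far above 15%.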