The Intel Paragon is a distributed-memory parallel
architecture built around a high-performance 2D grid interconnect.
The Paragon used in these experiments is a production machine managed
by the Center for Computational Science (CCS) at Oak Ridge National
Laboratory. Processors are not shared, and care was taken to use
partitions of the grid that are isolated from other users,
to eliminate possible contention for bandwidth over shared links of the
interconnect grid.
For these experiments we used a 1x32 processor partition: the bottom
row of the machine's 16x32 processor grid.
This one-dimensional partition lets us test the one-dimensional
algorithms in exactly the configurations in which they would be used as part
of a two-dimensional data decomposition, in terms of both process placement
and contention. This obviates the need to perform Experiments B and C.
The most important results from the MPI protocol experiments are the
following:
The choice of communication protocol is important for optimizing
performance. For some algorithms, finding the optimal protocol is necessary
to achieve good performance. For others, the important performance issue
is identifying and avoiding "bad" protocols.
The choice of protocol is most important in the small
granularity cases, but is still important in certain of the large granularity
cases.
Certain protocols should be avoided, especially those employing the
buffered send.
For the most part, the optimal protocols are (1) ordered (simple or
synchronous), not attempting to exploit bidirectional communication, or
(2) overlap protocols using nonblocking receive.
Overlap techniques are optimal for the ring-based algorithms, and are
useful for certain (primarily small granularity) cases for many of the other
algorithms.
Below, we summarize the parallel algorithm specific results.
To indicate the variation in performance over the set of MPI communication
protocols, we give two statistics for each of the Experiment A problem cases:
the relative difference between the median and minimum timings,
(med-min)/min, and the relative difference between the maximum and minimum
timings, (max-min)/min.
The data is presented in a table for each parallel algorithm.
The cases are not labelled in the table, but are listed in the following order:
T42 (P=16, 32, 8); T85 (P=16, 32, 8).
For brevity, we also describe the performance sensitivity as low,
moderate, or high if the median-based statistic is <= 5%, between 5% and 15%,
or >= 15%, respectively.
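The two statistics and the low/moderate/high labels can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function names `variation_stats` and `sensitivity` and the example timings are hypothetical, and the thresholds are the 5% and 15% values stated above.

```python
from statistics import median

def variation_stats(timings):
    """Return ((med-min)/min, (max-min)/min) for the timings of one problem case.

    `timings` holds one execution time per MPI communication protocol.
    """
    t_min = min(timings)
    t_med = median(timings)
    t_max = max(timings)
    return (t_med - t_min) / t_min, (t_max - t_min) / t_min

def sensitivity(med_stat):
    """Label the median-based statistic using the thresholds in the text."""
    if med_stat <= 0.05:
        return "low"
    if med_stat < 0.15:
        return "moderate"
    return "high"

# Hypothetical timings (seconds) for five protocols on a single case;
# these are NOT measurements from the experiments.
example_timings = [1.00, 1.02, 1.10, 1.25, 1.60]
med_stat, max_stat = variation_stats(example_timings)
label = sensitivity(med_stat)
```

For these made-up timings the median is 10% slower than the minimum, which falls in the moderate range, while the slowest protocol is 60% slower than the fastest.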
The following observations apply to all of the algorithm results
and are listed here to avoid repetition:
The optimal protocols are easily identified.
The largest variation is found in the small granularity / large number
of processors cases, those with the largest theoretical
communication/computation ratio, except where noted otherwise.
The variation is much larger in transpose experiments 2 and 3 than in
transpose experiment 1.
DFFT
The performance variation is low to moderate (in the median):
  (med-min)/min   0.10  0.05  0.07  0.05  0.09  0.04
  (max-min)/min   0.54  0.53  0.54  0.54  0.74  0.41
b0 is the best protocol for the small granularity cases.
b6 is the best protocol for the large granularity cases.
EXCHSUM
The performance variation is low:
  (med-min)/min   0.05  0.05  0.02  0.02  0.04  0.02
  (max-min)/min   0.80  1.13  0.36  0.54  1.03  0.29
and the majority of the protocols perform very well.
However, the poorly performing protocols perform very badly: in the worst
case the slowest protocol takes more than twice as long as the fastest.
d3 is a good protocol for the small granularity cases.
b6 is a good protocol for the medium and large
granularity cases.
HALFSUM
The performance variation is low:
  (med-min)/min   0.05  0.05  0.01  0.02  0.03  0.01
  (max-min)/min   0.44  0.58  0.21  0.29  0.52  0.15
and the majority of the protocols perform very well.
c2 is the best protocol for the small granularity cases.
b6 is a good protocol for the medium and large
granularity cases.
RINGPIPE
The performance variation is high in all but the largest granularity case:
  (med-min)/min   0.29  0.26  0.17  0.20  0.24  0.13
  (max-min)/min   0.81  0.90  0.36  0.46  0.65  0.27
i3 is a good protocol for all cases.
RINGSUM
The performance variation is high for the small granularity cases, and
low otherwise:
  (med-min)/min   0.15  0.22  0.02  0.03  0.05  0.01
  (max-min)/min   0.50  0.64  0.22  0.30  0.47  0.15
c2 is a good protocol for all cases.
LOGTRANS (1)
The performance variation for logtrans (1) is low:
  (med-min)/min   0.02  0.02  0.02  0.03  0.01
  (max-min)/min   0.31  0.34  0.49  0.67  0.28
and the majority of the protocols perform very well.
d2 is a good protocol for all cases.
LOGTRANS (2)
The performance variation for logtrans (2) is low for the large
granularity case, and moderate to high otherwise:
  (med-min)/min   0.16  0.18  0.10  0.18  0.49  0.01
  (max-min)/min   1.03  0.89  0.41  0.51  0.96  0.36
The optimal protocol is very important for the large problem size /
large number of processors case.
c2 or c3 are good protocols for all cases.
LOGTRANS (3)
The performance variation for logtrans (3) is moderate to high:
  (med-min)/min   0.26  0.17  0.11  0.06  0.17  0.09
  (max-min)/min   0.51  0.40  0.26  0.21  0.33  0.17
c2 or c3 are good protocols for all cases.
SRTRANS (1)
The performance variation for srtrans (1) is low to moderate:
  (med-min)/min   0.11  0.02  0.04  0.07  0.02
  (max-min)/min   0.33  0.20  0.22  0.33  0.19
c2 is a good protocol for the small and medium granularity cases.
d6 is a good protocol for the large granularity cases.
SRTRANS (2)
The performance variation for srtrans (2) is low for the large
granularity case, and moderate to very high otherwise:
  (med-min)/min   0.21  0.38  0.13  0.26  0.71  0.03
  (max-min)/min   1.14  1.14  0.27  0.54  1.69  0.24
It also shows tremendous performance spikes for most cases.
The optimal protocol is very important for the large problem size /
large number of processors case.
a3 is a good protocol for 4 of the 6 (small granularity) cases.
b0 is a good protocol for the other 2 cases.
SRTRANS (3)
The performance variation for srtrans (3) is moderate to high:
  (med-min)/min   0.22  0.23  0.15  0.10  0.24  0.05
  (max-min)/min   0.81  0.76  0.29  0.21  0.45  0.15
a3 is a good protocol for the small and medium granularity cases.
d6 is a good protocol for the large granularity cases.
SWTRANS (1)
The performance variation for swtrans (1) is low to moderate:
  (med-min)/min   0.09  0.01  0.02  0.05  0.01
  (max-min)/min   0.23  0.20  0.22  0.31  0.19
c2 is a good protocol for the small and medium granularity cases.
d6 is a good protocol for the large granularity cases.
SWTRANS (2)
The performance variation for swtrans (2) is low for the large
granularity case, and moderate to very high otherwise:
  (med-min)/min   0.20  0.36  0.14  0.28  0.73  0.03
  (max-min)/min   1.06  0.91  0.28  0.52  1.75  0.24
It also shows tremendous performance spikes for most cases.
The optimal protocol is very important for the large problem size /
large number of processors case.
b0 is a good protocol in all cases.
SWTRANS (3)
The performance variation for swtrans (3) is moderate to high: