The Intel Paragon is a distributed-memory parallel
architecture built around a high-performance 2D grid interconnect.
The Paragon used in these experiments is a production machine managed
by the Center for Computational Science (CCS) at Oak Ridge National
Laboratory. Processors are not shared, and care was taken to use
partitions of the grid that are isolated from other users,
to eliminate possible contention for bandwidth over shared links of the
interconnect grid.
In these experiments we examine the protocol sensitivity of the NX
communication library. NX is the native communication library for the
Paragon (on top of which MPI is implemented).
We generated Experiment A data twice. For the first set of experiments (A1),
we used a 1x32 processor partition, the bottom
row of the 16x32 processor grid comprising the Paragon used in these
experiments.
This one-dimensional partition allows us to test the one-dimensional
algorithms in exactly the same configurations they would be used in as part
of a two-dimensional data decomposition, both in terms of process placement
and contention. This obviates the need to perform Experiments B and C.
The same 1x32 partition was used for the OSF/MPI experiments.
For the second set of experiments (A2), we used partitions of
size 4x2, 4x4, and 8x4, and used the PSTSWM Fortran FFT routines
instead of the KAI routines used in A1. These correspond to the partitions
and math libraries used in the October 1994 SUNMOS experiments, allowing us
to compare these two sets of data directly.
The most important results from the NX protocol experiments are the following:
  - The choice of communication protocol can be important for optimizing
    performance. For a few algorithms, finding the optimal protocol is necessary
    to achieve good performance. For others, the important performance issue
    is identifying and avoiding "bad" protocols.
  - The choice of protocol is somewhat more important in the small
    granularity cases.
  - The optimal protocol depends on the parallel algorithm, problem
    granularity, and the number of processors (i.e., everything).
  - Overlap techniques are optimal for the ring-based algorithms, and are
    useful for certain (primarily small granularity) cases for some of the other
    algorithms.
  - For all of the algorithms except srtrans and swtrans, the
    different partitions and math routines used in Experiments A1 and A2
    change the magnitude of the performance variation, but not the form.
    Moreover, for each of these algorithms the optimal protocols are essentially
    the same for A1 and A2.
  - For srtrans and swtrans, the performance sensitivities
    differ between A1 and A2, especially for the transpose experiments 2 and 3.
Below, we summarize the parallel algorithm specific results.
To indicate the variation in performance over the set of NX communication
protocols, we give two statistics for each of the Experiment A1 problem cases:
  - the relative difference between the median and minimum timings,
    (med-min)/min;
  - the relative difference between the maximum and minimum timings,
    (max-min)/min.
The data is presented in a table for each parallel algorithm.
The cases are not labelled in the table, but are listed in the following order:
T42 (P=16, 32, 8); T85 (P=16, 32, 8).
For brevity, we also describe the performance sensitivity as low,
moderate, or high if the median-based statistic is <= 5%, between 5% and 15%,
or >= 15%, respectively.
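The two statistics and the low/moderate/high labels can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the code used to produce the tables; the function names and the sample timings are hypothetical.

```python
from statistics import median


def sensitivity(timings):
    """Return ((med-min)/min, (max-min)/min) for one problem case.

    `timings` holds one execution time per communication protocol.
    """
    fastest = min(timings)
    if fastest <= 0:
        raise ValueError("timings must be positive")
    med_rel = (median(timings) - fastest) / fastest
    max_rel = (max(timings) - fastest) / fastest
    return med_rel, max_rel


def classify(med_rel):
    """Label the median-based statistic using the 5% and 15% thresholds."""
    if med_rel <= 0.05:
        return "low"
    if med_rel < 0.15:
        return "moderate"
    return "high"


# Hypothetical timings (seconds) for a single case, one entry per protocol.
example_timings = [10.0, 10.4, 11.0, 11.5, 13.0]
med_rel, max_rel = sensitivity(example_timings)
print(f"(med-min)/min = {med_rel:.2f}, (max-min)/min = {max_rel:.2f}, "
      f"sensitivity = {classify(med_rel)}")
```

Note that the classification uses only the median-based statistic; the maximum-based statistic indicates how costly the worst protocol is.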
The following observations apply to more than one of the algorithm results
and are listed here to cut down on the repetition:
  - The optimal protocols are easily identified for A1.
  - The largest variation is found in the small granularity / large number
    of processors cases, those with the largest theoretical
    communication/computation ratio, except where noted otherwise.
  - The variation is much larger in transpose experiments 2 and 3 than in
    transpose experiment 1 for experiment A1, but is similar for experiment A2.
  - For each algorithm other than srtrans and swtrans, a set
    of protocols can be found that performs well for both experiment A1 and A2.
    For srtrans and swtrans, we report protocols that
    perform well for either A1 or A2, but not necessarily both.
    Note, however, that none of these algorithms perform poorly for either A1 or
    A2.
  - The performance variation is higher for A1 than for A2.
    The degree of increase is a function of the algorithm, but it is generally
    greatest for the small granularity cases.
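One way to look for protocols that perform well in both experiments is to intersect the sets of near-optimal protocols from each. The sketch below assumes that "performs well" means within a fixed relative tolerance of the fastest protocol; the 5% tolerance, the function names, and the timings are assumptions for illustration only (the protocol labels are borrowed from this section, but the numbers are not measured data).

```python
def good_protocols(timings, tolerance=0.05):
    """Protocols whose time is within `tolerance` (relative) of the fastest."""
    fastest = min(timings.values())
    return {name for name, t in timings.items()
            if (t - fastest) / fastest <= tolerance}


def common_good_protocols(a1, a2, tolerance=0.05):
    """Protocols that are near-optimal in both experiments."""
    return good_protocols(a1, tolerance) & good_protocols(a2, tolerance)


# Hypothetical per-protocol timings (seconds) for one problem case.
a1_timings = {"a2": 10.0, "b4": 10.3, "c2": 10.1, "d3": 12.0}
a2_timings = {"a2": 21.5, "b4": 20.0, "c2": 20.4, "d3": 20.2}

print(sorted(common_good_protocols(a1_timings, a2_timings)))
```

When the intersection is empty, as the text reports can happen for srtrans and swtrans, the fallback is to report a separate recommendation for each experiment.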
DFFT
The A1 performance variation is low to moderate:
  (med-min)/min   0.05  0.03  0.05  0.04  0.07  0.03
  (max-min)/min   0.19  0.15  0.11  0.13  0.19  0.08
a2 or c2 are the best protocols for the small granularity
cases.
b4 is a good protocol for the medium and large granularity cases.
EXCHSUM
The A1 performance variation is low in all but one case:
  (med-min)/min   0.05  0.06  0.02  0.02  0.04  0.03
  (max-min)/min   0.18  0.22  0.06  0.07  0.15  0.05
d3 is a good protocol for the small granularity cases.
b4 is a good protocol for the medium and large
granularity cases.
HALFSUM
The A1 performance variation is low:
  (med-min)/min   0.05  0.03  0.02  0.02  0.03  0.02
  (max-min)/min   0.10  0.07  0.04  0.05  0.09  0.04
c2 is a good protocol for the small granularity cases.
b4 is a good protocol for the medium and large granularity cases.
RINGPIPE
The A1 performance variation is moderate to high:
  (med-min)/min   0.25  0.30  0.17  0.21  0.27  0.13
  (max-min)/min   0.64  0.78  0.21  0.26  0.36  0.14
i2 is a good protocol for the small granularity cases.
i3 is a good protocol for the medium and large granularity cases.
RINGSUM
The A1 performance variation is moderate to high for the smallest
granularity cases, and low otherwise:
  (med-min)/min   0.14  0.15  0.02  0.02  0.04  0.01
  (max-min)/min   0.36  0.43  0.03  0.04  0.07  0.02
c2 is a good protocol for all cases.
LOGTRANS (1)
The A1 performance variation of logtrans (1) is low:
  (med-min)/min   0.03  0.02  0.02  0.02  0.01
  (max-min)/min   0.06  0.06  0.10  0.11  0.05
c2 is a good protocol for the small granularity case.
d2 is a good protocol for the medium and large granularity cases.
LOGTRANS (2)
The A1 performance variation of logtrans (2) is low to very high:
  (med-min)/min   0.06  0.04  0.08  0.18  0.51  0.02
  (max-min)/min   0.96  0.89  0.19  0.38  1.00  0.04
The optimal protocol is very important for the large problem size /
large number of processors case.
c2 or d2 are good protocols for all cases.
LOGTRANS (3)
The A1 performance variation of logtrans (3) is moderate to high:
  (med-min)/min   0.27  0.17  0.11  0.07  0.13  0.06
  (max-min)/min   0.50  0.34  0.19  0.14  0.30  0.14
c2 or d2 are good protocols for all cases.
SRTRANS (1)
The A1 performance variation of srtrans (1) is low to moderate:
  (med-min)/min   0.11  0.02  0.03  0.04  0.02
  (max-min)/min   0.29  0.06  0.08  0.12  0.05
e3 or a6 are good protocols for all cases.
SRTRANS (2)
The A1 performance variation of srtrans (2) is low to very high:
  (med-min)/min   0.56  0.43  0.14  0.30  0.74  0.04
  (max-min)/min   1.20  1.13  0.32  0.62  1.42  0.12
The optimal protocol is very important for the large problem size /
large number of processors case.
a6 (for A1) or e3 (for A2) are good protocols for the
small granularity cases.
b2 (for A1) or c2 (for A2) are good protocols for the
medium and large granularity cases.
SRTRANS (3)
The A1 performance variation is moderate to high:
  (med-min)/min   0.35  0.24  0.15  0.09  0.24  0.12
  (max-min)/min   0.73  0.68  0.32  0.17  0.40  0.18
a2 (for A1) or c2 (for A2) are good protocols for 4 of the
6 cases.
a6 (for A1) or e3 (for A2) are good protocols for the
remaining cases.
SWTRANS (1)
The A1 performance variation of swtrans (1) is low to moderate:
  (med-min)/min   0.09  0.01  0.01  0.03  0.02
  (max-min)/min   0.20  0.05  0.06  0.11  0.06
e3 is a good protocol for the smaller granularity cases.
a2 is a good protocol for the larger granularity cases.
SWTRANS (2)
The A1 performance variation of swtrans (2) is low to very high:
  (med-min)/min   0.58  0.40  0.15  0.32  0.79  0.05
  (max-min)/min   1.24  1.01  0.32  0.64  1.48  0.12
The optimal protocol is very important for the large problem size /
large number of processors case.
a6 (for A1) or e3 (for A2) are good protocols for the
small granularity cases.
b0 (for A1) or c2 (for A2) are good protocols for the
medium and large granularity cases.
SWTRANS (3)
The A1 performance variation is moderate to high:
  (med-min)/min   0.31  0.10  0.13  0.09  0.25  0.13
  (max-min)/min   0.65  0.31  0.32  0.15  0.37  0.19
a2 (for A1) or c2 (for A2) are good protocols for all cases.