The Intel Paragon is a distributed-memory parallel
architecture built around a high-performance 2D grid interconnect.
The Paragon used in these experiments was a production machine managed
by Sandia National Laboratories. Processors were not shared, and care was taken
to use partitions of the grid that were isolated from other users,
to eliminate possible contention for bandwidth over shared links of the
interconnect grid.
In these experiments we examine the protocol sensitivity of interprocessor
communication when using the SUNMOS operating system, which was developed at
Sandia and the University of New Mexico. While both NX and MPI libraries are
available for SUNMOS, we used the low-level message passing primitives
provided by SUNMOS (_nsend and _nrecv) for this study.
The results described here come from legacy data collected over three years
ago. The particular Paragon used has been dismantled, but these results
should still accurately reflect the interprocessor communication performance
of existing Paragon systems running SUNMOS. (The platform hardware has not
changed over this period.)
The data is particularly interesting as a way of measuring the impact of
different approaches to parallel operating systems and interprocessor
communication (OSF vs. SUNMOS) on the same hardware.
At the time of these experiments, we were running only Experiment A.
We also did not attempt to use 1D partitions that would reflect
the processors used in a 2D parallelization. Partitions for these
experiments typically had square or near-square aspect ratios.
The most important results from the SUNMOS protocol experiments are as follows.
The choice of communication protocol is not very important for
optimizing performance. With a few exceptions, the choice of protocol is
important only in the smallest granularity cases. Even in these cases, it is
primarily important to identify the poorly performing protocols.
For the distributed algorithms, the optimal protocols are easily
identified, even though the minimum is "weak".
For the transpose algorithms, it is more difficult to identify "optimal"
protocols, but the minimum is weak enough that it is not very important.
There are exceptions to this rule though, and some care should be taken.
For the most part, the optimal protocols for the distributed
algorithms and for logtrans are (1) synchronous or (2) overlap
protocols, both with and without the ready send handshaking protocols. (Ready
send is not an option with SUNMOS, but the handshaking protocol can still be
used.)
Overlap techniques are optimal for at least some of the cases for all of
the algorithms, but are really only important for the ring algorithms and
for the binary tree algorithms (exchsum, halfsum, and
logtrans).
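The handshaking protocol mentioned above can be pictured with a small sketch. It assumes the usual meaning of the term: the receiver posts its receive and then sends a short "ready" token, and the sender holds the data until that token arrives. The code is an analogy built from Python threads and queues, not SUNMOS code; the function name `handshake_transfer` and the event labels are hypothetical.

```python
import queue
import threading


def handshake_transfer(payload):
    """Move payload from a sender thread to a receiver thread using a
    ready-token handshake, and record the order of the protocol steps."""
    ready = queue.Queue()   # carries the receiver's "ready" token
    data = queue.Queue()    # carries the actual message
    events = []             # order in which protocol steps occurred
    received = []

    def receiver():
        events.append("receive posted")
        ready.put(None)                # tell the sender the receive is posted
        received.append(data.get())    # block until the message arrives
        events.append("data received")

    def sender():
        ready.get()                    # wait for the receiver's token
        events.append("data sent")
        data.put(payload)

    threads = [threading.Thread(target=sender),
               threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received[0], events


message, order = handshake_transfer([1.0, 2.0, 3.0])
print(message, order)
```

The sender thread is started first on purpose: the token still forces the send to happen after the receive has been posted, which is the guarantee the handshake is meant to provide.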
Below, we summarize the parallel algorithm specific results.
To indicate the variation in performance over the set of MPI communication
protocols, we give the relative difference between the median and minimum
timings, and the relative difference between the maximum and minimum timings,
for each of the Experiment A problem cases.
The data is presented in a table for each parallel algorithm.
The cases are not labelled in the table, but are listed in the following order:
T42 (P=16, 32, 8); T85 (P=16, 32, 8).
For brevity, we also describe the performance sensitivity as low,
moderate, or high if the median-based statistic is <= 5%, between 5% and 15%,
or >= 15%, respectively.
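As an illustration of how these two statistics and the low/moderate/high labels are obtained for a single problem case, the following is a minimal sketch. The function names and the timings are hypothetical and are not taken from the experiments; only the formulas and the thresholds come from the text above.

```python
from statistics import median


def protocol_spread(timings):
    """Return ((med-min)/min, (max-min)/min) for one problem case,
    given one timing per communication protocol."""
    fastest = min(timings)
    med_rel = (median(timings) - fastest) / fastest
    max_rel = (max(timings) - fastest) / fastest
    return med_rel, max_rel


def sensitivity(med_rel):
    """Label the median-based statistic using the thresholds in the text."""
    if med_rel <= 0.05:
        return "low"
    if med_rel >= 0.15:
        return "high"
    return "moderate"


# Hypothetical timings in seconds, one per protocol, for a single case.
timings = [1.00, 1.04, 1.10, 1.30]
med_rel, max_rel = protocol_spread(timings)
print(f"{med_rel:.2f} {max_rel:.2f} {sensitivity(med_rel)}")
```

Normalizing by the minimum timing expresses both statistics as a fractional slowdown relative to the best protocol, which is why values can be compared across cases of very different absolute run time.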
DFFT
The performance variation is low (in the median):
  (med-min)/min   0.04   0.04   0.04   0.03   0.02   0.04
  (max-min)/min   0.08   0.08   0.06   0.06   0.05   0.07
and the majority of the protocols perform very well.
b6 is the best protocol for the small granularity cases.
c5 is a good protocol for the medium and large granularity cases.
EXCHSUM
The performance variation is low to moderate:
  (med-min)/min   0.06   0.09   0.02   0.03   0.05   0.02
  (max-min)/min   0.12   0.21   0.05   0.10   0.19   0.05
and the majority of the protocols perform well.
c5 is the best protocol for the small and medium granularity
cases.
a4 is a good protocol for the large granularity cases.
HALFSUM
The performance variation is (very) low:
  (med-min)/min   0.04   0.03   0.00   0.00   0.02   0.01
  (max-min)/min   0.07   0.10   0.03   0.05   0.10   0.03
and the majority of the protocols perform very well.
c3 is a good protocol for the small and medium granularity
cases.
c5 is the best protocol for the large granularity case.
RINGPIPE
The performance variation is low to moderate:
  (med-min)/min   0.10   0.09   0.02   0.04   0.06   0.01
  (max-min)/min   0.28   0.37   0.04   0.06   0.12   0.03
and the majority of the protocols perform very well.
However, some care must be taken to avoid poorly performing protocols for the
small granularity cases.
i3 or c4 are good protocols for all cases.
RINGSUM
The performance variation is low to moderate:
  (med-min)/min   0.06   0.08   0.01   0.03   0.07   0.00
  (max-min)/min   0.17   0.28   0.03   0.05   0.09   0.03
c2 is a good protocol for all cases.
LOGTRANS (1)
The performance variation for logtrans (1) is very low:
  (med-min)/min   0.02   0.01   0.01   0.01   0.02
  (max-min)/min   0.05   0.05   0.06   0.06   0.05
and the majority of the protocols perform very well.
c2 is a good protocol for the small and medium granularity cases.
c5 is a good protocol for the large granularity case.
LOGTRANS (2)
The performance variation for logtrans (2) is very low:
  (med-min)/min   0.02   0.01   0.00   0.01   0.01   0.02
  (max-min)/min   0.06   0.03   0.07   0.08   0.08   0.07
and the majority of the protocols perform very well.
b6 is a good protocol for the small granularity cases.
c2 is a good protocol for the medium granularity cases.
c5 is a good protocol for the large granularity case.
LOGTRANS (3)
The performance variation for logtrans (3) is very low:
  (med-min)/min   0.02   0.02   0.01   0.01   0.01   0.00
  (max-min)/min   0.04   0.04   0.04   0.04   0.04   0.04
and the majority of the protocols perform very well.
b6 or c2 are good protocols for all cases.
SRTRANS (1)
The performance variation for srtrans (1) is extremely low:
  (med-min)/min   0.03   0.01   0.01   0.03   0.01
  (max-min)/min   0.14   0.04   0.05   0.07   0.04
and the majority of the protocols perform very well.
a1 or c2 are good protocols for the small and medium
granularity cases.
a5 is a good protocol for the large granularity case.
SRTRANS (2)
The performance variation for srtrans (2) is low to moderate:
  (med-min)/min   0.04   0.07   0.01   0.02   0.03   0.01
  (max-min)/min   0.12   0.22   0.05   0.07   0.07   0.05
and the majority of the protocols perform very well.
However, some care must be taken to avoid poorly performing protocols for the
small granularity cases.
c2 or e2 are good protocols for 4 of the 6 (small
granularity) cases.
a1 or a5 are good protocols for the other 2 cases.
SRTRANS (3)
The performance variation for srtrans (3) is moderate for the
small granularity cases, and very low otherwise:
  (med-min)/min   0.06   0.12   0.01   0.01   0.01   0.01
  (max-min)/min   0.22   0.37   0.03   0.02   0.04   0.02
and the majority of the protocols perform very well.
However, some care must be taken to avoid poorly performing protocols for the
small granularity cases.
a1 or e2 are good protocols for 4 of the 6 (small
granularity) cases.
a5 or c2 are good protocols for the other 2 cases.
SWTRANS (1)
The performance variation for swtrans (1) is extremely low:
  (med-min)/min   0.03   0.01   0.01   0.03   0.01
  (max-min)/min   0.09   0.04   0.05   0.06   0.04
and the majority of the protocols perform very well.
a0 or b6 are good protocols for all cases.
SWTRANS (2)
The performance variation for swtrans (2) is extremely low
for all but one small granularity case, in which it is moderate:
  (med-min)/min   0.02   0.08   0.01   0.02   0.02   0.01
  (max-min)/min   0.07   0.18   0.06   0.08   0.08   0.05
and the majority of the protocols perform very well.
a0 or b6 are good protocols for all cases.
SWTRANS (3)
The performance variation for swtrans (3) is low to high:
  (med-min)/min   0.06   0.11   0.08   0.19   0.01   0.22
  (max-min)/min   1.11   0.24   0.11   0.21   0.07   0.25
and very erratic.
The majority of protocols perform significantly worse than optimum for some
of the cases.