The SGI/Cray Research Origin2000 is a Nonuniform Memory Access (NUMA)
shared memory architecture. The SHMEM library is built on top
of the shared memory. We use the SHMEM one-sided communication
operations (get and put) to construct a simple message-passing
library that is suitable for the PSTSWM application; it does not implement
the more general (and complex) semantics of standard message-passing
libraries such as MPI and PVM.
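To make the idea of layering message passing on one-sided operations concrete, the following is a minimal sketch in Python that simulates it with threads. It is not the library used in these experiments, and it does not reproduce any of the protocols evaluated here. The names `PE`, `put`, `send`, and `recv` are hypothetical, a Python list stands in for a remotely writable buffer, and a `threading.Event` stands in for a remotely writable flag word. The sketch assumes at most one message in flight per receiver.

```python
import threading

class PE:
    """A simulated processing element with remotely writable memory."""
    def __init__(self, nwords):
        self.buf = [0.0] * nwords        # stands in for a shared data buffer
        self.ready = threading.Event()   # stands in for a shared flag word

def put(target, data):
    """One-sided write: the target takes no part in the transfer."""
    if len(data) > len(target.buf):
        raise ValueError("message does not fit in the target buffer")
    target.buf[:len(data)] = data

def send(target, data):
    put(target, data)      # deposit the payload first
    target.ready.set()     # then raise the flag; on real hardware an ordering
                           # guarantee is needed between these two writes

def recv(me, nwords):
    me.ready.wait()        # block until the sender's flag arrives
    msg = me.buf[:nwords]  # copy the payload out of the shared buffer
    me.ready.clear()       # mark the buffer as free for the next message
    return msg

# A thread plays the sending process; the main thread receives.
pe1 = PE(nwords=4)
sender = threading.Thread(target=send, args=(pe1, [1.0, 2.0, 3.0]))
sender.start()
received = recv(pe1, 3)
sender.join()
print(received)  # [1.0, 2.0, 3.0]
```

The receiver never posts a matching call before the data arrives; it only checks a flag. Avoiding that matching step is the kind of simplification that a restricted, application-specific library can make relative to general message-passing semantics.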
The Origin2000s used in these experiments are located
at Los Alamos National Laboratory, where they are being used to construct
a very large parallel system. The system software changes
more rapidly than at most other Origin2000 sites, so the useful "lifespan"
of our experimental data may be relatively short.
The most important results from the protocol experiments are as follows.
The choice of communication protocol can be important for optimizing
performance.
The optimal choice is a (complicated) function of the problem size and
number of processors.
Overlap techniques are useful for most algorithms.
For the ring algorithms, all optimal protocols use overlap techniques.
Experiments B and C attempt to introduce contention and realistic
process placement.
Experiments B and C show less relative protocol sensitivity than
Experiment A for some of the parallel algorithms, probably because the
two-dimensional data decomposition increases the overall runtime.
However, in some cases only one of Experiments B and C resembles
Experiment A, so the full explanation may be more complicated.
In the following, we summarize the results specific to each parallel algorithm.
To indicate the variation in performance over the set of SHMEM communication
protocols, we give, for each of the Experiment A problem cases, the
relative difference between the median and minimum timings, (med-min)/min,
and the relative difference between the maximum and minimum timings,
(max-min)/min.
The data is presented in a table for each parallel algorithm.
The cases are not labelled in the table, but are listed in the following order:
T42 (P=16, 32, 8); T85 (P=16, 32, 8).
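As a minimal sketch of the two statistics defined above, the following Python function computes them from the timings of one problem case, with one timing per protocol. The function name and the sample timings are hypothetical; they are not measurements from these experiments.

```python
from statistics import median

def relative_spread(timings):
    """Return ((med-min)/min, (max-min)/min) for one problem case.

    `timings` holds one runtime per communication protocol.
    """
    if not timings:
        raise ValueError("need at least one timing")
    t_min = min(timings)
    if t_min <= 0:
        raise ValueError("timings must be positive")
    med_stat = (median(timings) - t_min) / t_min
    max_stat = (max(timings) - t_min) / t_min
    return med_stat, max_stat

# Hypothetical runtimes (seconds) for five protocols on one case.
example = [10.0, 10.5, 11.0, 12.0, 14.0]
med_stat, max_stat = relative_spread(example)
print(f"(med-min)/min = {med_stat:.2f}, (max-min)/min = {max_stat:.2f}")
# (med-min)/min = 0.10, (max-min)/min = 0.40
```

The median-based value indicates how much a typical protocol loses relative to the best one, while the maximum-based value bounds the cost of the worst choice.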
For brevity, we also describe the performance sensitivity to be low,
moderate, or high if the median-based statistic is <= 5%, between 5% and 15%,
or >= 15%, respectively.
Note that all timings were taken on a dedicated system, and
performance variation is not a function of interaction with other users.
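The low, moderate, and high labels can be reproduced mechanically from the tabulated values. The sketch below is one way to do so, assuming the thresholds are applied to the rounded values as printed; the function names are hypothetical. Because the tables show rounded numbers, a label derived this way could differ from one based on the unrounded timings near the 5% and 15% boundaries.

```python
def sensitivity(med_stat):
    """Map a (med-min)/min value to the low/moderate/high labels of the text."""
    if med_stat <= 0.05:
        return "low"
    if med_stat >= 0.15:
        return "high"
    return "moderate"

def describe(row):
    """Summarize the labels of one table row, e.g. 'moderate to high'."""
    order = ["low", "moderate", "high"]
    ranks = [order.index(sensitivity(v)) for v in row]
    lo, hi = order[min(ranks)], order[max(ranks)]
    return lo if lo == hi else f"{lo} to {hi}"

# (med-min)/min rows copied from the DFFT and EXCHSUM tables.
dfft = [0.07, 0.12, 0.10, 0.13, 0.12, 0.08]
exchsum = [0.17, 0.14, 0.10, 0.13, 0.14, 0.07]
print(describe(dfft))     # moderate
print(describe(exchsum))  # moderate to high
```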
DFFT
The performance variation for Experiment A is moderate:
  (med-min)/min   0.07  0.12  0.10  0.13  0.12  0.08
  (max-min)/min   0.17  0.85  0.33  0.33  0.35  0.24
The variation is very similar across Experiments A, B, and C,
but is somewhat higher for Experiments B and C.
The optimal protocols are clearly separated from the rest. The
set of good protocols is consistent across Experiments A, B, and C, but
which of them is optimal for a given case varies somewhat between the
Experiments.
c1 or d1 is a good protocol in each case.
EXCHSUM
The performance variation for Experiment A is moderate to high:
  (med-min)/min   0.17  0.14  0.10  0.13  0.14  0.07
  (max-min)/min   0.40  0.29  0.20  0.22  0.29  0.16
The variation is consistent across Experiments A, B, and C.
For each case, the optimal protocol is clearly better than the rest.
Unfortunately, which protocol is optimal varies with the case. However,
the set of optimal protocols is identical across Experiments A, B, and C.
a1, c1, or c2 is optimal for each case.
HALFSUM
The performance variation for Experiment A is moderate to high:
  (med-min)/min   0.18  0.22  0.09  0.14  0.17  0.08
  (max-min)/min   0.35  0.53  0.19  0.32  0.39  0.16
This variation is consistent with that for Experiment B. The variation
for Experiment C is smaller.
Only two of the protocols have acceptable performance, and they are
consistent across the Experiments.
a1 or c1 is optimal for each case.
There is no clear pattern as to which is better when, but neither is ever
worse than second best.
RINGPIPE
The performance variation for Experiment A is moderate:
  (med-min)/min   0.07  0.12  0.07  0.08  0.10  0.11
  (max-min)/min   0.29  0.45  0.16  0.23  0.73  0.21
Variation decreases slightly for Experiments B and C, but is of the same
general pattern.
Most of the protocols are reasonable performers, but
the optimal protocols are clearly separated from the
rest, and are reasonably consistent across the three Experiments.
c1 or f2 is a good protocol in almost all cases.
RINGSUM
The Experiment A performance variation is moderate to high:
  (med-min)/min   0.24  0.29  0.09  0.13  0.20  0.06
  (max-min)/min   0.37  0.47  0.17  0.23  0.28  0.11
This variation is nearly identical to that for Experiment B.
The variation generally decreases for Experiment C.
There is one protocol that is near-optimal in all cases and for all
Experiments, and it is clearly separated from most of the rest of the
protocols.
c1 is near-optimal in all cases.
LOGTRANS (1)
The Experiment A performance variation
for logtrans (1) is moderate to high:
  (med-min)/min   0.06  0.09  0.16  0.13  0.08
  (max-min)/min   0.18  0.18  0.34  0.29  0.16
Experiment B performance variation is similar in the median, while Experiment
C shows smaller variation.
The optimal protocols are clearly separated from the
rest, and reasonable agreement exists in the set of optimal protocols across
Experiments A, B, and C.
a1 or c1 is optimal in nearly all cases.
LOGTRANS (2)
The Experiment A performance variation of logtrans (2) is low to moderate:
  (med-min)/min   0.12  0.04  0.11  0.14  0.12  0.08
  (max-min)/min   0.23  0.71  0.24  0.32  0.27  0.16
This variation is similar to that for Experiments B and C.
The optimal protocols are clearly separated from the
rest, and reasonable agreement exists in the set of optimal protocols across
Experiments A, B, and C.
a1 or c1 is optimal in nearly all cases.
LOGTRANS (3)
The Experiment A performance variation is moderate:
  (med-min)/min   0.09  0.12  0.07  0.07  0.07  0.10
  (max-min)/min   0.15  0.15  0.11  0.10  0.13  0.15
The variation is consistent with Experiment B and somewhat larger than that
for Experiment C.
Most of the protocols are reasonable performers, but
the optimal protocols are separated from the
rest and consistent across Experiments A, B, and C.
a1 or c1 is optimal in nearly all cases.
SRTRANS (1)
The Experiment A performance variation is low to moderate:
  (med-min)/min   0.08  0.07  0.08  0.07  0.06
  (max-min)/min   0.16  0.12  0.20  0.18  0.13
The variation decreases for Experiments B and C.
Most of the protocols are reasonable performers, but
the optimal protocols are still easily identified and
consistent across Experiments A, B, and C.
a1 or c1 is a good protocol for each case.
SRTRANS (2)
The Experiment A performance variation for srtrans (2) is moderate:
  (med-min)/min   0.08  0.08  0.08  0.08  0.07  0.12
  (max-min)/min   0.17  0.18  0.15  0.17  0.15  0.19
Again, the variation is somewhat less in Experiments B and C than in A.
The optimal protocols are well separated from the
rest. The set of good protocols is consistent across Experiments A, B, and C,
although which of them is optimal for a given case is not.
a1 or c1 is optimal in each case.
SRTRANS (3)
The Experiment A performance variation is moderate to high:
  (med-min)/min   0.16  0.19  0.07  0.07  0.07  0.06
  (max-min)/min   0.27  0.61  0.11  0.14  0.13  0.11
The variation is slightly less in Experiment C than in Experiments A and B,
which are very similar.
The optimal protocols are well separated from the
rest, and the set of good protocols is consistent across Experiments A, B,
and C.
a1 or c1 is near-optimal in all cases.
SWTRANS (1)
The Experiment A performance variation is low to moderate:
  (med-min)/min   0.08  0.05  0.08  0.06  0.06
  (max-min)/min   0.15  0.10  0.19  0.63  0.11
The variation decreases for Experiments B and C.
Many of the protocols are good performers, but
the optimal protocols are clearly identifiable.
The set of optimal protocols is similar across Experiments A, B, and C.
a1 or c1 is a good protocol for nearly all cases.
SWTRANS (2)
The Experiment A performance variation for swtrans (2) is moderate:
  (med-min)/min   0.08  0.08  0.06  0.07  0.06  0.12
  (max-min)/min   0.15  0.15  0.13  0.16  0.30  0.16
The variation in Experiments B and C is
slightly less than that in A.
The optimal protocols are clearly identifiable, and are
similar across Experiments A, B, and C.
a1 or c1 is a good protocol for each case.
SWTRANS (3)
The Experiment A performance variation for swtrans (3) is high
for the smallest granularity cases, and low to moderate otherwise:
  (med-min)/min   0.16  0.21  0.05  0.07  0.09  0.05
  (max-min)/min   0.27  0.32  0.10  0.13  0.13  0.09
The variation in Experiment B is similar to that in A.
The variation in Experiment C is smaller.
The optimal protocols are clearly identifiable, and are
similar across Experiments A, B, and C.
a1 or c1 is a good protocol for each case.
Some general rules of thumb can be derived from the above data.
For ring algorithms, c1 works best for the smaller
granularities. For larger granularities, the different overlap possibilities
of the two algorithms make comparisons impossible.
a1 and c1 are good protocols to examine for
all of the transpose algorithms.
halfsum has protocol sensitivities similar to those of the
transpose algorithms.
The dfft and exchsum sensitivities are not like the
others, nor similar to each other.
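The per-algorithm recommendations above can be collected into a small lookup table, for example to seed an empirical tuning run with a short list of protocols to time first. The sketch below is illustrative only: the dictionary and function names are hypothetical, the three variants of each transpose algorithm are merged under one key, and the protocol labels are copied from the summaries in this section. Since the system software changes quickly, such a list is a starting point for new timings rather than a substitute for them.

```python
# Candidate protocols named in the per-algorithm summaries of this section.
CANDIDATES = {
    "dfft": ["c1", "d1"],
    "exchsum": ["a1", "c1", "c2"],
    "halfsum": ["a1", "c1"],
    "ringpipe": ["c1", "f2"],
    "ringsum": ["c1"],
    "logtrans": ["a1", "c1"],
    "srtrans": ["a1", "c1"],
    "swtrans": ["a1", "c1"],
}

def protocols_to_try(algorithm):
    """Return the short list of protocols worth timing first."""
    key = algorithm.lower()
    if key not in CANDIDATES:
        raise KeyError(f"no summary available for {algorithm!r}")
    return list(CANDIDATES[key])  # copy, so callers cannot alter the table

print(protocols_to_try("RINGSUM"))  # ['c1']
print(protocols_to_try("exchsum"))  # ['a1', 'c1', 'c2']
```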