The SGI/Cray Research Origin2000 is a Non-Uniform Memory Access (NUMA)
shared memory architecture. The SHMEM library is built on top
of the shared memory. We use the SHMEM one-sided communication
operations (get and put) to construct a simple message passing
library that is sufficient for the PSTSWM application. It does not implement
the more general (and complex) semantics of standard message-passing
libraries such as MPI and PVM.
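As a rough illustration of how two-sided send/receive semantics can be layered on a one-sided put, the following toy simulation may help. It is not the library used in these experiments and does not call SHMEM: the names ToyPE, put, send, recv, and demo are hypothetical, a Python list stands in for remotely writable memory, and a threading.Event stands in for a flag word that the receiver polls. Real SHMEM code would also have to guarantee that the payload is visible before the flag is.

```python
import threading

class ToyPE:
    """One simulated processing element with a remotely writable mailbox."""
    def __init__(self, nslots):
        self.buffer = [None] * nslots   # stands in for remotely accessible memory
        self.ready = threading.Event()  # stands in for a flag word the receiver waits on

def put(target, data):
    """One-sided write: only the caller acts; the target does not participate."""
    target.buffer[:len(data)] = data

def send(target, data):
    """Send built from a put: deposit the payload first, then raise the flag."""
    put(target, data)
    target.ready.set()

def recv(me, count):
    """Receive: wait for the flag, copy the payload out, then reset the flag."""
    me.ready.wait()
    data = me.buffer[:count]
    me.ready.clear()
    return data

def demo():
    """Send three values to a simulated PE and return what it received."""
    pe1 = ToyPE(4)
    result = []
    receiver = threading.Thread(target=lambda: result.append(recv(pe1, 3)))
    receiver.start()
    send(pe1, [1.0, 2.0, 3.0])
    receiver.join()
    return result[0]
```

The point of the sketch is only that the receiver never issues a matching call to move data; it merely waits for a flag, which is what allows such a library to be much simpler than a general MPI or PVM implementation.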
The Origin2000s used in these experiments are located
at Los Alamos National Laboratory, where they are being used to construct
a very large parallel system. In particular, the 128 processor configuration
used in these experiments was not a supported commercial product at the time
of these experiments. As a consequence, the system software changes
more rapidly than at most other Origin2000 sites, and the useful "lifespan"
of our experimental data may be relatively short.
The most important results from the protocol experiments are the following.
The choice of communication protocol can be important for optimizing
performance.
The optimal choice is a (complicated) function of the problem size and
number of processors.
Overlap techniques are useful for most algorithms.
For the ring algorithms, all optimal protocols use overlap techniques.
Experiments B and C attempt to introduce contention and realistic
process placement, but there is little indication that these are factors in
the experiments.
Experiments B and C show less relative protocol sensitivity than
Experiment A for some of the parallel algorithms, probably due to the greater
overall runtime from using the two dimensional data decomposition.
However, the optimal communication protocols are sometimes different
between Experiment A and Experiments B and C, so the explanation may be more
complicated than this.
In the following we summarize the parallel algorithm specific results.
To indicate the variation in performance over the set of SHMEM communication
protocols, we give two statistics for each of the Experiment A problem cases:
the relative difference between the median and minimum timings,
(med-min)/min, and the relative difference between the maximum and minimum
timings, (max-min)/min.
The data is presented in a table for each parallel algorithm.
The cases are not labelled in the table, but are listed in the following order:
T42 (P=16, 32, 8); T85 (P=16, 32, 8).
For brevity, we also describe the performance sensitivity to be low,
moderate, or high if the median-based statistic is <= 5%, between 5% and 15%,
or >= 15%, respectively.
Note that all timings were taken on a dedicated system, and
performance variation is not a function of interaction with other users.
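The two statistics and the low/moderate/high labels defined above can be computed as in the following minimal sketch. The function names and the sample timings are hypothetical and are not taken from the experiments; the thresholds follow the definition in the text (low at or below 5%, high at or above 15%, moderate in between).

```python
from statistics import median

def variation_stats(timings):
    """Return ((med-min)/min, (max-min)/min) for one problem case,
    given one timing per communication protocol."""
    fastest = min(timings)
    med_rel = (median(timings) - fastest) / fastest
    max_rel = (max(timings) - fastest) / fastest
    return med_rel, max_rel

def sensitivity(med_rel):
    """Label a case using the median-based statistic."""
    if med_rel <= 0.05:
        return "low"
    if med_rel < 0.15:
        return "moderate"
    return "high"

# Hypothetical timings (seconds) for five protocols on one problem case.
hypothetical_timings = [1.00, 1.04, 1.08, 1.10, 1.25]
med_rel, max_rel = variation_stats(hypothetical_timings)
label = sensitivity(med_rel)  # "moderate" for these made-up numbers
```

Because the label depends only on the median-based statistic, a case can be labelled moderate even when its slowest protocol is far from the fastest one, which is why both statistics are reported.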
DFFT
The performance variation for Experiment A is moderate:
  (med-min)/min   0.06  0.07  0.13  0.12  0.10  0.10
  (max-min)/min   0.16  0.22  0.24  0.24  0.24  0.20
and the variation is very similar across Experiments A, B, and C.
Most of the protocols are reasonable performers, but the optimal
protocols are clearly separated from the rest. The
set of good protocols is consistent across Experiments A, B, and C, but
which of them is optimal in a given case varies somewhat between the Experiments.
a2 is the best protocol for a few small granularity cases.
d1 is a good protocol in the remaining cases.
EXCHSUM
The performance variation for Experiment A is moderate to high:
  (med-min)/min   0.17  0.20  0.14  0.10  0.16  0.06
  (max-min)/min   0.32  0.31  0.30  0.20  0.27  0.14
The variation is again relatively constant across Experiments A, B, and C.
Most of the protocols are reasonable performers, but the optimal
protocols are clearly separated from the rest. The
good protocols are identical across Experiments A, B, and C.
c2 is the best protocol for the small
problem size cases.
a2 is the best protocol for the large
problem size cases.
HALFSUM
The performance variation for Experiment A is moderate to high:
  (med-min)/min   0.21  0.30  0.14  0.20  0.34  0.08
  (max-min)/min   0.33  0.46  0.21  0.31  0.52  0.15
Variation decreases slightly for Experiments B and C.
Most of the protocols are reasonable performers, but some are
unacceptably bad. The optimal protocols are clearly separated from the
rest. The set of good protocols is identical across Experiments A, B, and C.
a1 and c1 are good protocols for all cases.
There is no clear pattern as to which is better when, but neither is ever
worse than second best.
RINGPIPE
The performance variation for Experiment A is moderate:
  (med-min)/min   0.07  0.11  0.10  0.11  0.11  0.08
  (max-min)/min   0.28  0.40  0.19  0.26  0.30  0.18
Variation decreases slightly for Experiments B and C.
Most of the protocols are reasonable performers, but
the optimal protocols are clearly separated from the
rest. The set of good protocols is identical across Experiments A, B, and C.
c1 or f2 are good protocols in almost all cases.
In general, c1 is better for the smaller granularity cases and
f2 is better for the larger granularity cases, but there are
exceptions to this rule.
RINGSUM
The Experiment A performance variation is moderate to high:
  (med-min)/min   0.26  0.34  0.12  0.16  0.21  0.07
  (max-min)/min   0.41  0.52  0.18  0.27  0.36  0.10
The variation generally decreases for Experiments B and C.
There is only one optimal protocol, which is the same for
Experiments A, B, and C, and it is clearly separated from the
rest.
The highest variation occurs with the cases employing large numbers of
processors (1D), where the virtual processor ring is long.
c1 is the optimal protocol in all cases.
LOGTRANS (1)
The Experiment A performance variation
for logtrans (1) is moderate to high:
  (med-min)/min   0.13  0.09  0.15  0.19  0.10
  (max-min)/min   0.21  0.18  0.28  0.35  0.13
The variation decreases for Experiments B and C.
Most of the protocols are reasonable performers, but
the optimal protocols are clearly separated from the
rest. The set of good protocols is identical across Experiments A, B, and C.
a1 or c1 are optimal in all cases, and are essentially
indistinguishable. a1 is a slightly more consistent performer, never
being less than second best.
LOGTRANS (2)
The Experiment A performance variation of logtrans (2) is moderate to high:
  (med-min)/min   0.13  0.06  0.12  0.15  0.26  0.10
  (max-min)/min   0.23  0.12  0.22  0.25  0.32  0.15
The variation does not differ between Experiments A, B, and C.
Most of the protocols are reasonable performers, but
the optimal protocols are clearly separated from the
rest. The set of good protocols is identical across Experiments A, B, and C.
a1 or c1 are good protocols in all cases. Unlike for
logtrans (1), they are not indistinguishable. Both should be examined
when tuning.
LOGTRANS (3)
The Experiment A performance variation is moderate:
  (med-min)/min   0.10  0.11  0.07  0.08  0.10  0.11
  (max-min)/min   0.13  0.15  0.10  0.12  0.14  0.15
and the variation is consistent across Experiments A, B, and C.
Most of the protocols are reasonable performers, but
the optimal protocols are generally separated from the
rest. The good protocols are similar across Experiments A, B, and C.
a1 and c1 are good protocols for the small granularity
cases.
a2 is a good protocol for the large granularity cases.
SRTRANS (1)
The Experiment A performance variation is low to moderate:
  (med-min)/min   0.10  0.07  0.07  0.08  0.05
  (max-min)/min   0.20  0.10  0.17  0.23  0.10
and the variation
decreases for Experiments B and C.
Most of the protocols are reasonable performers, and
the optimal protocols are not always well separated from the
rest. The set of good protocols is consistent across Experiments A, B, and C.
a1 is a good protocol for all cases.
SRTRANS (2)
The Experiment A performance variation for srtrans (2) is moderate:
  (med-min)/min   0.09  0.11  0.09  0.07  0.07  0.10
  (max-min)/min   0.17  0.27  0.15  0.13  0.21  0.15
Again, the variation is less in Experiments B and C than in A.
Most of the protocols are reasonable performers, but
the optimal protocols are well separated from the
rest. The set of good protocols is consistent across Experiments A, B, and C,
although which protocol is optimal in a given case is not.
a1 or c1 is optimal in all cases. They are not
indistinguishable, however, and there is no pattern across the different
experiments (A, B, C) to determine which to use when.
SRTRANS (3)
The Experiment A performance variation is moderate to high:
  (med-min)/min   0.14  0.19  0.07  0.06  0.08  0.06
  (max-min)/min   0.25  0.35  0.12  0.11  0.14  0.09
The variation is slightly less in Experiments B and C than in Experiment A.
Most of the protocols are reasonable performers, but
the optimal protocols are well separated from the
rest. The set of good protocols is consistent across Experiments A, B, and C.
a1 or c1 is optimal in all cases. They are not
indistinguishable, however, and there is no obvious pattern across the
different experiments (A, B, C) to determine which to use when.
SWTRANS (1)
The performance variation is low to moderate:
  (med-min)/min   0.11  0.07  0.07  0.09  0.05
  (max-min)/min   0.20  0.09  0.17  0.22  0.08
and the variation decreases for Experiments B and C.
Most of the protocols are good performers, but
the optimal protocols are clearly identifiable.
The set of optimal protocols is similar across Experiments A, B, and C.
a1 is a good protocol in all cases.
SWTRANS (2)
The Experiment A performance variation for swtrans (2) is moderate:
  (med-min)/min   0.09  0.12  0.08  0.08  0.07  0.09
  (max-min)/min   0.16  0.23  0.14  0.12  0.14  0.13
The variation in Experiments B and C is
similar to or slightly less than that in A.
Most of the protocols are reasonable performers, but
the optimal protocols are clearly identifiable.
The set of optimal protocols is similar across Experiments A, B, and C.
a1 is a good protocol in all cases for Experiments B and C.
For Experiment A, a1 and c1 are both good protocols, but
there is no obvious pattern as to which to use when.
SWTRANS (3)
The Experiment A performance variation for swtrans (3) is high
for the smallest granularity cases, and low to moderate otherwise:
  (med-min)/min   0.16  0.21  0.07  0.07  0.09  0.05
  (max-min)/min   0.24  0.33  0.12  0.10  0.18  0.09
The variation in Experiments B and C is similar to that in A.
Most of the protocols are good performers, but
the optimal protocols are clearly identifiable.
The set of optimal protocols is similar across Experiments A, B, and C.
a1 is a good protocol in all cases for Experiments B and C.
For Experiment A, c1 is the best protocol in all but the largest
granularity case, in which a1 is the best protocol.
Some general rules of thumb can be derived from the above data.
For ring algorithms, c1 works best for the smaller
granularities. For larger granularities, the different overlap possibilities
of the two algorithms make comparisons impossible.
a1 and c1 are good protocols to examine for
almost all of the transpose algorithms. Including a2 as
well covers all of the different cases.
halfsum has protocol sensitivities similar to those of the
transpose algorithms.
The dfft and exchsum sensitivities are not like any of the
others, nor similar to each other.
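One way to use these rules of thumb when tuning is to time only a short list of candidate protocols per algorithm rather than the full set. The sketch below paraphrases the per-algorithm summaries above as a lookup table and picks the fastest candidate. The names SHORTLIST and pick_protocol, and the timing callable passed in, are hypothetical; the table is a reading of this section, not an output of the experiments.

```python
# Candidate protocols per algorithm, paraphrased from the summaries above.
SHORTLIST = {
    "dfft": ["a2", "d1"],
    "exchsum": ["c2", "a2"],
    "halfsum": ["a1", "c1"],
    "ringpipe": ["c1", "f2"],
    "ringsum": ["c1"],
    "logtrans": ["a1", "c1", "a2"],
    "srtrans": ["a1", "c1"],
    "swtrans": ["a1", "c1"],
}

def pick_protocol(algorithm, time_protocol):
    """Return the shortlisted protocol with the smallest measured time.

    time_protocol is a caller-supplied (hypothetical) function that runs the
    algorithm with the named protocol and returns its runtime in seconds.
    """
    return min(SHORTLIST[algorithm], key=time_protocol)

# Usage with made-up timings standing in for real measurements.
fake_timings = {"c1": 2.0, "f2": 1.5}
best = pick_protocol("ringpipe", fake_timings.get)
```

Since the optimal choice depends on problem size and number of processors, such a selection would need to be repeated for each problem case of interest rather than made once per algorithm.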