The SGI/Cray Research T3E-900 is a distributed-memory parallel
architecture built around a high-performance 3D torus interconnect.
The T3E used in these experiments is a production machine managed
by the National Energy Research Scientific Computing Center (NERSC).
While processors are not shared, the processor configuration and the
effect of other users on interprocessor communication performance
are not directly controllable. To minimize these effects, larger-than-necessary
partitions were requested for these runs, and the complete
experimental suite was rerun if evidence of unacceptable timing perturbations
was found. The deterministic nature of the T3E timings when isolated from
external effects made the identification of perturbations fairly simple.
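The text does not state how perturbations were recognized, so the following is only a minimal sketch of one plausible check. It assumes each configuration was timed several times and flags a suite for rerun when any configuration's repeated timings disagree by more than a tolerance. The names `is_perturbed` and `suite_needs_rerun`, the 2% tolerance, and the timings are all hypothetical.

```python
def is_perturbed(repeat_timings, tolerance=0.02):
    """Return True if repeated timings of one configuration disagree.

    Assumption: on an unperturbed machine the repeats are nearly identical,
    so a relative spread above `tolerance` (hypothetical value) signals
    interference from other users.
    """
    fastest = min(repeat_timings)
    if fastest <= 0:
        raise ValueError("timings must be positive")
    return (max(repeat_timings) - fastest) / fastest > tolerance


def suite_needs_rerun(suite, tolerance=0.02):
    """Return True if any configuration in the suite looks perturbed."""
    return any(is_perturbed(t, tolerance) for t in suite.values())


# Hypothetical repeated timings (seconds), keyed by protocol name.
clean_suite = {"a1": [1.00, 1.01, 1.00], "c2": [1.20, 1.21, 1.20]}
noisy_suite = {"a1": [1.00, 1.01, 1.00], "c2": [1.20, 1.50, 1.20]}

print(suite_needs_rerun(clean_suite))
print(suite_needs_rerun(noisy_suite))
```

A relative spread is used rather than an absolute one so that the same tolerance can be applied to both short and long runs.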
In these experiments we examine the protocol sensitivity of an
application-specific communication library built on top of the
SHMEM one-sided communication operations get and put.
This library attempts to reduce communication overhead and latency,
and to increase bandwidth, relative to more general libraries
such as MPI and PVM.
The most important results from these SHMEM protocol experiments are as follows.
The choice of communication protocol can be important for optimizing
performance. In particular, identifying and avoiding "bad" protocols
may be important to achieving good performance.
The choice of protocol is most important in the cases employing large
numbers of processors.
For the most part, the optimal protocols not exploiting overlap
use nonblocking send (based on get),
while the overlap protocols use nonblocking receive (based on put).
Overlap techniques are useful for most algorithms, and are especially
important for ringpipe.
The Experiment A results (essentially contention free) do not differ
significantly from the Experiment B and C results (which attempt to include
contention and more realistic process placement) in that the optimal
protocols are the same and relative performance variation stays the same or
decreases for Experiments B and C. This is the hoped-for result for a
distributed-memory architecture with a high-performance network, and it makes it
easier to trust traditional low-level benchmark results when determining
tuning parameters.
Below, we summarize the results specific to each parallel algorithm.
To indicate the variation in performance over the set of SHMEM communication
protocols, we give two statistics for each of the Experiment A problem cases:
the relative difference between the median and minimum timings, (med-min)/min,
and the relative difference between the maximum and minimum timings,
(max-min)/min.
The data is presented in a table for each parallel algorithm.
The cases are not labelled in the table, but are listed in the following order:
T42 (P=16, 32, 8); T85 (P=16, 32, 8).
For brevity, we also describe the performance sensitivity as low,
moderate, or high if the median-based statistic is at most 5%, between 5% and 15%,
or at least 15%, respectively.
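The two statistics and the low/moderate/high labels described above can be sketched as follows. This is an illustration only: the function names and the timings are hypothetical, and the timings stand in for one problem case with one entry per protocol.

```python
from statistics import median


def variation_stats(timings):
    """Return ((med-min)/min, (max-min)/min) for one problem case.

    `timings` holds one execution time per communication protocol.
    """
    t_min = min(timings)
    if t_min <= 0:
        raise ValueError("timings must be positive")
    med_stat = (median(timings) - t_min) / t_min
    max_stat = (max(timings) - t_min) / t_min
    return med_stat, max_stat


def sensitivity(med_stat):
    """Label the median-based statistic using the 5% and 15% thresholds."""
    if med_stat <= 0.05:
        return "low"
    if med_stat < 0.15:
        return "moderate"
    return "high"


# Hypothetical timings (seconds), one per protocol, for a single case.
timings = [2.0, 2.1, 2.2, 2.4, 3.0]
med_stat, max_stat = variation_stats(timings)
print(round(med_stat, 2), round(max_stat, 2), sensitivity(med_stat))
```

Only the median-based statistic drives the label; the maximum-based statistic shows how much is lost by choosing the worst protocol.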
The following observations apply to all of the algorithm results
and are listed here to reduce repetition:
The percentage variation in performance
is higher for Experiment A than for Experiments B and C.
The majority of the protocols perform reasonably well, but the optimal
protocols are clearly identified and some protocols perform
poorly for certain cases. (Exceptions to this are noted
for particular algorithms.)
The performance variation is roughly proportional to the number of
processors (1D) and inversely proportional to the problem granularity.
(The details of the relationship vary between the different parallel
algorithms, and the relationship is less pronounced for the transpose algorithms.)
DFFT
The Experiment A performance variation is moderate for the small
granularity cases, and low otherwise:
  (med-min)/min   0.11  0.07  0.02  0.05  0.01  0.03
  (max-min)/min   0.17  0.13  0.11  0.09  0.09  0.09
The set of optimal protocols is consistent across
Experiments A, B, and C, but which protocol is best in which case
varies.
a1 or d1 are good protocols in all cases.
a1 tends to be better for the small granularity cases, and d1
tends to be better for the large granularities.
EXCHSUM
The Experiment A performance variation is low for the largest
granularity, and moderate to high otherwise:
  (med-min)/min   0.12  0.16  0.08  0.08  0.08  0.04
  (max-min)/min   0.56  0.60  0.25  0.37  0.54  0.19
The set of good protocols is the same across
Experiments A, B, and C, but which protocol is best in which case
varies.
c2 is a good protocol in all cases.
a1 is optimal in the one case in which c2 is not
within 1% of the optimal.
HALFSUM
The Experiment A performance variation is low for the large granularity
case, moderate for the medium granularity cases, and high for the small
granularity cases:
  (med-min)/min   0.17  0.18  0.04  0.07  0.08  0.03
  (max-min)/min   0.22  0.24  0.08  0.13  0.22  0.07
The good protocols are the same across
Experiments A, B, and C.
a1 is a good protocol for all cases.
RINGPIPE
The Experiment A performance variation is moderate to high:
  (med-min)/min   0.13  0.10  0.08  0.11  0.17  0.09
  (max-min)/min   0.62  0.40  0.20  0.27  0.38  0.14
The best protocols are consistent across Experiments A, B,
and C.
a1 or i2 are good protocols for the small granularity cases.
i2 is a good protocol for the medium and large granularity cases.
RINGSUM
The Experiment A performance variation is low for the large granularity
case, moderate for the medium granularity cases, and high for the small
granularity cases:
  (med-min)/min   0.20  0.21  0.07  0.08  0.13  0.04
  (max-min)/min   0.25  0.28  0.12  0.14  0.22  0.07
The best protocols are consistent across Experiments A, B,
and C, but which protocol is best in which case
varies.
c2 is a good protocol for all cases for Experiments B and C.
a1 is optimal in three of the six cases for Experiment A,
with c2 optimal in the other cases.
LOGTRANS (1)
The Experiment A performance variation of logtrans (1) is low to
moderate:
  (med-min)/min   0.09  0.05  0.07  0.08  0.04
  (max-min)/min   0.16  0.17  0.18  0.18  0.15
The set of good protocols is consistent across Experiments A, B,
and C, but which protocol is best in which case
varies.
a1 or c2 are good protocols for all cases.
LOGTRANS (2)
The Experiment A performance variation of logtrans (2) is low to
moderate:
  (med-min)/min   0.08  0.05  0.06  0.06  0.05  0.04
  (max-min)/min   0.24  0.20  0.22  0.24  0.16  0.23
The set of good protocols is consistent across Experiments A, B,
and C, but which protocol is best in which case
varies.
a1 or c2 are good protocols for all cases.
LOGTRANS (3)
The Experiment A performance variation of logtrans (3) is
moderate for the small granularity cases, and low otherwise:
  (med-min)/min   0.13  0.11  0.04  0.03  0.05  0.02
  (max-min)/min   0.54  0.24  0.18  0.17  0.19  0.14
The good protocols are consistent across Experiments A, B,
and C.
a1 or c2 are good protocols for all cases.
SRTRANS (1)
The Experiment A performance variation
of srtrans (1) is low to moderate:
  (med-min)/min   0.12  0.04  0.06  0.07  0.02
  (max-min)/min   0.17  0.09  0.09  0.09  0.04
The set of good protocols is consistent across Experiments A, B,
and C, but which protocol is best in which case
varies.
c1 is a good protocol for almost all cases.
a1 is a good protocol for the remaining cases.
SRTRANS (2)
The Experiment A performance variation
of srtrans (2) is moderate for the small granularity cases, and low
otherwise:
  (med-min)/min   0.08  0.09  0.02  0.04  0.03  0.02
  (max-min)/min   0.16  0.22  0.09  0.11  0.09  0.04
The set of good protocols is similar across Experiments A, B,
and C, but which protocol is best in which case
varies.
c1 is a good protocol for almost all cases.
a1 is a good protocol for the remaining cases.
SRTRANS (3)
The Experiment A performance variation is moderate to high for the small
granularity cases, and low otherwise:
  (med-min)/min   0.13  0.18  0.03  0.03  0.04  0.01
  (max-min)/min   0.77  0.27  0.08  0.06  0.09  0.02
The set of good protocols is consistent across Experiments A, B,
and C, but which protocol is best in which case
varies.
c1 is a good protocol for almost all cases.
a1 is a good protocol for the remaining cases.
SWTRANS (1)
The Experiment A performance variation
of swtrans (1) is moderate for the small
granularity cases, and low otherwise:
  (med-min)/min   0.07  0.02  0.04  0.04  0.01
  (max-min)/min   0.17  0.06  0.06  0.07  0.03
The set of good protocols is similar across Experiments A, B,
and C, but so many of the protocols perform well that it is difficult to
identify a small number of "best" protocols.
a1, c1, or c2 are good protocols in all cases.
SWTRANS (2)
The Experiment A performance variation
of swtrans (2) is moderate for the small
granularity cases, and low otherwise:
  (med-min)/min   0.06  0.08  0.02  0.02  0.02  0.01
  (max-min)/min   0.45  0.18  0.07  0.08  0.06  0.03
The set of good protocols is similar across Experiments A, B,
and C, but there are differences. On the other hand, so many of the protocols
perform well that it is difficult to identify a small number of "best"
protocols.
a1, c1, or c2 are good protocols in most cases.
SWTRANS (3)
The Experiment A performance variation is moderate to high for the small
granularity cases, and low otherwise:
  (med-min)/min   0.14  0.24  0.02  0.02  0.03  0.01
  (max-min)/min   0.21  0.31  0.05  1.64  0.08  0.02
The good protocols are consistent across Experiments A, B,
and C.
c1 or c2 are good protocols in most cases.
In addition to those mentioned earlier,
some general rules of thumb appear to apply.
Each of the algorithms has its own peculiarities and sensitivities, but
in general a1, c1, and c2 are the best protocols.
The two exceptions to this rule are the "overlap intensive" algorithms.
For dfft, the overlap protocol d1 should be examined.
For ringpipe, the overlap protocol i2 should be examined.
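These rules of thumb can be encoded as a small lookup that returns the protocols worth benchmarking first for a given algorithm. This is a sketch only: the protocol labels (a1, c1, c2, d1, i2) and algorithm names come from the text, while the function and variable names are hypothetical.

```python
# Protocols that are generally among the best, per the rules of thumb.
GENERAL_CANDIDATES = ("a1", "c1", "c2")

# Additional overlap protocols to examine for the "overlap intensive"
# algorithms named in the text.
OVERLAP_EXTRAS = {
    "dfft": ("d1",),
    "ringpipe": ("i2",),
}


def candidate_protocols(algorithm):
    """Return the protocols to try first for `algorithm` (case-insensitive)."""
    name = algorithm.lower()
    return list(GENERAL_CANDIDATES) + list(OVERLAP_EXTRAS.get(name, ()))


print(candidate_protocols("dfft"))
print(candidate_protocols("ringpipe"))
print(candidate_protocols("exchsum"))
```

The lookup only narrows the search. Because each algorithm has its own sensitivities, the candidates would still need to be timed on the target problem sizes and processor counts.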