The Cray Research T3D is a distributed-memory parallel
architecture built around a high-performance 3D torus interconnect.
The T3D used in these experiments was a development machine at
Cray Research in Eagan, MN.
While processors were not shared, the processor configuration and the
effect of other users on interprocessor communication performance
were not directly controllable unless dedicated time was requested.
To minimize these effects, tests were repeated to identify
(and eliminate) bad timings.
The deterministic nature of the T3D timings when isolated from
external effects made the identification of perturbations fairly simple.
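As an illustration of how repeated tests can expose perturbed runs, the following minimal sketch rejects any timing that exceeds the fastest repeat by more than a relative tolerance. The rejection rule, the 5% default, the function name filter_timings, and the sample timings are all assumptions made for this example; the text does not specify the procedure that was actually used.

```python
def filter_timings(samples, tolerance=0.05):
    """Split repeated timings into kept and rejected runs.

    A run is rejected when it exceeds the fastest run by more than
    `tolerance` (relative). The rule and the default tolerance are
    assumptions for illustration, not the original procedure.
    """
    if not samples:
        raise ValueError("need at least one timing")
    best = min(samples)
    kept = [t for t in samples if (t - best) / best <= tolerance]
    rejected = [t for t in samples if (t - best) / best > tolerance]
    return kept, rejected


# Hypothetical repeated timings (seconds) of one test; one run is perturbed.
runs = [1.002, 1.000, 1.001, 1.370, 1.003]
kept, rejected = filter_timings(runs)
print("kept:", kept)
print("rejected:", rejected)
```

Because unperturbed timings are nearly deterministic, a simple threshold relative to the minimum is likely to be enough; a noisier system would need a more careful outlier test.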
In these experiments we examine the protocol sensitivity of an
application-specific communication library built on top of the
SHMEM one-sided communication operations get and put.
This library attempts to decrease the communication overhead and latency
and increase the bandwidth over that of the more general libraries
such as MPI and PVM.
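To make the one-sided nature of put and get concrete, here is a toy model in which either operation is performed entirely by the initiating processor, with no matching call on the remote side. This is not the SHMEM API: the class ToyMachine, its methods, and its arguments are hypothetical names used only to illustrate the semantics.

```python
class ToyMachine:
    """Toy model of one-sided communication (NOT the SHMEM API)."""

    def __init__(self, num_pes, words_per_pe):
        # One remotely accessible buffer per processing element (PE).
        self.memory = [[0] * words_per_pe for _ in range(num_pes)]

    def _check(self, pe, offset, length):
        if not 0 <= pe < len(self.memory):
            raise IndexError("no such PE")
        if offset < 0 or offset + length > len(self.memory[pe]):
            raise IndexError("access outside the PE's memory")

    def put(self, dest_pe, offset, data):
        """The initiator writes `data` into dest_pe's memory."""
        self._check(dest_pe, offset, len(data))
        self.memory[dest_pe][offset:offset + len(data)] = data

    def get(self, src_pe, offset, length):
        """The initiator reads `length` words from src_pe's memory."""
        self._check(src_pe, offset, length)
        return list(self.memory[src_pe][offset:offset + length])


machine = ToyMachine(num_pes=2, words_per_pe=4)
machine.put(dest_pe=1, offset=0, data=[7, 8])          # push to PE 1
fetched = machine.get(src_pe=1, offset=0, length=2)    # pull from PE 1
print(fetched)
```

The model ignores timing and synchronization entirely; it only shows which side initiates each transfer.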
The results described here come from legacy data collected over three years
ago. The particular T3D used has long since been dismantled, but these results
should still accurately reflect the interprocessor communication performance
of existing T3D systems.
The data is particularly interesting as a way to examine how communication
sensitivity has changed between the T3D and the T3E.
At the time of these experiments, we were running only Experiment A.
The most important results from these SHMEM protocol experiments are as follows.
The choice of communication protocol is important for optimizing
performance, but only to identify and avoid "bad" protocols:
"bad" protocols use get, while "good" protocols use put.
The optimal protocol is always the simplest nonblocking receive protocol
(a2 for nonoverlap, and c2 or i2 for overlap).
Overlap techniques are useful in special cases, but are only strongly
indicated for ringpipe.
Below, we summarize the parallel algorithm specific results.
To indicate the variation in performance over the set of SHMEM communication
protocols, we give the relative difference between the median and minimum
timings, (med-min)/min, and the relative difference between the maximum and
minimum timings, (max-min)/min, for each of the Experiment A problem cases.
The data is presented in a table for each parallel algorithm.
The cases are not labelled in the table, but are listed in the following order:
T42 (P=16, 32, 8); T85 (P=16, 32, 8).
(The T85, P=8 case is missing for ringpipe, ringsum,
srtrans, and swtrans.)
For brevity, we also describe the performance sensitivity to be low,
moderate, or high if the median-based statistic is <= 5%, between 5% and 15%,
or >= 15%, respectively.
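The two statistics and the low/moderate/high labels can be sketched as follows. The function names and the sample timings are hypothetical; only the formulas and the 5% and 15% thresholds come from the text.

```python
from statistics import median


def sensitivity_stats(timings):
    """Return ((med-min)/min, (max-min)/min) over per-protocol timings."""
    lo = min(timings)
    return (median(timings) - lo) / lo, (max(timings) - lo) / lo


def classify(med_stat):
    """Label sensitivity from the median-based statistic."""
    if med_stat <= 0.05:
        return "low"
    if med_stat < 0.15:
        return "moderate"
    return "high"


# Hypothetical timings (seconds), one per protocol, for a single case.
timings = [2.00, 2.02, 2.04, 2.06, 2.50]
med_stat, max_stat = sensitivity_stats(timings)
print(f"(med-min)/min = {med_stat:.2f} -> {classify(med_stat)}")
print(f"(max-min)/min = {max_stat:.2f}")
```

Applied to tabulated values, classify(0.03) gives "low" and classify(0.11) gives "moderate".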
The following observations apply to all of the algorithm results
and are listed here to cut down on the repetition:
There is a sharp dichotomy between the protocols using get and
put, with little variation within the set of get-based
protocols and within the set of put-based protocols, but a
performance gap between the two. The put-based protocols are always
the better performers. This explains why the median statistics are low
while the maximum statistics are high.
Even with the bimodal distribution of the protocols, the optimum is
readily identifiable. However, most put-based protocols are
reasonable performers.
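The effect of a bimodal distribution on the two statistics can be seen in a small sketch. All protocol names and timings below are made up for illustration, and the sketch assumes that the put-based protocols outnumber the get-based ones so that the median falls in the faster cluster.

```python
from statistics import median

# Hypothetical timings (seconds): protocol name -> (operation, time).
timings = {
    "put_1": ("put", 1.00), "put_2": ("put", 1.01), "put_3": ("put", 1.02),
    "put_4": ("put", 1.02), "put_5": ("put", 1.03),
    "get_1": ("get", 1.25), "get_2": ("get", 1.26), "get_3": ("get", 1.27),
}


def spread(values):
    """Relative spread (max-min)/min of a group of timings."""
    return (max(values) - min(values)) / min(values)


put_times = [t for op, t in timings.values() if op == "put"]
get_times = [t for op, t in timings.values() if op == "get"]
all_times = put_times + get_times

fastest = min(all_times)
med_stat = (median(all_times) - fastest) / fastest
max_stat = (max(all_times) - fastest) / fastest
best_protocol = min(timings, key=lambda name: timings[name][1])

print(f"spread within put group: {spread(put_times):.3f}")
print(f"spread within get group: {spread(get_times):.3f}")
print(f"(med-min)/min = {med_stat:.3f}, (max-min)/min = {max_stat:.3f}")
print("fastest protocol:", best_protocol)
```

With tight clusters and a gap between them, the median stays close to the minimum while the maximum reflects the full put/get gap.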
DFFT
The performance variation is very low (in the median statistic):
  (med-min)/min   0.03  0.03  0.02  0.01  0.02  0.01
  (max-min)/min   0.26  0.23  0.21  0.17  0.20  0.18
a2 is a good protocol in all cases.
EXCHSUM
The performance variation is low to moderate:
  (med-min)/min   0.06  0.05  0.01  0.06  0.02  0.03
  (max-min)/min   0.43  0.52  0.16  0.30  0.41  0.14
a2 is a good protocol in all cases.
HALFSUM
The performance variation is low:
  (med-min)/min   0.04  0.05  0.02  0.02  0.03  0.01
  (max-min)/min   0.24  0.32  0.11  0.14  0.24  0.07
a2 or c2 are good protocols in all cases.
RINGPIPE
The performance variation is low for the medium granularity cases and
moderate for the small granularity cases:
  (med-min)/min   0.11  0.12  0.02  0.04  0.06
  (max-min)/min   0.29  0.34  0.12  0.16  0.28
a2 is optimal for the small granularity cases.
i2 is optimal for the medium and large granularity cases.
RINGSUM
The performance variation is low for the medium granularity cases and
moderate for the small granularity cases:
  (med-min)/min   0.06  0.07  0.01  0.02  0.03
  (max-min)/min   0.27  0.32  0.10  0.13  0.23
a2 or c2 are good protocols in all cases.
LOGTRANS (1)
The performance variation of logtrans (1) is low:
  (med-min)/min   0.03  0.02  0.01  0.03  0.01
  (max-min)/min   0.25  0.17  0.21  0.26  0.13
a2 is a good protocol in all cases.
LOGTRANS (2)
The performance variation of logtrans (2) is low:
  (med-min)/min   0.02  0.02  0.01  0.02  0.02  0.01
  (max-min)/min   0.27  0.20  0.21  0.26  0.29  0.17
a2 is a good protocol in all cases.
LOGTRANS (3)
The performance variation of logtrans (3) is low:
  (med-min)/min   0.03  0.04  0.01  0.03  0.01  0.01
  (max-min)/min   0.17  0.18  0.11  0.12  0.10  0.01
a2 or c2 are good protocols in all cases.
Note that timings for put-based protocols are missing for
case T85, P=8.
SRTRANS (1)
The performance variation of srtrans (1) is low:
  (med-min)/min   0.04  0.02  0.02  0.03
  (max-min)/min   0.16  0.11  0.12  0.12
a2 and c2 are both good protocols in all cases
(and are essentially indistinguishable).
SRTRANS (2)
The performance variation of srtrans (2) is low:
  (med-min)/min   0.04  0.05  0.03  0.04  0.04
  (max-min)/min   0.18  0.18  0.14  0.14  0.14
a2 and c2 are both good protocols in all cases
(and are essentially indistinguishable).
SRTRANS (3)
The performance variation of srtrans (3) is moderate for the small
granularity cases, and low otherwise:
  (med-min)/min   0.05  0.07  0.01  0.01  0.01
  (max-min)/min   0.16  0.17  0.08  0.05  0.05
a2 and c2 are both good protocols in all but the smallest
granularity case (and are essentially indistinguishable).
e2 is optimal in the smallest granularity case.
SWTRANS (1)
The performance variation of swtrans (1) is low for the medium
granularity cases and moderate for the small granularity case:
  (med-min)/min   0.06  0.01  0.03  0.02
  (max-min)/min   0.17  0.11  0.11  0.12
a2 is optimal for all cases.
SWTRANS (2)
The performance variation of swtrans (2) is low for the medium
granularity cases and moderate for the small granularity cases:
  (med-min)/min   0.05  0.07  0.02  0.04  0.03
  (max-min)/min   0.18  0.15  0.14  0.14  0.14
a2 is optimal for all cases.
SWTRANS (3)
The performance variation of swtrans (3) is low for the medium
granularity cases and moderate for the small granularity cases:
  (med-min)/min   0.07  0.06  0.01  0.01  0.01
  (max-min)/min   0.16  0.16  0.08  0.06  0.05
a2 is a good protocol in all but the smallest
granularity case.