PSTSWM T3D SHMEM Protocol Performance Summary

Performance Studies using PSTSWM


Cray Research T3D Protocol Performance

(SHMEM Summary - October 28, 1994)

The Cray Research T3D is a distributed-memory parallel architecture built around a high-performance 3D torus interconnect. The T3D used in these experiments was a development machine at Cray Research in Eagan, MN. Although processors were not shared, neither the processor configuration nor the effect of other users on interprocessor communication performance could be controlled directly unless dedicated time was requested. To minimize these effects, tests were repeated to identify (and eliminate) bad timings. Because T3D timings are deterministic when isolated from external effects, perturbed runs were fairly simple to identify.
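The repeat-and-discard step described above can be illustrated with a minimal sketch. The actual procedure and thresholds used in the experiments are not stated here, so the function name `filter_timings`, the 2% tolerance, and the sample timings below are all assumptions made for illustration. The idea is only that, when unperturbed timings are nearly identical, any run noticeably slower than the fastest one can be flagged as perturbed.

```python
def filter_timings(timings, rel_tol=0.02):
    """Split repeated timings into (kept, rejected).

    A run is kept if it is within rel_tol of the fastest run.
    rel_tol=0.02 is an illustrative assumption, not a value from the study.
    """
    if not timings:
        raise ValueError("need at least one timing")
    best = min(timings)
    limit = best * (1.0 + rel_tol)
    kept = [t for t in timings if t <= limit]
    rejected = [t for t in timings if t > limit]
    return kept, rejected


# Hypothetical timings (seconds) for one repeated test; one run was perturbed.
runs = [1.002, 1.000, 1.310, 1.001, 1.004]
kept, rejected = filter_timings(runs)
print("kept:", kept)
print("rejected:", rejected)
```

A tolerance relative to the minimum is used here because external interference can only slow a run down, so the fastest run is the best available estimate of the unperturbed time.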

In these experiments we examine the protocol sensitivity of an application-specific communication library built on top of the SHMEM one-sided communication operations get and put. This library attempts to decrease communication overhead and latency, and to increase bandwidth, relative to more general libraries such as MPI and PVM.

The results described here come from legacy data collected over three years ago. The particular T3D used has long since been dismantled, but these results should still accurately reflect the interprocessor communication performance of existing T3D systems. The data is particularly interesting as a way to examine how communication sensitivity has changed between the T3D and the T3E.

At the time of these experiments, we were running only Experiment A.

The most important results from these SHMEM protocol experiments are that

Below, we summarize the algorithm-specific results for each parallel algorithm. To indicate the variation in performance over the set of SHMEM communication protocols, we give the

for each of the Experiment A problem cases. The data is presented in a table for each parallel algorithm. The cases are not labelled in the table, but are listed in the following order: T42 (P=16, 32, 8); T85 (P=16, 32, 8). (The T85, P=8 case is missing for ringpipe, ringsum, srtrans, and swtrans.) For brevity, we also describe the performance sensitivity to be low, moderate, or high if the median-based statistic is <= 5%, between 5% and 15%, or >= 15%, respectively.
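The case ordering and the low/moderate/high thresholds above can be written down as a short sketch. The names `CASES`, `NO_T85_P8`, `classify_sensitivity`, and `label_row` are hypothetical, and the percentages in the usage example are made up purely to exercise the code; they are not results from these experiments. The input is assumed to be the median-based statistic expressed as a percentage.

```python
# Column order used in the tables: T42 (P=16, 32, 8); T85 (P=16, 32, 8).
CASES = ["T42, P=16", "T42, P=32", "T42, P=8",
         "T85, P=16", "T85, P=32", "T85, P=8"]

# Algorithms for which the T85, P=8 case is missing.
NO_T85_P8 = {"ringpipe", "ringsum", "srtrans", "swtrans"}


def classify_sensitivity(percent):
    """Map a percentage to 'low' (<= 5), 'moderate' (5 to 15), or 'high' (>= 15)."""
    if percent <= 5.0:
        return "low"
    if percent >= 15.0:
        return "high"
    return "moderate"


def label_row(algorithm, values):
    """Attach case labels and sensitivity classes to one unlabelled table row."""
    cases = CASES[:-1] if algorithm.lower() in NO_T85_P8 else CASES
    if len(values) != len(cases):
        raise ValueError(f"expected {len(cases)} values for {algorithm}")
    return [(c, v, classify_sensitivity(v)) for c, v in zip(cases, values)]


# Made-up percentages, for illustration only.
for case, value, level in label_row("ringsum", [3.0, 7.5, 15.0, 4.9, 22.0]):
    print(f"{case}: {value}% ({level})")
```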

The following observations apply to all of the algorithm results and are listed here to reduce repetition:

DFFT
EXCHSUM
HALFSUM
RINGPIPE
RINGSUM
LOGTRANS (1)
LOGTRANS (2)
LOGTRANS (3)
SRTRANS (1)
SRTRANS (2)
SRTRANS (3)
SWTRANS (1)
SWTRANS (2)
SWTRANS (3)



Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 10:24:01 EDT.