PSTSWM Paragon Algorithm Comparison

Performance Studies using PSTSWM


Intel Paragon Algorithm Comparison

(distributed LT experiment I-A1)

Date/Person: May 13, 1998 / P. Worley
Platform: Intel Paragon XP/S 150 MP at Oak Ridge National Laboratory:
     1024 MP nodes (3 50-MHz iPSC/860 processors per node)
Environment: Paragon OSF/1 Release 1.0.4 Server 1.4 R1_4_5
     f77/Paragon Paragon Version R5.0.3
Code Version: 6.3
Compilation Options: if77 -O4 -Mnodepchk -Knoieee -Msafealloc
Math Library: KAI
Communication Library: MPI, NX
Partition: 1x8, 1x16, or 1x32
Results:

Distributed LT (1) (mpi)
Algorithm Comparison
                       T42L1     T21L2     T42L2     T85L2     T85L1     T85L4
                       P=32      P=16      P=8       P=32      P=16      P=8
  optimal algorithm    halfsum   halfsum   ringpipe  ringpipe  ringpipe  ringpipe
  (allreduce-min)/min  0.150     0.138     0.129     0.161     0.131     0.069
  (generic-min)/min    0.559     0.379     0.114     0.168     0.129     0.042

Distributed LT (1) (nx)
Algorithm Comparison
                       T42L1     T21L2     T42L2     T85L2     T85L1     T85L4
                       P=32      P=16      P=8       P=32      P=16      P=8
  optimal algorithm    halfsum   halfsum   ringpipe  ringpipe  ringpipe  ringpipe
  (generic-min)/min    0.342     0.241     0.133     0.220     0.163     0.054

Distributed LT (1) (combined)
Communication Library Comparisons
                       T42L1     T21L2     T42L2     T85L2     T85L1     T85L4
                       P=32      P=16      P=8       P=32      P=16      P=8
  optimal library      nx        nx        nx        nx        nx        nx
  (allreduce-min)/min  0.184     0.186     0.161     0.247     0.175     0.070
  (mpi-min)/min        0.030     0.042     0.028     0.075     0.039     0.001
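
The relative-difference rows in the tables above can be read as fractional slowdowns with respect to the fastest option. The following is a minimal sketch of that calculation, assuming "min" denotes the smallest measured time for a given problem size and partition. The function name `relative_excess` and the timing values are hypothetical and are not measurements from this study.

```python
def relative_excess(time, best):
    """Return (time - best) / best: the fractional slowdown relative to the best time."""
    if best <= 0:
        raise ValueError("best time must be positive")
    return (time - best) / best


# Hypothetical timings in seconds; NOT data from this study.
timings = {"halfsum": 2.00, "ringpipe": 2.10, "allreduce": 2.30}

# The "optimal algorithm" is taken here to be the one with the smallest time.
best_name = min(timings, key=timings.get)
best = timings[best_name]

# Fractional slowdown of every algorithm relative to the best one.
excess = {name: round(relative_excess(t, best), 3) for name, t in timings.items()}

for name, value in excess.items():
    print(f"({name}-min)/min = {value:.3f}")
```

Under this reading, a table entry of 0.150 would mean the named alternative took 15% longer than the best algorithm for that case.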

DISCUSSION

The Paragon processor grid was partitioned as 1x8, 1x16, and 1x32, matching the processor subsets on which the parallel algorithms would run in a two-dimensional data decomposition.

Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 10:29:30 EDT.