PSTSWM Paragon Algorithm Comparison

Performance Studies using PSTSWM

Intel Paragon Algorithm Comparison

(distributed LT experiment II-A1)

Date/Person: May 13, 1998 / P. Worley
Platform: Intel Paragon XP/S 150 MP at Oak Ridge National Laboratory:
     1024 MP nodes (3 50-MHz iPSC/860 processors per node)
Environment: Paragon OSF/1 Release 1.0.4 Server 1.4 R1_4_5
     f77/Paragon Paragon Version R5.0.3
Code Version: 6.3
Compilation Options: if77 -O4 -Mnodepchk -Knoieee -Msafealloc
Math Library: KAI
Communication Library: MPI, NX
Partition: 1x8, 1x16, or 1x32
Results:

Distributed LT (2) (mpi)
Algorithm Comparison
                        T42L1    T21L2    T42L2    T85L2    T85L1    T85L4
                        P=32     P=16     P=8      P=32     P=16     P=8
  optimal algorithm     halfsum  halfsum  halfsum  halfsum  halfsum  ringsum
  (allreduce-min)/min   0.150    0.138    0.053    0.137    0.062    0.040
  (generic-min)/min     0.559    0.379    0.039    0.144    0.060    0.014

Distributed LT (2) (nx)
Algorithm Comparison
                        T42L1    T21L2    T42L2    T85L2    T85L1    T85L4
                        P=32     P=16     P=8      P=32     P=16     P=8
  optimal algorithm     halfsum  halfsum  halfsum  halfsum  halfsum  halfsum
  (generic-min)/min     0.342    0.241    0.040    0.124    0.059    0.028

Distributed LT (2) (combined)
Communication Library Comparisons
                        T42L1    T21L2    T42L2    T85L2    T85L1    T85L4
                        P=32     P=16     P=8      P=32     P=16     P=8
  optimal library       nx       nx       nx       nx       nx       nx
  (allreduce-min)/min   0.184    0.186    0.066    0.149    0.070    0.043
  (mpi-min)/min         0.030    0.042    0.012    0.011    0.007    0.003
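
The row labels have the form (x-min)/min, that is, the fractional amount by which a given variant is slower than the fastest one. The following Python sketch illustrates that arithmetic and cross-checks the tables against each other. It rests on an assumption that the page does not state explicitly: "min" in the mpi table is taken to be the fastest MPI algorithm, and "min" in the combined table the fastest time over both libraries. The function names relative_excess and compose are hypothetical, introduced only for this illustration.

```python
def relative_excess(t, t_min):
    """Return (t - t_min) / t_min, the form used in the table row labels."""
    return (t - t_min) / t_min


def compose(excess_within_mpi, mpi_excess_over_best):
    """Chain two relative excesses that share a baseline.

    If allreduce = (1 + a) * mpi_min and mpi_min = (1 + b) * overall_min,
    then allreduce = (1 + a) * (1 + b) * overall_min.
    """
    return (1.0 + excess_within_mpi) * (1.0 + mpi_excess_over_best) - 1.0


# Values copied from the tables, in column order:
# (allreduce-min)/min from the mpi table,
# (mpi-min)/min and (allreduce-min)/min from the combined table.
allreduce_vs_mpi_min = [0.150, 0.138, 0.053, 0.137, 0.062, 0.040]
mpi_vs_overall_min = [0.030, 0.042, 0.012, 0.011, 0.007, 0.003]
reported_combined = [0.184, 0.186, 0.066, 0.149, 0.070, 0.043]

# Predicted combined-table entries under the assumption stated above.
predicted = [
    compose(a, b) for a, b in zip(allreduce_vs_mpi_min, mpi_vs_overall_min)
]

# Largest disagreement between prediction and the reported combined row.
max_gap = max(abs(p - r) for p, r in zip(predicted, reported_combined))

if __name__ == "__main__":
    for p, r in zip(predicted, reported_combined):
        print(f"predicted {p:.4f}   reported {r:.3f}")
```

Under that assumption the predicted and reported values agree to within the rounding of the three-decimal table entries, which suggests (but does not prove) that the combined table uses the fastest nx time as its baseline.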

DISCUSSION

The Paragon processor-grid partitions used (1x8, 1x16, and 1x32) match the processor subsets on which the parallel algorithms would run in a two-dimensional data decomposition.

Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 10:29:30 EDT.