PSTSWM Paragon Algorithm Comparison

Performance Studies using PSTSWM
Intel Paragon Algorithm Comparison
(transpose FFT experiment A1)

Date/Person: May 13, 1998 / P. Worley
Platform: Intel Paragon XP/S 150 MP at Oak Ridge National Laboratory:
     1024 MP nodes (3 50-MHz iPSC/860 processors per node)
Environment: Paragon OSF/1 Release 1.0.4 Server 1.4 R1_4_5
     f77/Paragon Paragon Version R5.0.3
Code Version: 6.3
Compilation Options: if77 -O4 -Mnodepchk -Knoieee -Msafealloc
Math Library: KAI
Communication Library: MPI, NX
Partition: 1x8, 1x16, or 1x32
Results:

Transpose FFT (mpi)
Algorithm Comparison
                      T10L16     T10L16     T21L8      T21L32     T21L16     T42L16
                      P=32       P=16       P=8        P=32       P=16       P=8
optimal algorithm     alltoallv  srtrans    swtrans    swtrans    swtrans    swtrans
(alltoallv-min)/min   0.000      0.004      0.100      0.149      0.096      0.077
(generic-min)/min     0.116      0.016      0.017      0.061      0.025      0.023

Transpose FFT (nx)
Algorithm Comparison
                      T10L16     T10L16     T21L8      T21L32     T21L16     T42L16
                      P=32       P=16       P=8        P=32       P=16       P=8
optimal algorithm     srtrans    swtrans    swtrans    swtrans    swtrans    swtrans
(generic-min)/min     0.000      0.025      0.009      0.067      0.025      0.020

Transpose FFT (combined)
Communication Library Comparisons
                      T10L16     T10L16     T21L8      T21L32     T21L16     T42L16
                      P=32       P=16       P=8        P=32       P=16       P=8
optimal library       nx         nx         nx         nx         nx         nx
(alltoallv-min)/min   0.166      0.162      0.125      0.250      0.155      0.080
(mpi-min)/min         0.166      0.157      0.023      0.088      0.054      0.002
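
The relative-difference rows in the combined table can be cross-checked against the MPI table. The sketch below assumes that "min" in each table is the smallest execution time among the alternatives compared in that table, so that overheads compose multiplicatively through the best MPI time. That reading is an assumption and is not stated in the report; the variable and function names are illustrative only.

```python
# Values copied from the tables above, in column order:
# T10L16/P=32, T10L16/P=16, T21L8/P=8, T21L32/P=32, T21L16/P=16, T42L16/P=8.
alltoallv_vs_best_mpi = [0.000, 0.004, 0.100, 0.149, 0.096, 0.077]      # MPI table
best_mpi_vs_best_overall = [0.166, 0.157, 0.023, 0.088, 0.054, 0.002]   # combined table
alltoallv_vs_best_overall = [0.166, 0.162, 0.125, 0.250, 0.155, 0.080]  # combined table


def relative_difference(t, t_min):
    """Return (t - t_min) / t_min, the metric reported in the tables."""
    return (t - t_min) / t_min


def compose(inner, outer):
    """Combine two relative differences that share an intermediate baseline.

    inner: overhead of A relative to B; outer: overhead of B relative to C.
    Returns the overhead of A relative to C.
    """
    return (1.0 + inner) * (1.0 + outer) - 1.0


predicted = [
    compose(a, m)
    for a, m in zip(alltoallv_vs_best_mpi, best_mpi_vs_best_overall)
]
max_gap = max(
    abs(p - r) for p, r in zip(predicted, alltoallv_vs_best_overall)
)

if __name__ == "__main__":
    for p, r in zip(predicted, alltoallv_vs_best_overall):
        print(f"predicted {p:.3f}  reported {r:.3f}")
    print(f"largest gap: {max_gap:.4f}")
```

Under this assumption, the composed values agree with the reported (alltoallv-min)/min row of the combined table to within 0.001, which is consistent with rounding of the three-digit inputs.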

DISCUSSION

The experiments used partitions of the Paragon processor grid (1x8, 1x16, and 1x32) that match the processor subsets on which the parallel algorithms would run in a two-dimensional data decomposition.
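
As an illustration of how a 1xP partition corresponds to one slice of a two-dimensional decomposition, the sketch below enumerates the processor subsets along one dimension of a logical grid. The grid shape (4x8), the row-major rank numbering, and the function name `row_subsets` are hypothetical and chosen for illustration; they are not taken from PSTSWM.

```python
def row_subsets(n_rows, n_cols):
    """Group the ranks of an n_rows x n_cols logical grid (row-major) by row.

    Each returned list is one 1 x n_cols subset: the processors that would
    cooperate in an operation distributed along the column dimension.
    """
    return [
        [r * n_cols + c for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Hypothetical 4x8 logical grid: four independent subsets of eight ranks each.
subsets = row_subsets(4, 8)

if __name__ == "__main__":
    for row, ranks in enumerate(subsets):
        print(f"row {row}: ranks {ranks}")
```

Each subset in this sketch has the same shape as the 1x8 partition used in the experiments. Timing a single 1xP partition therefore approximates one such subset, under the assumption that the subsets do not interfere with one another.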

Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 10:30:23 EDT.