PSTSWM SP3-375 Serial Performance

Performance Studies using PSTSWM

IBM SP3-375 Winterhawk II Serial Performance

Date/Person: December 11, 1999 / P. Worley
Platform: IBM SP3 at Oak Ridge National Laboratory (bobcat.ccs.ornl.gov):
     62 4-way Winterhawk II SMP nodes (375 MHz POWER3 with 8MB L2 cache)
Environment: AIX 4.3.3;   PSSP 3.1.1
Code Version: 6.6.4
Make Options: MACH=sp COMM=mpi PRECISION=8 PERF=n WORKSPACE=20000000
Compilation Options: mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qstrict
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=noauto -qnosave
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=noauto -qnosave -qstrict
    or mpxlf -O3 -qarch=auto -qtune=auto -qcache=auto -qhot
    or mpxlf -O4 -qarch=auto -qtune=auto -qcache=auto
Link Options: -bmaxdata:0x70000000
    or -bmaxdata:0x70000000 -qsmp=noauto
Number of steps: T42: 241 or 481
    T85: 49 or 97
    T170: 49 or 97
Notes: using PSTSWM Fortran routines for Fourier transforms and BLAS

Not Reentrant

MEASURED TIME PER TIMESTEP (SEC), with -qstrict

Problem  L1     L2     L3     L16
T42      0.013  0.026  0.041  0.285
T85      0.076  0.166  0.261  1.722
T170     0.526  1.115  1.726

MEASURED TIME PER TIMESTEP (SEC), without -qstrict

Problem  L1     L2     L3     L16
T42      0.010  0.021  0.032  0.240
T85      0.055  0.122  0.198  1.461
T170     0.381  0.848  1.360

MEASURED MFLOP/SEC RATES, with -qstrict

Problem  L1     L2     L3     L16
T42      314.9  312.4  305.8  231.9
T85      318.3  292.4  278.4  225.2
T170     290.8  274.5  265.9

MEASURED MFLOP/SEC RATES, without -qstrict

Problem  L1     L2     L3     L16
T42      401.1  399.1  385.3  275.9
T85      437.8  396.3  366.3  265.4
T170     401.5  361.0  337.5

Reentrant

MEASURED TIME PER TIMESTEP (SEC), with -qstrict

Problem  L1     L2     L3     L16
T42      0.013  0.027  0.042  0.295
T85      0.078  0.169  0.267  1.768
T170     0.535  1.127  1.760

MEASURED TIME PER TIMESTEP (SEC), without -qstrict

Problem  L1     L2     L3     L16
T42      0.010  0.021  0.033  0.251
T85      0.058  0.127  0.206  1.529
T170     0.392  0.887  1.405

MEASURED MFLOP/SEC RATES, with -qstrict

Problem  L1     L2     L3     L16
T42      310.3  304.3  295.4  224.1
T85      310.8  287.5  272.8  219.3
T170     285.9  271.5  260.8

MEASURED MFLOP/SEC RATES, without -qstrict

Problem  L1     L2     L3     L16
T42      396.1  384.5  376.8  263.4
T85      419.9  381.8  352.3  253.6
T170     390.8  344.9  326.7

More Aggressive Optimization

MEASURED TIME PER TIMESTEP (SEC), with -qhot, without -qstrict

Problem  L1     L2     L3     L16
T42      0.011  0.022  0.034  0.257
T85      0.058  0.129  0.208  1.555
T170     0.393  0.884  1.421

MEASURED TIME PER TIMESTEP (SEC), with -O4, without -qstrict

Problem  L1     L2     L3     L16
T42      0.010  0.021  0.032  0.242
T85      0.057  0.126  0.204  1.514
T170     0.379  0.854  1.368

MEASURED MFLOP/SEC RATES, with -qhot, without -qstrict

Problem  L1     L2     L3     L16
T42      386.9  372.4  364.7  257.5
T85      416.2  375.8  349.2  249.4
T170     388.9  346.0  323.1

MEASURED MFLOP/SEC RATES, with -O4, without -qstrict

Problem  L1     L2     L3     L16
T42      421.6  401.3  392.9  273.4
T85      426.6  383.5  356.2  256.1
T170     403.9  358.3  335.5

DISCUSSION

The "reentrant" compiler mpxlf_r is needed in order to link in the threaded version of MPI, and will be used in any production parallel runs. Experiments using mpxlf (not shown here) show no performance degradation from using mpxlf_r.
 
The reentrant compile options (using "mpxlf_r", "-qsmp=noauto", "-qnosave") are the first step in exploiting shared memory parallelism. These experiments measure the cost of making these options the default. The results indicate that the options reduce the measured MFlop/sec rates by less than 5%.
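
The "less than 5%" figure can be checked against the MFlop/sec tables above. The following Python sketch is not part of PSTSWM; the dictionary layout and the function name reentrant_slowdown are illustrative choices, and the rates are copied from the tables with the empty T170/L16 entry left out. It uses the MFlop/sec rates rather than the times per timestep because the times are reported to fewer significant digits.

```python
# Illustrative check only; not part of PSTSWM.
# MFlop/sec rates copied from the tables above, in column order L1, L2, L3, L16.
# The T170 rows have three entries because the L16 cell is empty.
NOT_REENTRANT = {
    "with -qstrict": {
        "T42": [314.9, 312.4, 305.8, 231.9],
        "T85": [318.3, 292.4, 278.4, 225.2],
        "T170": [290.8, 274.5, 265.9],
    },
    "without -qstrict": {
        "T42": [401.1, 399.1, 385.3, 275.9],
        "T85": [437.8, 396.3, 366.3, 265.4],
        "T170": [401.5, 361.0, 337.5],
    },
}
REENTRANT = {
    "with -qstrict": {
        "T42": [310.3, 304.3, 295.4, 224.1],
        "T85": [310.8, 287.5, 272.8, 219.3],
        "T170": [285.9, 271.5, 260.8],
    },
    "without -qstrict": {
        "T42": [396.1, 384.5, 376.8, 263.4],
        "T85": [419.9, 381.8, 352.3, 253.6],
        "T170": [390.8, 344.9, 326.7],
    },
}
LEVELS = ["L1", "L2", "L3", "L16"]


def reentrant_slowdown():
    """Map (strictness, problem, levels) to the percent drop in MFlop/sec."""
    drops = {}
    for strictness, problems in NOT_REENTRANT.items():
        for problem, base_rates in problems.items():
            new_rates = REENTRANT[strictness][problem]
            # zip stops at the shorter list, so T170 yields L1, L2, L3 only.
            for level, base, new in zip(LEVELS, base_rates, new_rates):
                drops[(strictness, problem, level)] = 100.0 * (base - new) / base
    return drops


drops = reentrant_slowdown()
worst_case = max(drops, key=drops.get)
print(f"largest drop: {drops[worst_case]:.1f}% for {worst_case}")
```

For these tables the largest drop is about 4.5% (T42, L16, without -qstrict), and every case is below 5%.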
 
The "-qstrict" compile option ensures that optimizations enabled by "-O3" do not change the semantics of the program. As can be seen above, "-qstrict" has a significant impact on performance: the measured MFlop/sec rates are roughly 16% to 38% higher without it. The accuracy of the simulation calculated by PSTSWM is not degraded by omitting "-qstrict". However, "-qstrict" may be needed to ensure reproducibility in the full atmospheric model CCM/MP-2D, so the performance with "-qstrict" may be more representative of what can safely be obtained on the ORNL SP3.
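
The cost of "-qstrict" can be expressed as a ratio of MFlop/sec rates. The following Python sketch, again not part of PSTSWM, does this for the reentrant build only, since that is the configuration described above as intended for production runs; the variable and function names are illustrative, and the rates are copied from the reentrant tables.

```python
# Illustrative check only; not part of PSTSWM.
# MFlop/sec rates for the reentrant build, copied from the tables above,
# in column order L1, L2, L3, L16 (the T170/L16 cell is empty).
WITH_QSTRICT = {
    "T42": [310.3, 304.3, 295.4, 224.1],
    "T85": [310.8, 287.5, 272.8, 219.3],
    "T170": [285.9, 271.5, 260.8],
}
WITHOUT_QSTRICT = {
    "T42": [396.1, 384.5, 376.8, 263.4],
    "T85": [419.9, 381.8, 352.3, 253.6],
    "T170": [390.8, 344.9, 326.7],
}


def qstrict_speedups():
    """Return {problem: [rate without -qstrict / rate with -qstrict, ...]}."""
    return {
        problem: [fast / slow for fast, slow in zip(WITHOUT_QSTRICT[problem], rates)]
        for problem, rates in WITH_QSTRICT.items()
    }


speedups = qstrict_speedups()
for problem, ratios in speedups.items():
    print(problem, " ".join(f"{r:.2f}" for r in ratios))

all_ratios = [r for ratios in speedups.values() for r in ratios]
print(f"range: {min(all_ratios):.2f} to {max(all_ratios):.2f}")
```

In this data the gain from dropping "-qstrict" is smallest for the L16 runs of each problem size.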
 
The last two experiments examine the advantage of applying higher-order loop transformations (-O3 -qhot) and more aggressive optimizations (-O4). Applying these options globally, as done here, gives a slight improvement in a few of the -O4 experiments but, in general, causes a slight degradation in performance compared with the -O3 experiments.
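
The comparison can be tallied case by case. The following Python sketch is not part of PSTSWM and rests on one assumption: that the relevant -O3 baseline is the non-reentrant build without "-qstrict", which matches the headings of the -qhot and -O4 tables. The list names and the helper compare are illustrative.

```python
# Illustrative check only; not part of PSTSWM.
# MFlop/sec rates copied from the tables above, flattened in the order
# T42 (L1, L2, L3, L16), T85 (L1, L2, L3, L16), T170 (L1, L2, L3).
# Assumption: the -O3 baseline is the non-reentrant build without -qstrict.
O3_BASELINE = [401.1, 399.1, 385.3, 275.9, 437.8, 396.3, 366.3, 265.4,
               401.5, 361.0, 337.5]
O3_QHOT = [386.9, 372.4, 364.7, 257.5, 416.2, 375.8, 349.2, 249.4,
           388.9, 346.0, 323.1]
O4 = [421.6, 401.3, 392.9, 273.4, 426.6, 383.5, 356.2, 256.1,
      403.9, 358.3, 335.5]


def compare(rates, baseline):
    """Return (number of cases faster than baseline, mean percent change)."""
    changes = [100.0 * (r - b) / b for r, b in zip(rates, baseline)]
    faster = sum(1 for c in changes if c > 0)
    return faster, sum(changes) / len(changes)


qhot_faster, qhot_mean = compare(O3_QHOT, O3_BASELINE)
o4_faster, o4_mean = compare(O4, O3_BASELINE)
print(f"-O3 -qhot: faster in {qhot_faster} of {len(O3_QHOT)} cases, "
      f"mean change {qhot_mean:+.1f}%")
print(f"-O4:       faster in {o4_faster} of {len(O4)} cases, "
      f"mean change {o4_mean:+.1f}%")
```

Under that assumption, -O4 is faster than -O3 in 4 of the 11 cases and -O3 -qhot in none, and the mean change is negative for both.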

Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 10:19:34 EDT.