PSTSWM SP3-375 Serial Performance

Performance Studies using PSTSWM

IBM SP3-375 Winterhawk II Serial Performance
(Using ESSL Math Library)

Date/Person: December 11, 1999 / P. Worley
Platform: IBM SP3 at Oak Ridge National Laboratory (bobcat.ccs.ornl.gov):
     62 4-way Winterhawk II SMP nodes (375 MHz POWER3 with 8MB L2 cache)
Environment: AIX 4.3.3;   PSSP 3.1.1
Code Version: 6.6.4
Make Options: MACH=sp COMM=mpi PRECISION=8 PERF=n WORKSPACE=20000000 MATH=essl
Compilation Options: mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qstrict
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=noauto -qnosave
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=noauto -qnosave -qstrict
    or mpxlf -O3 -qarch=auto -qtune=auto -qcache=auto -qhot
    or mpxlf -O4 -qarch=auto -qtune=auto -qcache=auto
Link Options: -bmaxdata:0x70000000
    or -bmaxdata:0x70000000 -qsmp=noauto
Number of steps: T42: 241 or 481
     T85: 49 or 97
     T170: 49 or 97
Notes: using ESSL library routines for Fourier transforms and BLAS (-lessl or -lessl_r)

Not Reentrant

with -qstrict

MEASURED TIME PER TIMESTEP (SEC)

Problem   L1      L2      L3      L16
T42       0.011   0.022   0.033   0.236
T85       0.067   0.146   0.230   1.505
T170      0.481   1.015   1.570

MEASURED MFLOP/SEC RATES

Problem   L1      L2      L3      L16
T42       386.5   382.1   373.1   280.2
T85       363.8   332.3   316.5   257.7
T170      318.2   301.6   292.4

without -qstrict

MEASURED TIME PER TIMESTEP (SEC)

Problem   L1      L2      L3      L16
T42       0.008   0.016   0.024   0.189
T85       0.046   0.105   0.168   1.219
T170      0.327   0.737   1.183

MEASURED MFLOP/SEC RATES

Problem   L1      L2      L3      L16
T42       531.7   523.1   509.7   349.2
T85       529.0   463.7   434.0   318.1
T170      467.9   415.0   388.1

Reentrant

with -qstrict

MEASURED TIME PER TIMESTEP (SEC)

Problem   L1      L2      L3      L16
T42       0.011   0.022   0.034   0.241
T85       0.069   0.148   0.234   1.522
T170      0.490   1.025   1.590

MEASURED MFLOP/SEC RATES

Problem   L1      L2      L3      L16
T42       385.9   372.0   360.3   274.0
T85       351.7   327.1   310.5   254.9
T170      312.3   298.5   288.6

without -qstrict

MEASURED TIME PER TIMESTEP (SEC)

Problem   L1      L2      L3      L16
T42       0.008   0.017   0.026   0.197
T85       0.048   0.107   0.173   1.260
T170      0.341   0.754   1.210

MEASURED MFLOP/SEC RATES

Problem   L1      L2      L3      L16
T42       530.5   499.0   484.5   335.7
T85       507.4   451.4   419.9   307.9
T170      448.2   405.7   379.4

More Aggressive Optimization

with -qhot, without -qstrict

MEASURED TIME PER TIMESTEP (SEC)

Problem   L1      L2      L3      L16
T42       0.008   0.017   0.026   0.198
T85       0.048   0.107   0.173   1.263
T170      0.342   0.755   1.215

MEASURED MFLOP/SEC RATES

Problem   L1      L2      L3      L16
T42       530.7   495.6   478.0   333.2
T85       509.9   454.4   420.1   306.9
T170      446.9   405.3   377.8

with -O4, without -qstrict

MEASURED TIME PER TIMESTEP (SEC)

Problem   L1      L2      L3      L16
T42       0.008   0.016   0.026   0.198
T85       0.047   0.106   0.171   1.257
T170      0.339   0.757   1.211

MEASURED MFLOP/SEC RATES

Problem   L1      L2      L3      L16
T42       532.6   502.9   481.3   334.1
T85       516.4   458.8   425.3   308.5
T170      451.6   404.4   379.2

DISCUSSION

The "reentrant" compiler mpxlf_r is needed to link in the threaded version of MPI, and it will be used in any production parallel runs. Experiments using mpxlf (not shown here) show no performance degradation from using mpxlf_r.
 
The reentrant compile options (using "mpxlf_r", "-qsmp=noauto", "-qnosave") are the first step in exploiting shared memory parallelism. These experiments measure the cost of making these options the default. The results indicate that the impact on the measured Mflop/sec rates is less than 5%.
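The 5% figure can be checked directly against the tables. The following is a minimal Python sketch, not part of PSTSWM; the dictionary layout and the helper names (slowdown_percent, worst_slowdown) are illustrative assumptions, and the numbers are the Mflop/sec rates copied from the Not Reentrant and Reentrant tables above.

```python
# Measured Mflop/sec rates from the tables above, in column order L1, L2, L3, L16.
# The T170 row has no L16 entry, so it holds three values.
NOT_REENTRANT = {
    "with -qstrict": {
        "T42": [386.5, 382.1, 373.1, 280.2],
        "T85": [363.8, 332.3, 316.5, 257.7],
        "T170": [318.2, 301.6, 292.4],
    },
    "without -qstrict": {
        "T42": [531.7, 523.1, 509.7, 349.2],
        "T85": [529.0, 463.7, 434.0, 318.1],
        "T170": [467.9, 415.0, 388.1],
    },
}
REENTRANT = {
    "with -qstrict": {
        "T42": [385.9, 372.0, 360.3, 274.0],
        "T85": [351.7, 327.1, 310.5, 254.9],
        "T170": [312.3, 298.5, 288.6],
    },
    "without -qstrict": {
        "T42": [530.5, 499.0, 484.5, 335.7],
        "T85": [507.4, 451.4, 419.9, 307.9],
        "T170": [448.2, 405.7, 379.4],
    },
}


def slowdown_percent(base, other):
    """Percentage drop in Mflop/sec going from `base` to `other`."""
    return 100.0 * (base - other) / base


def worst_slowdown(base_table, other_table):
    """Largest percentage drop over all problem sizes and columns."""
    return max(
        slowdown_percent(b, o)
        for problem in base_table
        for b, o in zip(base_table[problem], other_table[problem])
    )


for variant in NOT_REENTRANT:
    worst = worst_slowdown(NOT_REENTRANT[variant], REENTRANT[variant])
    print(f"{variant}: worst-case reentrant slowdown {worst:.1f}%")
```

For both the "-qstrict" and non-"-qstrict" builds the worst-case drop stays below 5%. The Mflop/sec rates are used rather than the times per timestep because the times are reported to only three decimal places, which makes ratios of the smallest entries unreliable.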
 
The "-qstrict" compile option ensures that optimizations enabled by "-O3" do not change the semantics of the program. As can be seen above, the "-qstrict" option has a significant impact on performance: omitting it raises the measured Mflop/sec rates by about 21% to 47%, depending on problem size. Omitting "-qstrict" does not degrade the accuracy of the simulation computed by PSTSWM. However, "-qstrict" may be needed to ensure reproducibility in the full atmospheric model CCM/MP-2D. The performance with "-qstrict" may therefore be more representative of what can safely be obtained when using the ORNL SP3.
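To put a number on the cost of "-qstrict", the sketch below computes the percentage gain in Mflop/sec from dropping the option, using only the Not Reentrant tables above. It is an illustrative calculation, not part of PSTSWM; the variable and function names are assumptions.

```python
# Not Reentrant Mflop/sec rates from the tables above (columns L1, L2, L3, L16;
# the T170 row has no L16 entry).
WITH_STRICT = {
    "T42": [386.5, 382.1, 373.1, 280.2],
    "T85": [363.8, 332.3, 316.5, 257.7],
    "T170": [318.2, 301.6, 292.4],
}
WITHOUT_STRICT = {
    "T42": [531.7, 523.1, 509.7, 349.2],
    "T85": [529.0, 463.7, 434.0, 318.1],
    "T170": [467.9, 415.0, 388.1],
}


def gains_percent(slow_table, fast_table):
    """Percentage increase in Mflop/sec for every table entry."""
    return [
        100.0 * (fast / slow - 1.0)
        for problem in slow_table
        for slow, fast in zip(slow_table[problem], fast_table[problem])
    ]


gains = gains_percent(WITH_STRICT, WITHOUT_STRICT)
print(f"gain from omitting -qstrict: {min(gains):.1f}% to {max(gains):.1f}%")
```

For the Not Reentrant build the gain ranges from roughly 23% to 47%; the Reentrant tables can be substituted in the same way.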
 
The last two experiments examine the advantage of applying higher-order loop transformations (-O3 -qhot) and more aggressive optimization (-O4). Applied globally, as done here, both options show a slight degradation in performance compared to the -O3 experiments.
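The comparison can be sketched in Python as follows. The text does not say which -O3 run serves as the baseline; this sketch assumes the Not Reentrant build without "-qstrict", and that choice, like the helper names, is an assumption rather than something stated on this page.

```python
# Mflop/sec rates from the tables above (columns L1, L2, L3, L16;
# the T170 row has no L16 entry).
# ASSUMPTION: the -O3 baseline is the Not Reentrant build without -qstrict.
O3_BASELINE = {
    "T42": [531.7, 523.1, 509.7, 349.2],
    "T85": [529.0, 463.7, 434.0, 318.1],
    "T170": [467.9, 415.0, 388.1],
}
AGGRESSIVE = {
    "-O3 -qhot": {
        "T42": [530.7, 495.6, 478.0, 333.2],
        "T85": [509.9, 454.4, 420.1, 306.9],
        "T170": [446.9, 405.3, 377.8],
    },
    "-O4": {
        "T42": [532.6, 502.9, 481.3, 334.1],
        "T85": [516.4, 458.8, 425.3, 308.5],
        "T170": [451.6, 404.4, 379.2],
    },
}


def mean_change_percent(base_table, other_table):
    """Mean percentage change in Mflop/sec relative to the baseline."""
    changes = [
        100.0 * (o - b) / b
        for problem in base_table
        for b, o in zip(base_table[problem], other_table[problem])
    ]
    return sum(changes) / len(changes)


for option, table in AGGRESSIVE.items():
    change = mean_change_percent(O3_BASELINE, table)
    print(f"{option}: mean change vs -O3 baseline {change:+.1f}%")
```

A negative mean change indicates a slowdown relative to the assumed baseline. Choosing the Reentrant -O3 tables as the baseline instead would give smaller differences.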

Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 10:19:33 EDT.