PSTSWM SP3-200 Serial Performance

Performance Studies using PSTSWM


IBM SP3-200 Winterhawk I Serial Performance

(Using ESSL Math Library)

Date/Person: October 12, 1999 / P. Worley
Platform: IBM SP3 at Oak Ridge National Laboratory (morgan.ccs.ornl.gov):
     62 2-way Winterhawk I SMP nodes (200 MHz POWER3 with 4MB L2 cache, equivalent to RS/6000 Model 260)
Environment: AIX 4.3.2;   PSSP 3.1
Code Version: 6.4.4
Make Options: MACH=sp COMM=mpi PRECISION=8 PERF=n WORKSPACE=20000000 MATH=essl
Compilation Options: mpxlf -O3 -qarch=auto -qtune=auto -qcache=auto
    or mpxlf -O3 -qarch=auto -qtune=auto -qcache=auto -qstrict
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=noauto -qnosave
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=noauto -qnosave -qstrict
Link Options: -bmaxdata:0x70000000
    or -bmaxdata:0x70000000 -qsmp=noauto
Number of steps: T42: 241 or 481
    T85: 49 or 97
    T170: 49 or 97
Notes: using ESSL library routines for Fourier transforms and BLAS (-lessl or -lessl_r)

Not Reentrant

MEASURED TIME PER TIMESTEP (SEC)

with -qstrict
Problem  L1     L2     L3     L16
T42      0.022  0.045  0.071  0.424
T85      0.143  0.296  0.451  2.568
T170     0.957  1.948  2.954

without -qstrict
Problem  L1     L2     L3     L16
T42      0.016  0.034  0.053  0.324
T85      0.098  0.204  0.310  1.919
T170     0.622  1.328  2.006

MEASURED MFLOP/SEC RATES

with -qstrict
Problem  L1     L2     L3     L16
T42      189.8  183.0  175.2  155.8
T85      169.8  163.9  161.2  151.0
T170     159.9  157.1  155.4

without -qstrict
Problem  L1     L2     L3     L16
T42      251.5  240.8  233.5  204.0
T85      246.2  237.1  234.4  202.1
T170     246.1  230.5  228.8

Reentrant

MEASURED TIME PER TIMESTEP (SEC)

with -qstrict
Problem  L1     L2     L3     L16
T42      0.023  0.049  0.074  0.441
T85      0.149  0.303  0.463  2.665
T170     0.985  2.002  3.031

without -qstrict
Problem  L1     L2     L3     L16
T42      0.016  0.035  0.055  0.348
T85      0.102  0.211  0.327  1.996
T170     0.635  1.328  2.043

MEASURED MFLOP/SEC RATES

with -qstrict
Problem  L1     L2     L3     L16
T42      177.5  170.2  167.3  149.9
T85      162.7  159.9  157.2  145.5
T170     155.4  152.9  151.5

without -qstrict
Problem  L1     L2     L3     L16
T42      251.5  233.6  226.6  190.0
T85      238.1  229.9  222.3  194.3
T170     241.1  230.4  224.7
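
The time and rate tables can be cross-checked against each other. The sketch below assumes that each reported Mflop/sec rate is a fixed operation count divided by the measured time per timestep; the page does not say how the rates were computed, so this is an assumption. Under it, time multiplied by rate should give roughly the same Mflop per timestep whether or not "-qstrict" is used. The values are the L1 column of the reentrant tables above; the function name is illustrative only.

```python
# L1 column of the reentrant tables: seconds per timestep and Mflop/sec.
TIME_STRICT = {"T42": 0.023, "T85": 0.149, "T170": 0.985}
RATE_STRICT = {"T42": 177.5, "T85": 162.7, "T170": 155.4}
TIME_NOSTRICT = {"T42": 0.016, "T85": 0.102, "T170": 0.635}
RATE_NOSTRICT = {"T42": 251.5, "T85": 238.1, "T170": 241.1}


def mflop_per_step(seconds, mflop_per_sec):
    """Implied work per timestep, assuming rate = operations / time."""
    return seconds * mflop_per_sec


work_strict = {}
work_nostrict = {}
for problem in TIME_STRICT:
    work_strict[problem] = mflop_per_step(TIME_STRICT[problem], RATE_STRICT[problem])
    work_nostrict[problem] = mflop_per_step(TIME_NOSTRICT[problem], RATE_NOSTRICT[problem])
    print(f"{problem}: {work_strict[problem]:.2f} vs {work_nostrict[problem]:.2f} Mflop per step")
```

The two estimates need not match exactly, because the tabulated times are rounded to three decimal places, which matters most for the short T42 timesteps.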

DISCUSSION

The reentrant compile options ("mpxlf_r", "-qsmp=noauto", "-qnosave") are the first step in exploiting shared-memory parallelism. These experiments measure the cost of making these options the default. The results indicate that the impact is less than 10%: the largest drop in the measured Mflop/sec rates is about 7% (T42, L2, with "-qstrict").
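
The cost of the reentrant options can be read off the Mflop/sec tables entry by entry. The following is a minimal sketch of that comparison, using the rates from the tables above; the helper name and the dictionary layout are illustrative choices, not part of PSTSWM.

```python
# Mflop/sec rates from the tables above, keyed by (build, problem).
# Columns are L1, L2, L3, L16; T170 has no L16 measurement.
COLUMNS = ["L1", "L2", "L3", "L16"]
NOT_REENTRANT = {
    ("with -qstrict", "T42"): [189.8, 183.0, 175.2, 155.8],
    ("with -qstrict", "T85"): [169.8, 163.9, 161.2, 151.0],
    ("with -qstrict", "T170"): [159.9, 157.1, 155.4],
    ("without -qstrict", "T42"): [251.5, 240.8, 233.5, 204.0],
    ("without -qstrict", "T85"): [246.2, 237.1, 234.4, 202.1],
    ("without -qstrict", "T170"): [246.1, 230.5, 228.8],
}
REENTRANT = {
    ("with -qstrict", "T42"): [177.5, 170.2, 167.3, 149.9],
    ("with -qstrict", "T85"): [162.7, 159.9, 157.2, 145.5],
    ("with -qstrict", "T170"): [155.4, 152.9, 151.5],
    ("without -qstrict", "T42"): [251.5, 233.6, 226.6, 190.0],
    ("without -qstrict", "T85"): [238.1, 229.9, 222.3, 194.3],
    ("without -qstrict", "T170"): [241.1, 230.4, 224.7],
}


def percent_drop(base_rate, new_rate):
    """Relative loss in Mflop/sec when moving from base_rate to new_rate."""
    return 100.0 * (base_rate - new_rate) / base_rate


drops = {}
for key, base_rates in NOT_REENTRANT.items():
    # zip stops at the shorter list, so the missing T170 L16 entry is skipped.
    for column, base, new in zip(COLUMNS, base_rates, REENTRANT[key]):
        drops[key + (column,)] = percent_drop(base, new)

worst_case = max(drops, key=drops.get)
print(f"largest drop: {drops[worst_case]:.1f}% at {worst_case}")
print(f"mean drop:    {sum(drops.values()) / len(drops):.1f}%")
```

Comparing rates rather than times avoids most of the rounding error in the three-decimal timings for the smallest problems.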
 
The "-qstrict" compile option ensures that optimizations enabled by "-O3" do not change the semantics of the program. As can be seen above, the "-qstrict" option has a significant impact on performance: omitting it raises the measured Mflop/sec rates by roughly 25% to 55%, depending on the problem. Omitting "-qstrict" does not degrade the accuracy of the simulation calculated by PSTSWM. However, "-qstrict" may be needed to ensure reproducibility in the full atmospheric model CCM/MP-2D, and it is currently required when using the reentrant compile options with CCM/MP-2D. (The code aborts otherwise.) The performance measured with "-qstrict" may therefore be more representative of what can safely be obtained on the ORNL SP3.
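
The size of the "-qstrict" penalty can be estimated directly from the timing tables. This sketch uses the non-reentrant times per timestep from the tables above and reports the ratio of the time with "-qstrict" to the time without it; the variable and function names are illustrative only.

```python
# Non-reentrant time per timestep (seconds) from the tables above.
# Columns are L1, L2, L3, L16; T170 has no L16 measurement.
COLUMNS = ["L1", "L2", "L3", "L16"]
TIME_WITH_QSTRICT = {
    "T42": [0.022, 0.045, 0.071, 0.424],
    "T85": [0.143, 0.296, 0.451, 2.568],
    "T170": [0.957, 1.948, 2.954],
}
TIME_WITHOUT_QSTRICT = {
    "T42": [0.016, 0.034, 0.053, 0.324],
    "T85": [0.098, 0.204, 0.310, 1.919],
    "T170": [0.622, 1.328, 2.006],
}


def speedup(time_with, time_without):
    """How many times faster a timestep runs when -qstrict is omitted."""
    return time_with / time_without


speedups = {}
for problem, strict_times in TIME_WITH_QSTRICT.items():
    fast_times = TIME_WITHOUT_QSTRICT[problem]
    for column, t_with, t_without in zip(COLUMNS, strict_times, fast_times):
        speedups[(problem, column)] = speedup(t_with, t_without)

for (problem, column), value in speedups.items():
    print(f"{problem} {column}: {value:.2f}x")

smallest = min(speedups.values())
largest = max(speedups.values())
print(f"range: {smallest:.2f}x to {largest:.2f}x")
```

The T42 ratios are the least reliable, since those timings carry only two significant digits.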
 
Our first experiments also included the "-qhot" option, which enables "higher order transformations". This degraded PSTSWM performance slightly and gave bad results when used with the full atmospheric model (with or without the reentrant compile options).

Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 10:16:45 EDT.