PSTSWM SP3-200 Serial Performance

Performance Studies using PSTSWM

IBM SP3-200 Winterhawk I Serial Performance

Date/Person: October 12, 1999 / P. Worley
Platform: IBM SP3 at Oak Ridge National Laboratory (morgan.ccs.anl.gov):
     62 2-way Winterhawk I SMP nodes (200 MHz POWER3 with 4MB L2 cache, equivalent to RS/6000 Model 260)
Environment: AIX 4.3.2;   PSSP 3.1
Code Version: 6.4.4
Make Options: MACH=sp COMM=mpi PRECISION=8 PERF=n WORKSPACE=20000000
Compilation Options: mpxlf -O3 -qarch=auto -qtune=auto -qcache=auto
    or mpxlf -O3 -qarch=auto -qtune=auto -qcache=auto -qstrict
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qnosave -qsmp=noauto
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=noauto -qnosave -qstrict
Link Options: -bmaxdata:0x70000000
    or -bmaxdata:0x70000000 -qsmp=noauto
Number of steps: T42: 241 or 481
    T85: 49 or 97
    T170: 49 or 97
Notes: using PSTSWM Fortran routines for Fourier transforms and BLAS

Not Reentrant

Compiled with -qstrict

MEASURED TIME PER TIMESTEP (SEC)

Problem       L1       L2       L3      L16
T42        0.027    0.056    0.087    0.522
T85        0.164    0.338    0.520    2.957
T170       1.058    2.157    3.269

MEASURED MFLOP/SEC RATES

Problem       L1       L2       L3      L16
T42        151.6    148.6    142.3    126.5
T85        147.9    143.3    139.8    131.1
T170       144.6    141.9    140.4

Compiled without -qstrict

MEASURED TIME PER TIMESTEP (SEC)

Problem       L1       L2       L3      L16
T42        0.022    0.044    0.068    0.436
T85        0.116    0.243    0.379    2.304
T170       0.705    1.501    2.314

MEASURED MFLOP/SEC RATES

Problem       L1       L2       L3      L16
T42        192.0    189.4    183.5    151.4
T85        208.2    199.3    191.7    168.3
T170       217.1    203.9    198.4

Reentrant

Compiled with -qstrict

MEASURED TIME PER TIMESTEP (SEC)

Problem       L1       L2       L3      L16
T42        0.028    0.058    0.090    0.544
T85        0.169    0.348    0.532    3.046
T170       1.086    2.209    3.347

MEASURED MFLOP/SEC RATES

Problem       L1       L2       L3      L16
T42        148.2    141.3    137.2    121.6
T85        143.5    139.1    136.7    127.3
T170       140.9    138.5    137.2

Compiled without -qstrict

MEASURED TIME PER TIMESTEP (SEC)

Problem       L1       L2       L3      L16
T42        0.022    0.045    0.070    0.441
T85        0.121    0.252    0.392    2.388
T170       0.731    1.539    2.346

MEASURED MFLOP/SEC RATES

Problem       L1       L2       L3      L16
T42        189.3    182.4    175.8    149.7
T85        200.8    192.3    185.7    162.4
T170       209.4    198.9    195.6
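
As a consistency check on the reentrant tables above, multiplying each time per timestep by the matching Mflop/sec rate gives an implied amount of work (Mflop) per timestep. The sketch below assumes the reported rates are derived from a fixed operation count for each problem size, which this page does not state explicitly. Under that assumption the two builds should agree to within the rounding of the tabulated values. The names TIME_SEC, RATE_MFLOPS, mflop_per_step, and max_relative_gap are illustrative and are not part of PSTSWM.

```python
# Reentrant results transcribed from the tables above.
# Each row lists values for L1, L2, L3, L16; T170 has no L16 measurement.
TIME_SEC = {
    "strict": {
        "T42": [0.028, 0.058, 0.090, 0.544],
        "T85": [0.169, 0.348, 0.532, 3.046],
        "T170": [1.086, 2.209, 3.347],
    },
    "nostrict": {
        "T42": [0.022, 0.045, 0.070, 0.441],
        "T85": [0.121, 0.252, 0.392, 2.388],
        "T170": [0.731, 1.539, 2.346],
    },
}

RATE_MFLOPS = {
    "strict": {
        "T42": [148.2, 141.3, 137.2, 121.6],
        "T85": [143.5, 139.1, 136.7, 127.3],
        "T170": [140.9, 138.5, 137.2],
    },
    "nostrict": {
        "T42": [189.3, 182.4, 175.8, 149.7],
        "T85": [200.8, 192.3, 185.7, 162.4],
        "T170": [209.4, 198.9, 195.6],
    },
}


def mflop_per_step(build):
    """Implied work per timestep (Mflop) = seconds per step * Mflop per second."""
    return {
        problem: [t * r for t, r in zip(times, RATE_MFLOPS[build][problem])]
        for problem, times in TIME_SEC[build].items()
    }


def max_relative_gap():
    """Largest relative disagreement between the two builds' implied work."""
    strict = mflop_per_step("strict")
    nostrict = mflop_per_step("nostrict")
    return max(
        abs(a - b) / a
        for problem in strict
        for a, b in zip(strict[problem], nostrict[problem])
    )


if __name__ == "__main__":
    for problem, work in mflop_per_step("strict").items():
        print(problem, " ".join(f"{w:8.1f}" for w in work))
    print(f"largest gap between builds: {100 * max_relative_gap():.2f}%")
```

The small gap suggests that the time and rate tables describe the same runs, so either table can be used to compare builds.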

DISCUSSION

The reentrant compile options ("mpxlf_r", "-qsmp=noauto", "-qnosave") are the first step in exploiting shared memory parallelism. These experiments measure the cost of making these options the default. The results indicate that the cost is less than 5%, in both time per timestep and Mflop/sec rate, with or without "-qstrict".
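
The overhead figure can be recomputed from the time-per-timestep tables. The following is a minimal sketch, not part of PSTSWM; the names TIMES and reentrant_overhead_percent are illustrative, and the values are transcribed from the tables above.

```python
# Time per timestep (seconds), transcribed from the tables above.
# Key: (build mode, "strict" or "nostrict") -> problem -> times for L1, L2, L3, L16.
# T170 has no L16 measurement, so its rows hold three values.
TIMES = {
    ("not reentrant", "strict"): {
        "T42": [0.027, 0.056, 0.087, 0.522],
        "T85": [0.164, 0.338, 0.520, 2.957],
        "T170": [1.058, 2.157, 3.269],
    },
    ("not reentrant", "nostrict"): {
        "T42": [0.022, 0.044, 0.068, 0.436],
        "T85": [0.116, 0.243, 0.379, 2.304],
        "T170": [0.705, 1.501, 2.314],
    },
    ("reentrant", "strict"): {
        "T42": [0.028, 0.058, 0.090, 0.544],
        "T85": [0.169, 0.348, 0.532, 3.046],
        "T170": [1.086, 2.209, 3.347],
    },
    ("reentrant", "nostrict"): {
        "T42": [0.022, 0.045, 0.070, 0.441],
        "T85": [0.121, 0.252, 0.392, 2.388],
        "T170": [0.731, 1.539, 2.346],
    },
}


def reentrant_overhead_percent(strict):
    """Percent increase in time per timestep caused by the reentrant options."""
    base = TIMES[("not reentrant", strict)]
    reentrant = TIMES[("reentrant", strict)]
    return {
        problem: [100.0 * (r - b) / b for b, r in zip(row, reentrant[problem])]
        for problem, row in base.items()
    }


if __name__ == "__main__":
    for strict in ("strict", "nostrict"):
        for problem, row in reentrant_overhead_percent(strict).items():
            print(strict, problem, " ".join(f"{v:4.1f}%" for v in row))
```

Because the times are tabulated to three decimal places, the smallest entries (T42 L1) carry a rounding uncertainty of a few percent, so individual percentages there are only approximate.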
 
The "-qstrict" compile option ensures that the optimizations enabled by "-O3" do not change the semantics of the program. As the tables above show, "-qstrict" has a significant impact on performance: Mflop/sec rates are roughly 20% to 50% higher without it. Omitting "-qstrict" does not degrade the accuracy of the simulation calculated by PSTSWM. However, "-qstrict" may be needed to ensure reproducibility in the full atmospheric model CCM/MP-2D, and it is currently required when using the reentrant compile options with CCM/MP-2D (the code aborts otherwise). The performance measured with "-qstrict" may therefore be more representative of what can safely be obtained on the ORNL SP3.
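
The size of the "-qstrict" effect can be read off the Mflop/sec tables as a ratio of rates. The following is a minimal sketch, not part of PSTSWM; the names RATES and nostrict_speedup are illustrative, and the values are transcribed from the tables above.

```python
# Mflop/sec rates transcribed from the tables above.
# Key: (build mode, "strict" or "nostrict") -> problem -> rates for L1, L2, L3, L16.
# T170 has no L16 measurement, so its rows hold three values.
RATES = {
    ("not reentrant", "strict"): {
        "T42": [151.6, 148.6, 142.3, 126.5],
        "T85": [147.9, 143.3, 139.8, 131.1],
        "T170": [144.6, 141.9, 140.4],
    },
    ("not reentrant", "nostrict"): {
        "T42": [192.0, 189.4, 183.5, 151.4],
        "T85": [208.2, 199.3, 191.7, 168.3],
        "T170": [217.1, 203.9, 198.4],
    },
    ("reentrant", "strict"): {
        "T42": [148.2, 141.3, 137.2, 121.6],
        "T85": [143.5, 139.1, 136.7, 127.3],
        "T170": [140.9, 138.5, 137.2],
    },
    ("reentrant", "nostrict"): {
        "T42": [189.3, 182.4, 175.8, 149.7],
        "T85": [200.8, 192.3, 185.7, 162.4],
        "T170": [209.4, 198.9, 195.6],
    },
}


def nostrict_speedup(mode):
    """Ratio of the rate without -qstrict to the rate with it, per problem."""
    strict = RATES[(mode, "strict")]
    nostrict = RATES[(mode, "nostrict")]
    return {
        problem: [fast / slow for slow, fast in zip(row, nostrict[problem])]
        for problem, row in strict.items()
    }


if __name__ == "__main__":
    for mode in ("not reentrant", "reentrant"):
        for problem, ratios in nostrict_speedup(mode).items():
            print(mode, problem, " ".join(f"{r:.2f}" for r in ratios))
```

In these tables the ratio is smallest for T42 L16 and largest for T170 L1, in both the reentrant and non-reentrant builds.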
 
Our first experiments also included the "-qhot" option, which enables "higher order transformations". This option degraded PSTSWM performance slightly and gave bad results when used with the full atmospheric model (with or without the reentrant compile options).

Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 10:16:46 EDT.