PSTSWM POWER4 Serial Performance

Performance Studies using PSTSWM

IBM POWER4 Serial Performance (Using ESSL Math Library)

Date/Person: January 10, 2002 / P. Worley
Platform: IBM p690 system at Oak Ridge National Laboratory (cheetah.ccs.ornl.gov):
     one 32-way p690 SMP node (1.3 GHz POWER4, four 8-way Multichip Modules)
Environment: AIX 5.1
Code Version: 6.8.2
Make Options: MACH=sp COMM=serial PRECISION=8 PERF=n WORKSPACE=22000000 MATH=essl
Compilation Options: mpxlf -O3 -qarch=auto -qtune=auto -qcache=auto
    or mpxlf -O3 -qarch=auto -qtune=auto -qcache=auto -qstrict
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=noauto -qnosave
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=noauto -qnosave -qstrict
    or mpxlf -O3 -qarch=auto -qtune=auto -qcache=auto -qhot
    or mpxlf -O4 -qarch=auto -qtune=auto -qcache=auto
Link Options: -bmaxdata:0x70000000
    or -bmaxdata:0x70000000 -qsmp=noauto
Number of steps: T5, T10, T21, T42: 241 or 481
    T85: 49 or 97
    T170: 49 or 97
Notes: using ESSL library routines for Fourier transforms and BLAS

Not Reentrant

with -qstrict (first table of each pair)

without -qstrict (second table of each pair)

MEASURED TIME PER TIMESTEP (SEC), with -qstrict

Problem   L1        L2        L3        L16
T5        0.00004   0.00007   0.00010   0.00048
T10       0.00014   0.00027   0.00040   0.00233
T21       0.00070   0.00141   0.00216   0.01554
T42       0.00423   0.00906   0.01414   0.08887
T85       0.02906   0.05995   0.08980   0.55784
T170      0.19702   0.41469   0.66038

MEASURED TIME PER TIMESTEP (SEC), without -qstrict

Problem   L1        L2        L3        L16
T5        0.00003   0.00006   0.00009   0.00041
T10       0.00012   0.00022   0.00032   0.00191
T21       0.00051   0.00105   0.00160   0.01273
T42       0.00293   0.00658   0.01035   0.07010
T85       0.01902   0.04013   0.06054   0.45013
T170      0.12467   0.30319   0.49961

MEASURED MFLOP/SEC RATES, with -qstrict

Problem   L1       L2       L3       L16
T5        774.9    897.4    942.8    1043.4
T10       1043.7   1081.8   1102.3   1018.9
T21       1092.5   1081.9   1062.4   787.6
T42       954.8    891.7    857.6    727.4
T85       818.8    793.8    794.8    682.4
T170      763.9    725.9    683.7

MEASURED MFLOP/SEC RATES, without -qstrict

Problem   L1       L2       L3       L16
T5        901.3    1032.2   1101.4   1233.5
T10       1281.8   1329.6   1369.2   1242.6
T21       1493.6   1453.2   1436.5   961.7
T42       1380.7   1228.7   1171.7   922.3
T85       1250.7   1185.7   1179.0   845.7
T170      1207.3   992.8    903.8
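
As a cross-check on the two non-reentrant tables above, the time per timestep and the MFLOP/sec rate can be multiplied to estimate the floating-point work per timestep. The sketch below assumes that the reported rate is simply the operation count divided by the measured time; the page does not state how the rates were computed, so this is an assumption. Values are copied from the L1 column, and the variable and function names are illustrative only. The smaller problems are left out because their times carry only one or two significant digits.

```python
# Values copied from the "Not Reentrant" tables, L1 column.
# Assumption (not stated in the source): rate = flop count / measured time.
time_strict = {"T42": 0.00423, "T85": 0.02906, "T170": 0.19702}
rate_strict = {"T42": 954.8, "T85": 818.8, "T170": 763.9}
time_nostrict = {"T42": 0.00293, "T85": 0.01902, "T170": 0.12467}
rate_nostrict = {"T42": 1380.7, "T85": 1250.7, "T170": 1207.3}


def mflop_per_step(seconds, mflops):
    """Implied work per timestep in MFLOP (seconds * MFLOP/sec)."""
    return {p: seconds[p] * mflops[p] for p in seconds}


work_strict = mflop_per_step(time_strict, rate_strict)
work_nostrict = mflop_per_step(time_nostrict, rate_nostrict)

for p in work_strict:
    print(f"{p}: {work_strict[p]:.2f} vs {work_nostrict[p]:.2f} MFLOP per step")
```

Under that assumption, the two builds imply nearly the same work per timestep at each problem size, which is what one would expect if the compile option changes the speed of the run but not the operation count used to compute the rate.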

Threadsafe and OpenMP Ready

with -qstrict (first table of each pair)

without -qstrict (second table of each pair)

MEASURED TIME PER TIMESTEP (SEC), with -qstrict

Problem   L1        L2        L3        L16
T5        0.00008   0.00011   0.00014   0.00051
T10       0.00018   0.00031   0.00044   0.00239
T21       0.00073   0.00145   0.00218   0.01550
T42       0.00427   0.00916   0.01417   0.08865
T85       0.02858   0.05920   0.08842   0.55001
T170      0.19430   0.41521   0.65016

MEASURED TIME PER TIMESTEP (SEC), without -qstrict

Problem   L1        L2        L3        L16
T5        0.00007   0.00010   0.00012   0.00044
T10       0.00015   0.00026   0.00036   0.00199
T21       0.00055   0.00109   0.00164   0.01284
T42       0.00302   0.00664   0.01043   0.06937
T85       0.01902   0.03999   0.06055   0.43879
T170      0.12304   0.29344   0.48992

MEASURED MFLOP/SEC RATES, with -qstrict

Problem   L1       L2       L3       L16
T5        408.3    584.4    692.6    977.0
T10       836.3    958.9    1016.5   994.4
T21       1044.3   1052.1   1051.0   789.5
T42       945.9    882.5    855.7    729.3
T85       832.4    803.8    807.2    692.1
T170      774.6    725.0    694.5

MEASURED MFLOP/SEC RATES, without -qstrict

Problem   L1       L2       L3       L16
T5        436.2    640.6    771.4    1135.9
T10       983.5    1156.8   1237.7   1194.6
T21       1395.2   1406.5   1401.0   953.0
T42       1340.1   1216.3   1161.9   931.9
T85       1251.2   1190.0   1178.8   867.6
T170      1223.2   1025.8   921.6

More Aggressive Optimization

with -qhot, without -qstrict (first table of each pair)

with -O4, without -qstrict (second table of each pair)

MEASURED TIME PER TIMESTEP (SEC), with -qhot, without -qstrict

Problem   L1        L2        L3        L16
T5        0.00004   0.00006   0.00009   0.00043
T10       0.00011   0.00023   0.00034   0.00212
T21       0.00051   0.00111   0.00167   0.01364
T42       0.00295   0.00706   0.01082   0.07368
T85       0.01881   0.04152   0.06132   0.47087
T170      0.12235   0.30675   0.50424

MEASURED TIME PER TIMESTEP (SEC), with -O4, without -qstrict

Problem   L1        L2        L3        L16
T5        0.00004   0.00006   0.00009   0.00042
T10       0.00011   0.00023   0.00034   0.00212
T21       0.00051   0.00111   0.00165   0.01362
T42       0.00296   0.00701   0.01082   0.07359
T85       0.01883   0.04168   0.06054   0.46985
T170      0.12462   0.31054   0.50793

MEASURED MFLOP/SEC RATES, with -qhot, without -qstrict

Problem   L1       L2       L3       L16
T5        839.3    966.9    1063.7   1178.7
T10       1292.8   1277.9   1324.7   1116.8
T21       1511.3   1372.9   1373.5   897.3
T42       1370.8   1144.7   1120.5   877.5
T85       1264.7   1146.1   1164.0   808.5
T170      1230.1   981.3    895.5

MEASURED MFLOP/SEC RATES, with -O4, without -qstrict

Problem   L1       L2       L3       L16
T5        872.1    973.7    1060.3   1185.7
T10       1310.4   1278.8   1319.1   1116.4
T21       1502.3   1379.0   1386.7   898.4
T42       1365.9   1152.3   1120.0   878.5
T85       1263.6   1141.6   1179.1   810.2
T170      1207.8   969.3    889.0

DISCUSSION

The threaded compile options ("mpxlf_r", "-qsmp=noauto", "-qnosave") are the first step in exploiting shared memory parallelism. These experiments measure the cost of making these options the default. The results indicate a significant performance penalty for the small problem sizes (for T5 L1 with "-qstrict", the rate drops from 774.9 to 408.3 MFLOP/sec), but the overhead is insignificant for large problem sizes, where the two builds agree to within a few percent for T42 and larger. Because the larger problems are more representative of production runs, these options will be used in practice.
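
The size dependence of the threading overhead can be read off the tables directly. The following is a minimal sketch that divides the non-reentrant MFLOP/sec rate by the threadsafe rate for each problem size, using the L1 column of the "-qstrict" tables; a ratio above 1 means the threadsafe build is slower. The dictionary and variable names are illustrative and not part of PSTSWM.

```python
# MFLOP/sec rates copied from the L1 column of the "with -qstrict" tables.
non_reentrant = {"T5": 774.9, "T10": 1043.7, "T21": 1092.5,
                 "T42": 954.8, "T85": 818.8, "T170": 763.9}
threadsafe = {"T5": 408.3, "T10": 836.3, "T21": 1044.3,
              "T42": 945.9, "T85": 832.4, "T170": 774.6}

# Ratio > 1: the threadsafe build runs slower than the non-reentrant build.
slowdown = {p: non_reentrant[p] / threadsafe[p] for p in non_reentrant}

for problem, ratio in slowdown.items():
    print(f"{problem:>5}: {ratio:.3f}")
```

For this column the ratio is close to 1.9 at T5 and falls to within about 2% of 1.0 from T42 upward. The other columns can be substituted in the same way.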
 
The "-qstrict" compile option ensures that optimizations enabled by "-O3" do not change the semantics of the program. As can be seen above, "-qstrict" has a significant impact on performance: for example, the T170 L1 rate in the non-reentrant build drops from 1207.3 to 763.9 MFLOP/sec when it is specified. The accuracy of the simulation calculated by PSTSWM is not degraded by omitting "-qstrict". However, "-qstrict" may be needed to ensure reproducibility in the full atmospheric model CAM. Consequently, the performance when specifying "-qstrict" may be more representative of what can safely be obtained when using the ORNL POWER4 system.
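
To put a number on the cost of "-qstrict", the sketch below computes the ratio of the MFLOP/sec rate without "-qstrict" to the rate with it, using the L1 column of the non-reentrant tables. The names are illustrative; only the numeric values come from the tables above.

```python
# MFLOP/sec rates copied from the "Not Reentrant" tables, L1 column.
with_strict = {"T5": 774.9, "T10": 1043.7, "T21": 1092.5,
               "T42": 954.8, "T85": 818.8, "T170": 763.9}
without_strict = {"T5": 901.3, "T10": 1281.8, "T21": 1493.6,
                  "T42": 1380.7, "T85": 1250.7, "T170": 1207.3}

# Speedup obtained by dropping -qstrict (ratio of MFLOP/sec rates).
speedup = {p: without_strict[p] / with_strict[p] for p in with_strict}

for problem, ratio in speedup.items():
    print(f"{problem:>5}: {ratio:.2f}x")
```

In this column the gain from omitting "-qstrict" grows steadily with problem size, from roughly 1.16x at T5 to roughly 1.58x at T170.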
 
The last two experiments examine the advantage of applying higher-order loop transformations ("-O3 -qhot") and more aggressive optimizations ("-O4"). Applied globally, as done here, these options show little performance difference compared with the "-O3" experiments without "-qstrict".
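
The comparison can be made concrete as a percentage change in MFLOP/sec relative to the plain "-O3" build without "-qstrict". The sketch below does this for the L1 column only; the helper name pct_change is hypothetical, and the values are copied from the non-reentrant and "More Aggressive Optimization" tables.

```python
# MFLOP/sec rates, L1 column, all builds without -qstrict.
o3 = {"T5": 901.3, "T10": 1281.8, "T21": 1493.6,
      "T42": 1380.7, "T85": 1250.7, "T170": 1207.3}
qhot = {"T5": 839.3, "T10": 1292.8, "T21": 1511.3,
        "T42": 1370.8, "T85": 1264.7, "T170": 1230.1}
o4 = {"T5": 872.1, "T10": 1310.4, "T21": 1502.3,
      "T42": 1365.9, "T85": 1263.6, "T170": 1207.8}


def pct_change(new, base):
    """Percentage change of each rate in `new` relative to `base`."""
    return {p: 100.0 * (new[p] - base[p]) / base[p] for p in base}


qhot_vs_o3 = pct_change(qhot, o3)
o4_vs_o3 = pct_change(o4, o3)

for p in o3:
    print(f"{p:>5}: -qhot {qhot_vs_o3[p]:+.1f}%   -O4 {o4_vs_o3[p]:+.1f}%")
```

For the L1 column, both options stay within about 2.5% of the "-O3" rate from T10 upward, with T5 the main exception. The other columns can be checked by substituting their values.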

Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 10:22:39 EDT.