Performance Studies using PSTSWM

IBM POWER4 Serial Performance

Date/Person: January 10, 2002 / P. Worley
Platform: IBM p690 system at Oak Ridge National Laboratory (cheetah.ccs.ornl.gov):
     1 32-way p690 SMP node (1.3 GHz POWER4, 4 8-way Multichip Modules)
Environment: AIX 5.1
Code Version: 6.8.2
Make Options: MACH=sp COMM=serial PRECISION=8 PERF=n WORKSPACE=22000000 MATH=essl
Compilation Options: mpxlf -O3 -qarch=auto -qtune=auto -qcache=auto
    or mpxlf -O3 -qarch=auto -qtune=auto -qcache=auto -qstrict
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=noauto -qnosave
    or mpxlf_r -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=noauto -qnosave -qstrict
    or mpxlf -O3 -qarch=auto -qtune=auto -qcache=auto -qhot
    or mpxlf -O4 -qarch=auto -qtune=auto -qcache=auto
Link Options: -bmaxdata:0x70000000
    or -bmaxdata:0x70000000 -qsmp=noauto
Number of steps: T5, T10, T21, T42: 241 or 481
    T85: 49 or 97
    T170: 49 or 97
Notes: using Fortran routines for Fourier transforms and BLAS

Not Reentrant

MEASURED TIME PER TIMESTEP (SEC), with -qstrict

Problem  L1       L2       L3       L16
T5       0.00006  0.00010  0.00015  0.00074
T10      0.00020  0.00039  0.00057  0.00317
T21      0.00095  0.00192  0.00290  0.01980
T42      0.00510  0.01086  0.01743  0.11252
T85      0.03496  0.07352  0.11075  0.66863
T170     0.22273  0.47209  0.73521

MEASURED TIME PER TIMESTEP (SEC), without -qstrict

Problem  L1       L2       L3       L16
T5       0.00005  0.00009  0.00013  0.00063
T10      0.00017  0.00033  0.00048  0.00277
T21      0.00073  0.00148  0.00226  0.01737
T42      0.00377  0.00833  0.01365  0.09412
T85      0.02481  0.05388  0.08142  0.56218
T170     0.14875  0.34582  0.57759

MEASURED MFLOP/SEC RATES, with -qstrict

Problem  L1      L2      L3      L16
T5       568.5   620.4   639.1   675.9
T10      751.0   764.5   774.3   747.8
T21      805.3   795.4   792.2   618.2
T42      792.1   744.3   695.5   574.6
T85      680.5   647.3   644.5   569.3
T170     675.7   637.6   614.1

MEASURED MFLOP/SEC RATES, without -qstrict

Problem  L1      L2      L3      L16
T5       652.5   716.1   742.4   792.1
T10      892.9   910.5   924.4   855.8
T21      1053.5  1036.1  1016.4  704.6
T42      1071.2  969.7   888.3   686.9
T85      959.0   883.2   876.6   677.2
T170     1011.8  870.4   781.7

Threadsafe and OpenMP Ready

MEASURED TIME PER TIMESTEP (SEC), with -qstrict

Problem  L1       L2       L3       L16
T5       0.00009  0.00014  0.00018  0.00078
T10      0.00023  0.00042  0.00060  0.00320
T21      0.00097  0.00193  0.00289  0.01958
T42      0.00506  0.01079  0.01699  0.10845
T85      0.03424  0.07180  0.10751  0.64751
T170     0.21787  0.45793  0.71611

MEASURED TIME PER TIMESTEP (SEC), without -qstrict

Problem  L1       L2       L3       L16
T5       0.00008  0.00012  0.00016  0.00064
T10      0.00020  0.00035  0.00050  0.00274
T21      0.00075  0.00149  0.00225  0.01741
T42      0.00376  0.00839  0.01353  0.09455
T85      0.02490  0.05420  0.08212  0.56471
T170     0.14881  0.34592  0.57845

MEASURED MFLOP/SEC RATES, with -qstrict

Problem  L1      L2      L3      L16
T5       348.5   459.3   516.8   647.7
T10      641.7   708.5   735.1   740.6
T21      786.8   791.3   794.1   625.1
T42      798.9   749.2   713.6   596.1
T85      694.9   662.8   663.9   587.9
T170     690.8   657.3   630.5

MEASURED MFLOP/SEC RATES, without -qstrict

Problem  L1      L2      L3      L16
T5       384.6   523.3   596.5   780.7
T10      748.8   845.2   882.9   866.2
T21      1021.8  1028.0  1018.2  703.1
T42      1073.9  962.7   895.7   683.8
T85      955.6   878.0   869.2   674.1
T170     1011.4  870.2   780.6

More Aggressive Optimization

MEASURED TIME PER TIMESTEP (SEC), with -qhot, without -qstrict

Problem  L1       L2       L3       L16
T5       0.00005  0.00009  0.00012  0.00062
T10      0.00016  0.00033  0.00048  0.00293
T21      0.00071  0.00151  0.00229  0.01794
T42      0.00370  0.00863  0.01383  0.09717
T85      0.02440  0.05520  0.08223  0.59060
T170     0.14866  0.36379  0.58891

MEASURED TIME PER TIMESTEP (SEC), with -O4, without -qstrict

Problem  L1       L2       L3       L16
T5       0.00005  0.00009  0.00012  0.00062
T10      0.00016  0.00033  0.00048  0.00292
T21      0.00071  0.00153  0.00228  0.01785
T42      0.00370  0.00876  0.01385  0.09701
T85      0.02445  0.05530  0.08130  0.58845
T170     0.14741  0.36055  0.58855

MEASURED MFLOP/SEC RATES, with -qhot, without -qstrict

Problem  L1      L2      L3      L16
T5       695.1   731.4   769.6   812.2
T10      925.4   903.1   923.3   810.5
T21      1083.9  1011.1  1004.1  682.3
T42      1092.3  936.8   876.7   665.3
T85      975.2   862.0   868.0   644.6
T170     1012.4  827.4   766.7

MEASURED MFLOP/SEC RATES, with -O4, without -qstrict

Problem  L1      L2      L3      L16
T5       696.7   731.9   769.3   809.1
T10      926.5   904.4   922.6   811.3
T21      1081.4  1001.7  1006.9  685.8
T42      1092.7  922.9   875.4   666.4
T85      973.0   860.5   878.0   646.9
T170     1021.0  834.9   767.2

DISCUSSION

The threaded compile options ("mpxlf_r", "-qsmp=noauto", "-qnosave") are the first step in exploiting shared-memory parallelism. These experiments measure the cost of making these options the default. The results indicate a significant performance impact for the small problem sizes (for T5 L1 with "-qstrict", the rate drops from 568.5 to 348.5 MFLOP/sec), but the overhead is insignificant for the large problem sizes (for T85 and T170, the rates with and without the threaded options differ by only a few percent). Production runs are better represented by the larger problems, so these options will be used in practice.
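
The overhead of the threaded options can be quantified directly from the MFLOP/sec tables. The following Python sketch does this for the L1 column of the "-qstrict" runs; the rates are copied from the tables above, while the function and variable names are illustrative and not part of PSTSWM.

```python
# MFLOP/sec rates for the L1 column, compiled with -qstrict,
# copied from the "Not Reentrant" and "Threadsafe and OpenMP Ready" tables.
NOT_REENTRANT = {"T5": 568.5, "T10": 751.0, "T21": 805.3,
                 "T42": 792.1, "T85": 680.5, "T170": 675.7}
THREADSAFE = {"T5": 348.5, "T10": 641.7, "T21": 786.8,
              "T42": 798.9, "T85": 694.9, "T170": 690.8}


def relative_change(baseline, candidate):
    """Percent change in rate from baseline to candidate (negative = slower)."""
    return 100.0 * (candidate - baseline) / baseline


def threading_overhead(baseline_rates, threaded_rates):
    """Percent change per problem size when the threaded options are used."""
    return {problem: relative_change(baseline_rates[problem], threaded_rates[problem])
            for problem in baseline_rates}


if __name__ == "__main__":
    for problem, pct in threading_overhead(NOT_REENTRANT, THREADSAFE).items():
        print(f"{problem:>4}: {pct:+6.1f}%")
```

For these values the loss is largest at T5 and shrinks quickly with problem size; from T42 upward the threaded build is marginally faster in this column.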
 
The "-qstrict" compile option ensures that optimizations enabled by "-O3" do not change the semantics of the program. As the tables above show, "-qstrict" has a significant impact on performance: for T170 L1 in the non-reentrant build, the rate is 675.7 MFLOP/sec with "-qstrict" and 1011.8 MFLOP/sec without it. Omitting "-qstrict" does not degrade the accuracy of the simulation calculated by PSTSWM. However, "-qstrict" may be needed to ensure reproducibility in the full atmospheric model CAM, so the performance with "-qstrict" may be more representative of what can safely be obtained on the ORNL POWER4 system.
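
The cost of "-qstrict" can also be read off the time-per-timestep tables. As a minimal sketch, the Python below computes the slowdown ratio for the T170 rows of the non-reentrant build; the times are copied from the tables above, and the helper names are illustrative only.

```python
# Seconds per timestep for T170, non-reentrant build, copied from the tables above.
# The tables give no L16 value for T170, so that column is omitted.
T170_WITH_QSTRICT = {"L1": 0.22273, "L2": 0.47209, "L3": 0.73521}
T170_WITHOUT_QSTRICT = {"L1": 0.14875, "L2": 0.34582, "L3": 0.57759}


def slowdown(time_with, time_without):
    """Ratio of times; a value above 1 means the -qstrict build is slower."""
    return time_with / time_without


def qstrict_slowdowns(with_strict, without_strict):
    """Slowdown ratio per column for one problem size."""
    return {col: slowdown(with_strict[col], without_strict[col])
            for col in with_strict}


if __name__ == "__main__":
    for col, ratio in qstrict_slowdowns(T170_WITH_QSTRICT, T170_WITHOUT_QSTRICT).items():
        print(f"T170 {col}: {ratio:.2f}x")
```

In these rows the ratio is highest for L1 and decreases for L2 and L3.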
 
The last two experiments examine the advantage of applying higher-order loop transformations (-O3 -qhot) and more aggressive optimizations (-O4). There are negligible differences between the performance of (-O3 -qhot) and -O4. Both are slight improvements over -O3 in the L1 runs and at T5; in most of the multi-level cases at T10 and larger, the tables above show rates slightly below those of -O3.
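
To make the comparison concrete, the sketch below expresses the (-O3 -qhot) and -O4 rates for T42 as percent changes relative to -O3. The rates are copied from the tables above. The baseline is assumed to be the non-reentrant build without "-qstrict", since the -qhot and -O4 compile lines use "mpxlf" rather than "mpxlf_r". The function name is illustrative.

```python
# MFLOP/sec rates for T42, all without -qstrict, copied from the tables above.
# Assumption: "-O3" is the non-reentrant (mpxlf) build without -qstrict.
T42_RATES = {
    "-O3":       {"L1": 1071.2, "L2": 969.7, "L3": 888.3, "L16": 686.9},
    "-O3 -qhot": {"L1": 1092.3, "L2": 936.8, "L3": 876.7, "L16": 665.3},
    "-O4":       {"L1": 1092.7, "L2": 922.9, "L3": 875.4, "L16": 666.4},
}


def percent_vs_baseline(rates, baseline="-O3"):
    """Percent change in rate for each option set relative to the baseline."""
    base = rates[baseline]
    return {
        option: {col: 100.0 * (values[col] - base[col]) / base[col] for col in base}
        for option, values in rates.items()
        if option != baseline
    }


if __name__ == "__main__":
    for option, changes in percent_vs_baseline(T42_RATES).items():
        row = "  ".join(f"{col}: {pct:+5.1f}%" for col, pct in changes.items())
        print(f"{option:<10} {row}")
```

For T42, both option sets gain roughly 2% in the L1 column and lose a few percent in the L2, L3, and L16 columns.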

Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 10:22:40 EDT.