Here are my current results. To summarize, we have a programming "style" that should make supporting both runtime and compile-time loop bounds and array dimensions "trivial". I also have results indicating that runtime loop bounds are not a performance hit on the IBM SP. My next task is to look at runtime vs. compile-time array declarations, and to repeat all of this on the AlphaServer SC.
1) The first "result" is an observation by Bill Putman that it is trivial to support both the compile-time and runtime options. If the number of columns is declared in the "physics data structures module", which seems to be consistent with the rest of the design, then it can be defined as either a parameter or a runtime variable without touching the rest of the code. That is, the number of columns is not passed into the physics routines; only a pointer to the data structure is passed, and that structure is defined in a module. If the number of columns per segment needs to vary, we can still use this approach by computing on bogus columns or by using masks.
I feel that this is a crucial observation. No matter how many tests we make now, we cannot be sure that some platform in the near term will not behave differently. By making "nlon" a global variable, we can address any performance issues when they arise without changing the body of the code.
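The approach in item 1 can be sketched roughly as follows. This is a C analogue, not the actual Fortran module: the names set_nlon, physics_state, masked_sum and the NLON_COMPILE_TIME build flag are hypothetical, and the mask field is only one possible way to handle segments with fewer real columns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the "physics data structures module".
 * Compile with -DNLON_COMPILE_TIME=<n> to fix the column count at
 * compile time; otherwise nlon is a global set once at startup. */
#ifdef NLON_COMPILE_TIME
enum { nlon = NLON_COMPILE_TIME };
void set_nlon(int n) { assert(n == nlon); }   /* value is already fixed */
#else
int nlon = 0;
void set_nlon(int n) { assert(n > 0); nlon = n; }
#endif

enum { NVER = 18 };             /* vertical levels, as in the experiments */

/* Only a pointer to this structure is passed to the physics routines. */
typedef struct {
    double *t;                  /* nlon * NVER values, column index fastest */
    const unsigned char *mask;  /* nlon flags: 1 = real column, 0 = bogus */
} physics_state;

/* Example physics routine: the column count is not an argument. */
double masked_sum(const physics_state *s)
{
    double sum = 0.0;
    for (int k = 0; k < NVER; ++k)
        for (int i = 0; i < nlon; ++i)
            if (s->mask[i])
                sum += s->t[(size_t)k * nlon + i];
    return sum;
}
```

A driver would call set_nlon once at startup (for example with a value read from input), fill a physics_state, and pass only its address to routines such as masked_sum. Switching between a runtime and a compile-time bound is then a build option rather than a change to the physics code.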
2) "The Experiments"
- CRM experiments with 18 vertical levels (NVER).
- Arrays defined as (PLOND,NVER,PLAT).
- PLOND and PLAT declared at compile-time.
- Modified CRM to either read in PLON or define it at compile time. The number of latitudes actually computed, rlat <= PLAT, is determined at runtime.
- Running on ORNL SP (375 MHz POWER3-II processors with 8MB L2 cache).
- 2 experiments, each for both -O3 and -O3 -qhot compiler options on the SP:
  i) runtime PLON=1 vs. compile-time PLON=1; PLOND=1,..,512
  ii) runtime PLON=1,..,512 vs. compile-time PLON=1,..,512; PLOND=PLON
These experiments examine the effects of runtime loop bounds and compile-time array size declarations. In the original compile-time experiments, -O3 produced consistently mediocre performance as PLON varied, while -O3 -qhot produced good performance when PLON > 16.
i) MFlops/sec PLON=1; PLOND=1,..,512; PLAT=512

   PLOND   runtime   runtime     compile-time   compile-time
           -O3       -O3 -qhot   -O3            -O3 -qhot
       1       111          75            115            112
       2       113          74            114            112
       4       110          73            111            109
       8       102          70            103            103
      16        85          62             85             89
      32        71          58             72             79
      64        47          46             45             62
     128        22          29             22             35
     256        11          16             11             17
     512         6           9              6              9

So, CRM performance IS cache sensitive on the IBM. If the columns being computed are widely separated in memory (PLON=1, PLOND >> 1), then performance becomes VERY poor. With -O3, compile-time and runtime performance are essentially identical. With -O3 -qhot, compile-time performance is equivalent to -O3, while runtime -O3 -qhot performance is worse for small PLOND. (Do unknown loop bounds and poor cache locality lead the -qhot optimizations astray?)
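To make "widely separated in memory" concrete, here is a small sketch of the address arithmetic for an array declared (PLOND,NVER,PLAT) in Fortran storage order. The function names are hypothetical, and the 8-byte element size used in the note below is an assumption; the real kind used in CRM is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/* Zero-based offset, in elements, of Fortran element A(i,k,j) in an array
 * declared A(plond, nver, plat). Fortran stores the first index fastest. */
size_t fortran_offset(size_t i, size_t k, size_t j, size_t plond, size_t nver)
{
    assert(i >= 1 && i <= plond && k >= 1 && k <= nver && j >= 1);
    return (i - 1) + plond * ((k - 1) + nver * (j - 1));
}

/* Distance in bytes between levels k and k+1 of the same column. */
size_t level_stride_bytes(size_t plond, size_t elem_bytes)
{
    size_t lower = fortran_offset(1, 1, 1, plond, 2);
    size_t upper = fortran_offset(1, 2, 1, plond, 2);
    return (upper - lower) * elem_bytes;
}
```

With PLON=1 only i=1 is ever touched, so one element in every PLOND is used. Assuming 8-byte reals, level_stride_bytes(1, 8) is 8 bytes (contiguous), while level_stride_bytes(512, 8) is 4096 bytes, so each level of a column lies far more than a typical cache line away from the next.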
ii) MFlops/sec PLON=1,..,512; PLOND=PLON; PLAT=512,..,1

   PLON    runtime   runtime     compile-time   compile-time
           -O3       -O3 -qhot   -O3            -O3 -qhot
       1       111          75            115            112
       2       119         109            120            115
       4       123         166            123            123
       8       124         192            124            166
      16       123         235            123            236
      32       122         273            122            269
      64       120         282            120            278
     128       117         274            117            276
     256       116         269              -            265
     512       113         263            113            259

With -O3, runtime and compile-time performance are again identical as the number of columns in the segment varies (that is, as the PLON loop bound increases), and performance is relatively insensitive to this variation. -O3 -qhot is still very sensitive to this parameter, and compile-time -qhot does a better job for small PLON. However, once PLON is large enough that the improved -qhot performance becomes evident, runtime performance is equivalent to compile-time performance.
June 2000