The Parallel Column Radiation Model (PCRM) is a version of the NCAR Column Radiation Model (CRM) that has been modified for benchmarking. CRM is a standalone version of the column radiation model used in version 3.6.6 of the Community Climate Model (CCM3), and it represents one of the major computational tasks in the column physics computations. To quote the CRM documentation, "... the CRM is a physical process model which isolates the energetics of radiative transfer from the rest of the CCM3. The CRM is built from the radiation routines from CCM3, along with a simple text interface for the user to input information needed by the radiation calculation."

The radiation calculations are vertical column based (in a longitude, latitude, vertical computational domain), and computations for different vertical columns (different (longitude, latitude) indices) are independent. To exploit vectorization, it can be important to bundle the computation of multiple columns. However, working with too many columns simultaneously can cause cache misses. To examine this performance issue, PCRM uses a compile-time tuning parameter to specify the number of columns that are bundled in a single chunk, which gives the index ordering

(inner column index, vertical index)

The basic loop structure follows the index ordering:

    DO J=1,NCHUNKS
      DO K=1,NVER
        DO I=1,NCOLS
          (radiation calculation)
        ENDDO
      ENDDO
    ENDDO

Here,
- NCOLS is the number of columns in a chunk
- NVER is the number of vertical levels
- NCHUNKS is the number of chunks
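
The loop nest above can be made concrete with a small standalone program. This is a minimal sketch and not the PCRM source: the array name `work`, the placeholder arithmetic standing in for the radiation calculation, and the value `ncols = 64` are all assumptions chosen for illustration. It shows the column index as the first (contiguous) array index, so that the innermost loop runs at stride 1 over the columns of a chunk.

```fortran
! Minimal sketch of the chunked loop nest; NOT the PCRM source.
! The array name WORK and the arithmetic are hypothetical placeholders.
program chunk_loop
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ncols = 64                 ! columns per chunk (assumed value)
  integer, parameter :: nver = 18                  ! vertical levels
  integer, parameter :: ntotal = 512               ! total number of columns
  integer, parameter :: nchunks = ntotal / ncols   ! number of chunks
  real(dp) :: work(ncols, nver)                    ! (inner column index, vertical index)
  real(dp) :: checksum
  integer :: i, j, k

  checksum = 0.0_dp
  do j = 1, nchunks                 ! chunks are independent of each other
     do k = 1, nver
        do i = 1, ncols             ! stride-1 inner loop over the columns of a chunk
           ! placeholder for the radiation calculation
           work(i, k) = real(k, dp) + 1.0e-3_dp * real(i + (j - 1) * ncols, dp)
        end do
     end do
     checksum = checksum + sum(work)
  end do

  print *, 'nchunks =', nchunks, ' checksum =', checksum
end program chunk_loop
```

With `ncols = 1` the inner loop has a single iteration and the vertical index becomes the only one that varies within a chunk.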
In these experiments, the number of vertical levels is fixed at 18 and the total number of columns is fixed at 512. The size of NCOLS is varied, requiring a corresponding "inverse" variation in NCHUNKS. The issue to be examined is the performance sensitivity of computing a small number of columns at a time (NCOLS small) or a large number at a time (NCOLS large) for different compiler options. Good vectorization requires NCOLS to be large. Cache-based processor architectures may prefer NCOLS small. Setting NCOLS=1 is equivalent to setting the vertical level as the first index.
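
The inverse relationship between NCOLS and NCHUNKS, and the way the per-chunk working set grows with NCOLS, can be tabulated with a short program. This is a sketch under stated assumptions: the chunk sizes are taken to be powers of two, and the byte count is for one hypothetical 8-byte real array dimensioned (NCOLS, NVER). The real code uses many such arrays, so the actual working set is a multiple of this figure.

```fortran
! Sketch: tabulate NCHUNKS and the size of one (NCOLS, NVER) array
! for a fixed total of 512 columns and 18 levels.
! Power-of-two chunk sizes and 8-byte reals are assumptions.
program chunk_sizes
  implicit none
  integer, parameter :: nver = 18        ! vertical levels
  integer, parameter :: ntotal = 512     ! total number of columns
  integer, parameter :: real_bytes = 8   ! assumed size of one array element
  integer :: ncols, nchunks

  print '(3a12)', 'NCOLS', 'NCHUNKS', 'bytes/array'
  ncols = 1
  do while (ncols <= ntotal)
     nchunks = ntotal / ncols
     print '(3i12)', ncols, nchunks, ncols * nver * real_bytes
     ncols = 2 * ncols
  end do
end program chunk_sizes
```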
All results are presented in terms of MFlop/sec, where the number of floating point operations was measured using the Speed Shop tools "ssrun -ideal" and "prof -archinfo" on an SGI Origin 2000. Compiler optimization was set at "-64 -O3". The results for a single 18-level column using the "cloudy day" input provided with CRM are as follows.

| Levels | Floating point operations per column | sqrt calls in flop count | fdiv calls in flop count |
|---|---|---|---|
| 18 | 987496 | 6.7% | 3.4% |

As sqrt and fdiv are given the same weight as a floating point multiply or add in the count, using the computational rates to estimate percentage of peak is suspect. The rates are best viewed as a normalization of inverse runtime, where the same normalization is used for describing performance on all platforms.
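
Read this way, a reported rate is the fixed operation count divided by the measured runtime. As a worked sketch, assume that each of the 512 columns costs the same as the single measured column, and write $t$ for the measured runtime in seconds (a symbol introduced here for illustration, not taken from the original):

$$
R_{\mathrm{MFlop/sec}} = \frac{987496 \times 512}{10^{6}\, t} = \frac{505.597952}{t}
$$

Because the numerator is the same on every platform, the ratio of two reported rates equals the inverse ratio of the corresponding runtimes.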
The following compiler optimizations were examined in the first graph:
- -C hopt
- -C vopt
- -C ssafe -C vsafe
![]()
From these results, both compiler options and vector length are crucial for high performance. In particular, NCOLS=64 is required to achieve half of the peak observed performance, and NCOLS>=128 is strongly preferred. In these experiments the numerics were not affected by the choice of compiler optimization, and there is no indication that "-C hopt" should not be used.
The SX-6 performance instrumentation option "-ftrace" indicates that the performance of the code is as high as 4.6 GFlop/sec, 40% higher than the highest rate determined using the Speed Shop operation counts.
The next graph describes the same experiment for the "-C hopt" compiler option, but using one and eight processors simultaneously. That is, each processor is solving exactly the same problem, and we report the per processor performance. As each processor has a separate address space, the only way the processors interact is contention for access to memory. (For the eight processor run, there may also be contention with system processes for access to the processors.)
![]()
For these experiments performance degradation due to contention is measurable, but not debilitating (< 15%), except for NCOLS=128. We need to repeat this experiment to verify the NCOLS=128 anomaly.
The next experiment looks at the effect of specifying NCOLS at runtime. For performance tuning of a large application, it is important to determine parameters such as NCOLS empirically. If NCOLS can be set at runtime, a much smaller number of executables is required. The following graph compares the performance of three approaches: setting NCOLS-dependent loop lengths and array dimensions at compile time; setting the loop lengths at runtime but setting a sufficiently large inner array dimension at compile time; and setting both loop lengths and array declarations at runtime.
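
The three approaches can be illustrated as follows. This is a hedged sketch of the general technique, not the PCRM code: the module, the subroutine names, the constants `ncols_fixed` and `ncols_max`, and the placeholder arithmetic are all hypothetical. The third variant uses a Fortran automatic array, which is one way (assumed here) of setting array declarations at runtime.

```fortran
! Sketch of three ways to specify NCOLS; all names here are hypothetical.
module ncols_variants
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nver = 18          ! vertical levels
  integer, parameter :: ncols_fixed = 64   ! variant 1: compile-time chunk size
  integer, parameter :: ncols_max = 512    ! variant 2: compile-time upper bound
contains

  ! Variant 1: loop length and array dimension both fixed at compile time.
  subroutine chunk_compile_time(total)
    real(dp), intent(out) :: total
    real(dp) :: work(ncols_fixed, nver)
    integer :: i, k
    do k = 1, nver
       do i = 1, ncols_fixed
          work(i, k) = real(i + k, dp)     ! placeholder computation
       end do
    end do
    total = sum(work)
  end subroutine chunk_compile_time

  ! Variant 2: loop length set at runtime, inner array dimension
  ! fixed at compile time to a sufficiently large value.
  subroutine chunk_runtime_loops(ncols, total)
    integer, intent(in) :: ncols
    real(dp), intent(out) :: total
    real(dp) :: work(ncols_max, nver)
    integer :: i, k
    do k = 1, nver
       do i = 1, ncols
          work(i, k) = real(i + k, dp)
       end do
    end do
    total = sum(work(1:ncols, :))          ! only the first ncols rows are used
  end subroutine chunk_runtime_loops

  ! Variant 3: loop length and array dimension both set at runtime
  ! (automatic array sized by the dummy argument).
  subroutine chunk_runtime_arrays(ncols, total)
    integer, intent(in) :: ncols
    real(dp), intent(out) :: total
    real(dp) :: work(ncols, nver)
    integer :: i, k
    do k = 1, nver
       do i = 1, ncols
          work(i, k) = real(i + k, dp)
       end do
    end do
    total = sum(work)
  end subroutine chunk_runtime_arrays

end module ncols_variants

program demo
  use ncols_variants
  implicit none
  real(dp) :: t1, t2, t3
  call chunk_compile_time(t1)
  call chunk_runtime_loops(ncols_fixed, t2)
  call chunk_runtime_arrays(ncols_fixed, t3)
  print *, t1, t2, t3                      ! the three totals should agree
end program demo
```

In the second variant the array stride between vertical levels is `ncols_max` rather than `ncols`, so the data touched by a chunk is no longer contiguous when `ncols` is smaller than the bound; that is one reason the variants can perform differently on cache-based systems.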
![]()
As can be seen, setting NCOLS at runtime has almost no performance impact. This is significantly different from what is seen with IBM and Compaq high performance systems.
To provide context for these results, the following graphs describe optimal CRM performance for a number of different platforms:
- NEC SX-5 at NEC
- IBM p690 32-way SMP node using POWER4 processors running at 1.3 GHz (at ORNL)
- Compaq ES45 4-way SMP using EV68 processors running at 1 GHz (at PSC)
- Compaq ES40 4-way SMP using EV67 processors running at 667 MHz (at ORNL)
- IBM Winterhawk 4-way SMP node using POWER3 II processors running at 375 MHz (at ORNL)
- SGI Origin 2000 128-way SMP node using MIPS processors running at 250 MHz (at LANL)

For the multiple processor experiments, all processors in the SMP node are being used. Thus, for the SX-6, 8 processes are running simultaneously, while for the p690, 32 processes are running.
![]()
![]()
![]()
From these results, the best SX-6 performance is more than a factor of 5 greater than the best performance on the non-vector platforms. When the chunk size drops below 16, the performance of the non-vector systems is better. Running the problem on multiple processors simultaneously degrades performance somewhat on all systems, but does not change the quantitative comparison between the platforms.