home  |  about us  |  contact  
 

 CSM Home  
 CSM Home

   

Evaluation of Early Systems


Tuning the Physics Data Structures in the Community Atmospheric Model (CAM)

Introduction

The following graphs describe chunk size tuning results on the Compaq AlphaServer SC (ES40 nodes with 667 MHz EV67 processors) and on an IBM SP (Winterhawk II nodes with 375 MHz Power3 II processors). The data was collected by Patrick H. Worley on the AlphaServer SC and IBM SP systems at Oak Ridge National Laboratory during October 2001 using version 3.12.46 of the NCAR Community Climate Model (CCM). Version 3.12.59 was the last version of the CCM, after which the model was renamed the Community Atmospheric Model (CAM). The modifications between version 3.12.46 and the current version of the model have minimal performance impacts, and these results accurately portray the performance of the CAM as of the middle of October 2001.

Chunks are the basic data structures used in the physics parameterization routines within the Community Atmospheric Model. The physics parameterizations are column based, and computations between vertical columns are independent. For exploiting vectorization (which is important in modern cache-based RISC processor architectures), it can be important to bundle the computation of multiple columns. However, working with too many columns simultaneously can cause cache misses. To support performance optimization, the model uses a chunk size tuning parameter to specify the number of the columns that are bundled in a single chunk. We expect the optimal chunk size to be a function of both processor architecture and problem resolution.

The data structures used in earlier versions of the code were designed to be compatible with the spectral Eulerian dynamical core (another part of the model), and were targeted for efficient performance on vector machines. The computational domain is a tensor product (longitude x latitude x vertical) grid covering the sphere. To exploit the vectorization possibilities, the original data structures defined the domain as

(longitude index, vertical index, latitude index)

with the basic loop structure following the index ordering:

DO J=1,NLAT
  DO K=1,NVER
    DO I=1,NLON
      (physical parameterizations)
    ENDDO
  ENDDO
ENDDO
Here,
  • NLON is the number of longitudes
  • NVER is the number of vertical levels
  • NLAT is the number of latitudes

(The CCM also supports grids in which the number of longitudes is a function of the latitude index, but that does not change the exposition significantly.)

Because computations are independent betweem columns, the inner "longitude" loop is vectorizable. The outer "latitude" loop is commonly where parallelism is exploited, via either MPI or OpenMP. The original data structure is also a "chunk" data structure, where there are NLAT chunks, each with NLON columns. However, in the current version of the model we support more general groupings of columns:

(PCOLS, NVER, NCHUNKS)

where

  • PCOLS is the maximum number of columns allocated to any chunk
  • NCOLS is the number of columns allocated to a given chunk
  • NVER is the number of vertical levels
  • NCHUNKS is the number of chunks

and the loop structure is

DO J=1,NCHUNKS
  DO K=1,NVER
    DO I=1,NCOLS(J)
      (physical parameterizations)
    ENDDO
  ENDDO
ENDDO

Thus, PCOLS*NCHUNKS > NLAT*NLON, but there are no other assumptions about the composition of a chunk. In particular, the columns bundled into a given chunk may not be geographically contiguous (although they will be typically). The inner loop is again the vectorization direction, and the outer loop is the MPI or OpenMP parallel direction. As the chunk size (PCOLS and NCOLS) decreases, the cache locality increases and the available parallelism exploitable at the outer looop level increases. In contrast, as the chunk size increases, the vectorization opportunities increase.

Tuning Experiments

We ran the model with the current production problem resolution of T42L26, corresponding to a 128 x 64 x 26 grid, or NLON=128, NLAT=64, and NVER=26. We set NCOLS= 1, 2, 4, 8, 16, 32, 64, 128 and measured the runtime for 30 simulation days. For T42L26 (NLON=128), using a power-of-two number of columns per chunk allows the same number of columns to be assigned to each chunk. However, array declarations using a power-of-two often lead to cache conflicts. To examine this issue we ran the experiments twice, once with NCOLS == PCOLS and once with NCOLS < PCOLS == the first prime number larger than NCOLS.

We ran the tuning experiments with 16 MPI processes and either 1 or 4 OpenMP threads per process. The Compaq and IBM systems are both clusters of 4-way SMP nodes. 16 nodes were used for both the 1 and 4 OpenMP thread experiments. Thus, in the 1 OpenMP thread experiments 3 processors per node were idle.

64 processors (16 nodes) was chosen for the tuning experiments as this is the largest processor count that can be used and still have at least one chunk per processor when using the "traditional" 128 column chunk. A large processor count was used to minimze the runtimes of the experiments. The 4 OpenMP thread experiment utilizes OpenMP parallelism over the "local" chunks, ranging from 4 to 128 for NCOLS=128 and NCOLS=1, respectively. The 1 OpenMP thread experiment does not use OpenMP parallelism, and computes the local chunks serially. In this latter case any difference in runtime will be due to differences in serial performance from the choice of chunk size (determined by vectorization and memory access patterns). Differences in the runtimes for the 4 OpenMP thread experiments will also include load balancing between chunks and OpenMP thread management overhead.

Three different dynamical cores are supported in the CAM: Eulerian spectral (EUL), Semi-Lagrangian spectral (SLD), and finite-volume (FV). Only SLD results are described below. All dynamical cores use the same physics routines, and the chunk size affects only physics data structures, therefore the optimal chunking size should be independent of the dynamical core used. As SLD uses a much larger dynamics timestep than EUL, physics is called more frequently relative to the dynamics, and the percentage of time spent in physics is larger than in EUL. This makes chunk size differences easier to identify. Note, however, EUL and FV were also used for a few tuning runs to validate the SLD results.

The first graph plots the tuning results when using 4 OpenMP threads on the AlphaServer SC. Both full model times and time spent in physics routines are plotted as a function of the chunk size NCOLS. The time spent in physics was measured using the CAM thread-safe timing library with the TIMING_BARRIERS option enabled, which inserts barriers between phases of the programs. The full model timings were collected with the TIMING_BARRIERS options disabled. In all cases, the model runtime did not increase significantly when running with TIMING_BARRIERS enabled.

With the exception of NCOLS=128, the difference in the model runtimes is solely a function of the difference in the time spents in the physics routines (as expected). The optimal chunk size on the AlphaServer SC is NCOLS = 8 or 16, and using an optimal chunk size improves performance over the traditional chunk definition be 15%.

The second graph plots the tuning results when using 1 OpenMP thread.

Again, the difference in the model runtimes is strictly a function of the difference in the time spent in the physics routines. A significant amount of the performance improvement from optimizing chunk size is due to improved serial performance (but not all), and the same optimal chunk sizes are indicated: NCOLS = 8 or 16.

The third and fourth graphs repeat the previous analyses using results taken from the IBM SP system.

As with the AlphaServer SC reults, the difference in runtime is solely due to differences in time spent in the physics routines, and much (but not all) of the variation is due to changes in the serial performance. The optimal chunk size on the IBM is NCOLS=16, which is similar to that of the AlphaServer SC. However, the sensitivites on the IBM are much different, with very small chunk sizes performing much worse than very large chunk sizes. Note that on the IBM, the code was compiled with "-O3 -qstrict". In earlier studies, we noted a large performance improvement in the radiation routines within the physics from using "-qhot", the vector intrinsic libraries, and moderately large chunks. This is not the basis of the performance difference here. The "-qstrict" options prevents the reordering required to make efficient use of the vector intrinsics.

Model Scalability

The redesign of the physics data structures was one step in a much larger project to improve the performance of CAM, by both improving serial performance and improving parallel scalability. The following graphs demonstrate that the chunking of the physics data structures was highly successful, and that further improvements in scalability require modifications to other parts of the model.

SLD on the AlphasServer SC

Along with the physics, other significant model components are the dynamics and the land surface model. The next 3 graphs portray three different views of the comparative performance of these three components using 8 column chunks and the SLD dynamical core on the AlphaServer SC. The first two graphs describe the runtime as a function of the number of MPI processes. For all experiments, one MPI process is assigned to each SMP node, and 4 OpenMP threads are used within the node. The first graph uses a linear scale for the time axis, while the second graph uses a logarithmic scale. The third graph plots the relative efficiency of each component relative to the execution time on 2 nodes (8 processors).

The graphs indicate excellent scalability for the physics. This is very important as the physics is the dominant component until more than 64 processors are used. The dynamics currently supports only a 1D domain decomposition, and can use at most 64 processors (16 nodes). However, the parallel efficiency of the dynamics is not very good when using more than 16 processors. The land surface model scales even worse than the dynamics, but is not a dominant component even for 128 processors (32 nodes).

EUL on the AlphasServer SC

The next 3 graphs describe scalability results for using the EUL dynamical core on the AlphaServer SC.

These results are qualitatively the same as for the SLD dynamical core. The EUL dynamical core scales somewhat better than SLD, but requires a smaller timestep, so is more costly despite the better scalability.

SLD on the SP

The next 3 graphs describe scalability results for using the SLD dynamical core on the IBM SP. We used 16 column chunks for the SP experiments.

Again, the physics component scales very well, while the other components do not. The primary difference between the results on the IBM and Compaq systems is that CLM component scales somewhat better on the IBM than on the Compaq.

EUL on the IBM SP

The next 3 graphs describe scalability results for using the EUL dynamical core on the IBM SP.

These data indicate the same results - physics scales very well, while the other components do not.

Model Comparison

For completenss sake, we compare the performance of the EUL and SLD dynamical cores on the AlphaServer SC and the IBM SP using both optimal and traditional chunk sizes. The SLD dynamical core uses a 3600 second timestep, while the EUL dynamical core uses a 1200 second timestep. Results are presented in terms of the number of simulation years that can be computed in a single day.

It is clear that even without introducing additional parallelism into the dynamics or the land surface model, the model performance is significantly enhanced by a careful choice of the chunking parameter. The relative improvement from optimzing chunk size is similar between the Compaq and IBM systems. However, the Compaq is 40%-50% faster than the IBM for these simulations. Note however that the model is still in development, and no significant effort has been made examine the effect of high levels of optimization on the code. As mentioned earlier, "-O3 -qstrict" was used on the IBM. On the Compaq, "-O3 -inline speed" was used.




   
  ORNL | Directorate | CSM | NCCS | ORNL Disclaimer | Search
Staff only: CSM computers | who, what, where? | news
 
URL: http://www.csm.ornl.gov/evaluation/CAM/tuning.html
Updated: Thursday, 10-Oct-2002 13:59:51 EDT

webmaster