
IBM p690 Compiler Experiments

Evaluation of Early Systems

As with most modern compilers, the IBM Fortran compiler xlf has many options. In the following experiments we examine the performance impact of a number of these options for two serial kernel codes. While far from exhaustive, these results provide some guidance on what performance can be gained (or lost) by the use of the different compiler options. The data described here were collected by Patrick H. Worley during January and February of 2002.

PCRM

The first set of results describes the serial (1 processor) performance of the code PCRM for a number of different problem configurations and compiler options. PCRM is a slight modification of the NCAR Column Radiation Model (CRM). It is given a distinct name simply to denote that it differs from the stock distribution available from NCAR and is likely to lag behind updates to CRM.

CRM is a standalone version of the column radiation model used in the Community Climate Model, and represents one of the major computational tasks in the column physics computations. To quote the CRM documentation,

... the CRM is a physical process model which isolates the energetics of radiative transfer from the rest of the CCM3. The CRM is built from the radiation routines from CCM3, along with a simple text interface for the user to input information needed by the radiation calculation.

The radiation calculations are vertical column based (in a longitude, latitude, vertical computational domain), and computations between vertical columns are independent. To exploit vectorization, it can be important to bundle the computation of multiple columns. However, working with too many columns simultaneously can cause cache misses. To examine this performance issue, PCRM uses a compile-time tuning parameter to specify the number of columns that are bundled in a single chunk:

(inner column index, vertical index)

The basic loop structure follows the index ordering:

DO J=1,NCHUNKS
  DO K=1,NVER
    DO I=1,NCOLS
      (radiation calculation)
    ENDDO
  ENDDO
ENDDO
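Here, NCOLS is the number of columns bundled in a single chunk, NVER is the number of vertical levels, and NCHUNKS is the number of chunks.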

In the experiments that follow, the total number of columns is fixed at 512. The size of NCOLS is varied, requiring a corresponding "inverse" variation in NCHUNKS. The issue to be examined is the performance sensitivity of computing a small number of columns at a time (NCOLS small) or a large number at a time (NCOLS large) for different compiler options. Good vectorization requires NCOLS to be large. Cache-based processor architectures may prefer NCOLS small. Setting NCOLS=1 is equivalent to setting the vertical level as the first index.
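
The inverse relationship between NCOLS and NCHUNKS, and the order in which the chunked loop nest visits the data, can be sketched in a few lines of Python. This is an illustration only, not PCRM code: the function names `chunkings` and `visit_order` are hypothetical, and restricting NCOLS to exact divisors of 512 is an assumption, since the text does not say which NCOLS values were measured.

```python
TOTAL_COLUMNS = 512  # fixed total number of columns, as stated in the text


def chunkings(total=TOTAL_COLUMNS):
    """Return (NCOLS, NCHUNKS) pairs with NCOLS * NCHUNKS == total.

    Assumption: only NCOLS values that divide the total evenly are listed;
    the text does not say which NCOLS values were actually measured.
    """
    return [(ncols, total // ncols)
            for ncols in range(1, total + 1)
            if total % ncols == 0]


def visit_order(ncols, nver, total=TOTAL_COLUMNS):
    """Yield (global column, level) pairs in the order of the chunked loop nest."""
    nchunks = total // ncols
    for j in range(nchunks):          # DO J=1,NCHUNKS
        for k in range(nver):         # DO K=1,NVER
            for i in range(ncols):    # DO I=1,NCOLS
                yield (j * ncols + i, k)


if __name__ == "__main__":
    for ncols, nchunks in chunkings():
        print(f"NCOLS={ncols:4d}  NCHUNKS={nchunks:4d}")
```

With NCOLS=1 the innermost loop has a single iteration, so each column is processed level by level, which matches the remark that this case is equivalent to making the vertical level the first index.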

The following compiler options were examined

in the following graph.

Here, O3.qhot and O4 exhibit identical performance, as do O3 and O3.no_r. All compiler options except O3.qhot and O4 are only mildly sensitive to NCOLS. The O3.qhot and O4 options allow the compiler to rearrange intrinsic function calls and replace them with calls to vector intrinsic routines. This has a significant impact on the performance of this kernel if NCOLS is large enough.

Another compiler option (-q64) enables support for 64-bit addressing. The following graphs examine the impact of this option. This flag currently works only with the reentrant compiler options (xlf_r).

From these results, it appears that compiler support for the vector intrinsics is missing when using 64-bit addressing. Until this is remedied, 64-bit codes that depend on good intrinsic function performance will lose significant performance.

PSTSWM

The second set of results describes the serial (1 processor) performance of the code PSTSWM for a number of different problem sizes and compiler options. PSTSWM solves the shallow water equations on the sphere, and was originally designed as a testbed for parallel spectral algorithms on the sphere. Its serial performance, especially its memory access pattern, is similar to that of the spectral dynamical core used in NCAR global atmospheric models. By scaling the problem size (horizontal and vertical resolutions), different effects of the memory subsystem on performance can be ascertained. PSTSWM is dominated by floating point multiply/add operations, and can also take advantage of math library FFT routines.

The first twelve graphs plot the change in single processor performance (MFlops/second/processor) as the number of vertical levels changes for a fixed horizontal resolution. The indices for the computational arrays are (longitude, vertical, latitude), while most of the data dependencies are in the longitude and latitude directions. In consequence, as the number of vertical levels increases, the code spends more time going to memory.
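
One way to see why more vertical levels push accesses further apart in memory is to compute the linear offset of an element in a Fortran (column-major) array indexed as (longitude, vertical, latitude). The Python sketch below is illustrative only: the function names and the longitude count of 64 are hypothetical, and the growing stride is just one contributing factor alongside the larger overall working set.

```python
def column_major_offset(lon, ver, lat, nlon, nver):
    """Zero-based linear offset of element (lon, ver, lat) in Fortran order."""
    return lon + nlon * (ver + nver * lat)


def latitude_stride(nlon, nver):
    """Number of elements separating (lon, ver, lat) from (lon, ver, lat + 1)."""
    return (column_major_offset(0, 0, 1, nlon, nver)
            - column_major_offset(0, 0, 0, nlon, nver))


if __name__ == "__main__":
    nlon = 64  # hypothetical longitude count, for illustration only
    for nver in (1, 2, 4, 8, 16):
        stride = latitude_stride(nlon, nver)
        print(f"NVER={nver:2d}  latitude stride = {stride} elements")
```

Neighbors in longitude stay adjacent regardless of the number of levels, while neighbors in latitude move apart in proportion to the number of vertical levels.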

The six horizontal resolutions used were

Each horizontal resolution requires approximately 4 times as much memory as the next smaller problem size, with T21 needing approx. 0.8MB of data space and T85 needing approx. 16MB for a single vertical level. To determine the total memory requirements, multiply the space needed for a single vertical level by the number of vertical levels.
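
The memory estimate described above is a single multiplication, sketched here in Python. Only the two per-level sizes quoted in the text (T21 and T85) are included; the function name `total_memory_mb` and the level counts in the usage loop are hypothetical and chosen for illustration.

```python
# Approximate single-level data sizes quoted in the text, in MB.
PER_LEVEL_MB = {"T21": 0.8, "T85": 16.0}


def total_memory_mb(resolution, nlevels):
    """Approximate data space: per-level size times number of vertical levels."""
    if nlevels < 1:
        raise ValueError("nlevels must be at least 1")
    return PER_LEVEL_MB[resolution] * nlevels


if __name__ == "__main__":
    for res in PER_LEVEL_MB:
        for nlevels in (1, 8, 16):  # hypothetical level counts
            print(f"{res}, {nlevels:2d} levels: ~{total_memory_mb(res, nlevels):.1f} MB")
```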

PSTSWM was compiled using one of the following seven sets of compiler flags:

and linked to use either the ESSL FFT routines or the Fortran equivalents included with PSTSWM.

In the graphs that follow, the legend orders the compiler options by performance (approximately), with the first listed producing the fastest runs.

xlf -O3 is the best compiler option overall. For larger problem sizes, xlf_r -O3 and xlf -O3 are indistinguishable. However, for small problem sizes, the code compiled with xlf -O3 performs much better than when compiled with xlf_r -O3.

The higher levels of optimization xlf -O3 -qhot and xlf -O4 do not improve the performance for this code. In contrast, the strict options xlf -O3 -qstrict and xlf_r -O3 -qstrict degrade performance significantly. For the most part, performance is identical for xlf_r -O3 and xlf_r -O3 {threadsafe}.

The use of the ESSL FFT routines improves performance, but does not change the compiler option comparisons. Sensitivity to the choice of compiler option is highest for the small problem instances. Once performance is dominated by memory bandwidth, the compiler options (other than the strict flag) have little impact.

The next set of graphs examines the performance impact of enabling 64-bit addressing via the -q64 compiler option. This flag currently works only with the reentrant compiler options (xlf_r). In the following, we compare performance using

For the larger horizontal problem sizes when using the ESSL FFT routines, 64-bit support appears to degrade performance compared to xlf_r -O3. In contrast, experiments using the Fortran FFT routines show little difference when compiling with and without 64-bit support. Our supposition is that the performance discrepancy lies between the 64-bit and 32-bit versions of the ESSL libraries.


URL http://www.csm.ornl.gov/evaluation/CHEETAH/CompilerTest.html
Updated: Tuesday, 26-Feb-2002 16:06:09 EST