
   

Evaluation of Early Systems


Community Atmospheric Model Performance Benchmarks

Spectral Eulerian Dynamics at T42L26

Introduction

The Community Atmospheric Model (CAM) has two primary computational phases, the dynamics and the physical parameterizations (physics). CAM currently supports three different approximations to the dynamics, or dynamical cores,

  • spectral Eulerian (EUL),
  • finite volume semi-Lagrangian (FV), and
  • spectral semi-Lagrangian (SLD),
all using the same physics. The choice is specified at compile time. The following graphs describe the performance of the Community Atmospheric Model when run with the spectral Eulerian dynamical core and a T42L26 problem size. (T42L26 corresponds to a 128x64 horizontal grid and 26 vertical levels.) Performance was optimized with respect to the

  • chunk size
  • load balancing of the physics computation
  • number of OpenMP threads per MPI process

on each target platform. However, no attempt was made to find better compiler optimizations. The platforms, compiler flags, and tuning options are described in more detail below. The following should be considered baseline performance. Improvements are expected as CAM is optimized on each platform.

Platforms

Results are presented for four systems. All data were collected in November and December of 2002 by Patrick H. Worley.

  • HP / Compaq AlphaServer SC at Pittsburgh Supercomputing Center: 750 ES45 4-way SMP nodes (1GHz Alpha EV68) and a Quadrics QsNet interconnect with two interconnect interfaces per node; /scratch2 was used for input datasets and history and restart output files.

    System Software:
    Tru64 UNIX version 5.1A
    Compaq AlphaServer SC TS2.5
    Compaq Fortran Compiler X5.4A-1684-46B5P
    Compiler optimization flags:
    -omp -automatic -fpe3 -check omp_bindings -O3 -inline speed

  • IBM p690 cluster at ORNL: 27 p690 32-way SMP nodes (1.3 GHz POWER4) and an SP Switch2 with either two or eight switch interfaces per node; /tmp/gpfs750b was used for input datasets and history and restart output files.

    System Software:
    AIX version 5.1.0.35
    Parallel Operating Environment version 3.2.0.10
    XL Fortran Compiler version 7.1.1.3
    Compiler optimization flags:
    -qarch=auto -qspillsize=2500 -g -O3 -qstrict -Q -qsmp=omp
    Runtime Flags:
    setenv MP_SHARED_MEMORY yes

  • IBM SP at ORNL: 176 Winterhawk II 4-way SMP nodes (375 MHz POWER3-II) and an SP Switch with one switch interface per node; /tmp/gpfs600a was used for input datasets and history and restart output files.

    System Software:
    AIX version 5.1.0.25
    Parallel Operating Environment version 3.2.0.9
    XL Fortran Compiler version 7.1.1.3
    Compiler optimization flags:
    -qarch=auto -qspillsize=2500 -g -O3 -qstrict -Q -qsmp=omp
    Runtime Flags:
    setenv MP_SHARED_MEMORY yes

  • SGI Origin 3000 at Los Alamos National Laboratory: 512-way SMP (500 MHz MIPS R14000); the home directory file system was used for input datasets and history and restart output files.

    System Software:
    IRIX64 version 6.5
    Message Passing Toolkit version 1.6
    MIPSpro Compilers version 7.3.1.3m
    Compiler optimization flags (MPI-only):
    -64 -O2
    Compiler optimization flags (MPI/OpenMP):
    -64 -O2 -mp -MP:dsm=OFF -MP:old_mp=OFF
    Runtime Flags for MPI:
    setenv MPI_BUFFER_MAX 2000
    setenv MPI_BUFS_PER_PROC 32
    Additional Runtime Flags for MPI/OpenMP:
    setenv MPI_OPENMP_INTEROP 1
    setenv MPI_DSM_PLACEMENT "threadroundrobin"

A few experiments were run to determine the impact of higher levels of optimization. For example, -O3 on the SGI improved serial performance by approximately 10%, but at the cost of losing bit-for-bit reproducibility with respect to the number of processors. Similar results hold for the other platforms. Further compiler optimizations need to be applied on a subroutine-by-subroutine basis to be effective. Note that the SGI shows a mild degradation in performance when both MPI and OpenMP parallelism are enabled. For this reason, the -mp compiler switch was not used when running "MPI-only".

Experiment

Experiments were run using CAM2_0_1.dev10, which includes all of the performance improvements described in the CAM2 Performance Evolution web page. Performance data were collected for a 30-day simulation with model resolution T42L26, a 1200-second timestep, the default daily output to the atmosphere history file, and monthly output to the land history file and to the atmosphere and land restart files. Multiple runs were made, and the minimum timings are reported. The accompanying text notes any case in which the average is much higher than the minimum.

Performance is cited in terms of the simulation years per day throughput metric, i.e. how many simulation years can be computed in 24 hours of computing. Results are presented for the best performance for a range of processor counts for each system. Timing was not started until after the model initialization was complete, immediately before beginning the computation of the first timestep.
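The throughput metric can be illustrated with a short Python sketch. It is not part of CAM or of the benchmark scripts. The function names are hypothetical, the 365-day year is an assumption about the model calendar, and the 600-second wall-clock time is a made-up input rather than a measured result.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # assumption: a fixed 365-day model year


def timesteps(sim_days, dt_seconds):
    """Number of model timesteps in a simulation of sim_days days."""
    return sim_days * SECONDS_PER_DAY // dt_seconds


def simulated_years_per_day(sim_days, wallclock_seconds):
    """Simulation years computed per 24 hours of wall-clock time."""
    sim_years = sim_days / DAYS_PER_YEAR
    wallclock_days = wallclock_seconds / SECONDS_PER_DAY
    return sim_years / wallclock_days


# The benchmark configuration: 30 simulated days with a 1200-second timestep.
print(timesteps(30, 1200))  # 2160

# Hypothetical timing: a 30-day run that took 600 seconds after initialization.
print(round(simulated_years_per_day(30, 600.0), 2))  # 11.84
```

Because timing starts only after initialization, the wall-clock value passed in should exclude start-up cost.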

Tuning

As described above, CAM has two primary computational phases, the dynamics and the physics. The physics has the higher serial complexity for the target problem resolutions, but it is also easily parallelized. Parallel algorithms for the dynamics require significantly more interprocessor communication than for the physics. In addition, the parallel implementation of the spectral Eulerian dynamics currently supports only a 1-D decomposition of the computational domain, limiting the number of usable processors to 64 for T42L26. By using hybrid MPI/OpenMP parallelism, more processors can be applied to the physics than to the dynamics, improving scalability.

The physics is "column-based", with computation on different columns (longitude-latitude coordinates) being independent. The work required on a column varies with location and with both diurnal and seasonal cycles, but is relatively insensitive to other factors, so static load balancing is quite effective. The basic computational unit in the physics is the "chunk", which is a subset of columns. The maximum number of columns in a given chunk is set at compile time. The choice of the chunk size and composition can affect both serial performance, due to the effect on memory access patterns, and parallel performance, as both MPI and OpenMP parallelism act at the chunk level. For example, smaller chunks imply more chunks and more exploitable parallelism. Global load balancing comes at the cost of increased interprocessor communication, since data must be remapped from the domain decomposition used in the dynamics, but local load balancing that avoids interprocessor communication is also an option. For more details on chunk definition, see the CAM1 Performance tuning and benchmarking web page.
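As a rough illustration of how chunk size trades off against available parallelism, the following Python sketch counts the chunks for the 128x64 T42 grid. It assumes every chunk is filled to the compile-time maximum, which is a simplification: the actual chunk composition in CAM depends on the load balancing option. The function name is hypothetical.

```python
import math

NLON, NLAT = 128, 64  # T42 horizontal grid, as stated in the introduction


def num_chunks(max_columns_per_chunk, ncols=NLON * NLAT):
    """Minimum number of chunks needed to hold ncols columns.

    Assumption: chunks are packed to the maximum size; the real
    decomposition may produce more, partially filled chunks.
    """
    return math.ceil(ncols / max_columns_per_chunk)


for size in (16, 32):
    chunks = num_chunks(size)
    # Chunks per process when the work is spread over 64 processes.
    print(size, chunks, chunks // 64)
```

With 8192 columns in total, 16-column chunks give 512 chunks and 32-column chunks give 256, so halving the chunk size doubles the number of independent work units.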

Three optimization exercises were undertaken on each platform:

  • Experiments were run to determine the optimal chunk size, both with and without global load balancing.
  • For each processor count, performance was measured both with and without global load balancing.
  • For each processor count, performance was measured for MPI-only, two threads per MPI process, four threads per MPI process, and eight threads per MPI process, where (number of MPI processes)*(number of threads per MPI process) equals the number of processors.
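The set of MPI/OpenMP combinations tested for a given processor count can be enumerated with a small Python sketch. It applies only the two constraints stated in the text: the product of processes and threads equals the processor count, and at most 64 MPI processes are usable for T42L26. It ignores platform limits such as the number of processors per SMP node, and the function name is hypothetical.

```python
MAX_MPI_PROCESSES = 64       # 1-D decomposition limit for T42L26
THREAD_COUNTS = (1, 2, 4, 8)  # MPI-only, then 2, 4, and 8 threads per process


def hybrid_configs(nprocs):
    """Return (mpi_processes, threads_per_process) pairs for nprocs processors."""
    configs = []
    for threads in THREAD_COUNTS:
        if nprocs % threads != 0:
            continue  # processes * threads must equal nprocs exactly
        mpi_processes = nprocs // threads
        if mpi_processes <= MAX_MPI_PROCESSES:
            configs.append((mpi_processes, threads))
    return configs


print(hybrid_configs(32))   # [(32, 1), (16, 2), (8, 4), (4, 8)]
print(hybrid_configs(128))  # [(64, 2), (32, 4), (16, 8)]
print(hybrid_configs(256))  # [(64, 4), (32, 8)]
```

Beyond 64 processors the MPI-only option disappears, which is why the larger runs necessarily use OpenMP threads.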

Platform Comparisons

The following graphs describe the baseline performance of the Community Atmospheric Model on the target platforms when using the spectral Eulerian dynamical core with problem size T42L26. Note that when using more than 64 processors, CAM is necessarily run using at most 64 MPI processes and some number of OpenMP threads per process.

In these results,

  • 16-column chunks were used for the IBM and HP timings, while 32-column chunks were used for the SGI timings;
  • global load balancing was used for the IBM and SGI timings, and for the HP timings when using 32 or fewer processors. When using more processors, local load balancing was more efficient on the HP system.

Thus the p690 cluster and AlphaServer SC systems demonstrate nearly identical performance for this configuration of CAM, while the Origin 3000 and SP systems are significantly slower.

Performance Diagnosis

The following figures describe the performance of CAM in more detail. Timings for the physics, spectral Eulerian dynamical core, land model, physics/dycore and physics/land interface routines, and I/O and timestep set-up are graphed separately. The interface routines are dominated by interprocessor communication. The I/O and set-up are dominated by I/O and serial bottlenecks. Note that these are log-log graphs, and differences are larger than they appear at first glance.

  • Up to 32 processors, the physics dominates the model performance. Improving the serial performance of the physics could improve model performance significantly on all of the systems. On three of the systems, the physics scales perfectly out to 256 processors. However, there is a performance anomaly on the IBM when moving from 128 to 256 processors, probably linked to the fact that 4 OpenMP threads are used for the 128 processor timing while 8 OpenMP threads are used for the 256 processor timing.
  • For more than 32 processors, the dycore begins to limit CAM performance. The dycore performance essentially stops improving when using more than 64 processors. Implementing a two-dimensional domain decomposition should address this problem.
  • While the land is not scaling perfectly, it is scaling well enough that it does not limit CAM performance.
  • While there is some variation, the interface routines scale reasonably well on all platforms but the HP, and the interprocessor communication in the interface routines is not a performance limiter. On the HP, the performance of the interface routines is a problem, and global load balancing is not optimal for larger processor counts. (Since global load balancing is not used for large processor counts, the cost of the interface routines in the above graph is lower than it would be otherwise.) This performance problem may be due to an inefficient implementation of the MPI collective routine MPI_ALLTOALLV. An alternative implementation will be examined to try to address this problem on the HP.
  • The I/O and the timestep set-up routines both have significant serial bottlenecks. A new parallel I/O library is being developed, and some of the serial bottlenecks in the timestep set-up routines will be eliminated in a future modification. Note that the performance on the HP is also affected by the use of other MPI collectives that appear to perform poorly. Alternative implementations of these routines will also be examined.

In summary, a number of additional performance enhancements have been identified, and we expect to implement these in the near future.




   
 
URL: http://www.csm.ornl.gov/evaluation/CAM/benchmark.eul.t42l26.html
Updated: Monday, 24-Mar-2003 10:29:34 EST
