home  |  about us  |  contact  
 

 CSM Home  
 CSM Home

   

Evaluation of Early Systems


Community Atmospheric Model (CAM) Performance Benchmarks

Introduction

The following graphs describe the performance of the Community Atmospheric Model when run with the spectral Eulerian (EUL) and spectral Semi-Lagrangian (SLD) dynamical cores. Performance was optimized with respect to the

  • chunk size
  • physics load balancing
  • number of OpenMP threads per MPI process
on each target platform. However, no attempt was made to find better compiler optimizations. The platforms, compiler flags. and tuning options are described in more detail below. The following should be considered baseline performance. Improvements are expected as CAM is optimized on each platform.

Platforms

Results are presented for four systems. All data were collected November and December of 2002 by Patrick H. Worley.

  • HP / Compaq AlphaServer SC at Pittsburgh Supercomputing Center: 750 ES45 4-way SMP nodes (1GHz Alpha EV68) and a Quadrics QsNet interconnect with two interconnect interfaces per node; /scratch2 was used for input datasets and history and restart output files.

    System Software:
    Tru64 UNIX version 5.1A
    Compaq AlphaServer SC TS2.5
    Compaq Fortran Compiler X5.4A-1684-46B5P
    Compiler optimization flags:
    -omp -automatic -fpe3 -check omp_bindings -O3 -inline speed

  • IBM p690 cluster at ORNL: 27 p690 32-way SMP nodes (1.3 GHz POWER4) and an SP Switch2 with either two or eight switch interfaces per node; /tmp/gpfs750b was used for input datasets and history and restart output files

    System Software:
    AIX version 5.1.0.35
    Parallel Operating Environment version 3.2.0.10
    XL Fortran Compiler version 7.1.1.3
    Compiler optimization flags:
    -qarch=auto -qspillsize=2500 -g -O3 -qstrict -Q -qsmp=omp
    Runtime Flags:
    setenv MP_SHARED_MEMORY yes

  • IBM SP at ORNL: 176 Winterhawk II 4-way SMP nodes (375MHz POWER3-II) and an SP Switch with one switch interface per node; /tmp/gpfs600a was used for input datasets and history and restart output files

    System Software:
    AIX version 5.1.0.25
    Parallel Operating Environment version 3.2.0.9
    XL Fortran Compiler version 7.1.1.3
    Compiler optimization flags:
    -qarch=auto -qspillsize=2500 -g -O3 -qstrict -Q -qsmp=omp
    Runtime Flags:
    setenv MP_SHARED_MEMORY yes

  • SGI Origin 3000 at Los Alamos National Laboratory: 512-way SMP (500 MHz MIPS R14000); the home directory file system was used for input datasets and history and restart output files.

    System Software:
    IRIX64 version 6.5
    Message Passing Toolkit version 1.6
    MIPSpro Compilers version 7.3.1.3m
    Compiler optimization flags (MPI-only):
    -64 -O2
    Compiler optimization flags (MPI/OpenMP):
    -64 -O2 -mp -MP:dsm=OFF -MP:old_mp=OFF
    Runtime Flags for MPI:
    setenv MPI_BUFFER_MAX 2000
    setenv MPI_BUFS_PER_PROC 32
    Additional Runtime Flags for MPI/OpenMP:
    setenv MPI_OPENMP_INTEROP 1
    setenv MPI_DSM_PLACEMENT "threadroundrobin"

A few experiments were run to determine the impact of higher levels of optimization. For example, -O3 on the SGI improved serial performance by approximately 10%, but at the cost of losing bit-for-bit reproducibility with respect to the number of processors. Similar results hold for the other platforms. Further compiler optimizations need to be applied on a subroutine by subroutine basis to be effective. Note that the SGI shows a mild degradation in performance when enabling both MPI and OpenMP parallelism. For this reason, when running "MPI-only" the -mp compiler switch was not used.

Experiment

Experiments were run using a modified version of CAM2_0_1.dev10 that includes the same performance improvements in the Semi-Lagrangian dynamics that were introduced in the Eulerian dynamics in CAM2_0_1.dev3. (Go here for more details.) These will be checked in to the CAM repository in the near future.

Performance data were collected for a 30 day simulation with model resolution T42L26 and the default daily output to the atmosphere history file and monthly output to the land history file and atmosphere and land restart files. Runs with the Eulerian dynamical core used 1200 second timesteps. Runs with the Semi-Lagrangian dynamical core used 3600 second timesteps. Multiple runs were made, and the minimum timings are reported. It is noted in the accompanying text when the average is much higher than the minimum.

To isolate the effect of I/O, runs were also made without writing the history and restart files, but using the same settings (chunk size, load balancing, number of the threads per process) that minimzed the execution time when writing these files. Performance is cited in terms of the simulation years per day throughput metric, i.e. how many simulation years can be computed in a 24 hours of computing. Results are presented for the best performance for a range of processor counts for each system. Timing was not started until after the model initialization was complete, immediately before beginning the computation of the first timestep.

Tuning

CAM has two primary computational phases, the dynamics and the physics. The physics has the higher serial complexity for the target problem resolutions, but it is also easily parallelized. Parallel algorithms for the dynamics require significantly more interprocessor communication than for the physics. In addition, the parallel implementation of the spectral Eulerian and Semi-Lagrangian dynamics currently support only a 1-D decomposition of the computational domain, limiting the number of usable processors to 64 for T42L26. By using hybrid MPI/OpenMP parallelism, more processors can be applied to the physics than to the dynamics, improving scalability.

The physics is "column-based", with computation on different columns (longitude-latitude coordinates) being independent. The work required on a column varies with location and with both diurnal and seasonal cycles, but is relatively insensitive to other factors and static load balancing is quite effective. The basic computational unit in the physics in the "chunk", which is a subset of columns. The maximum number of columns in a given chunk is set at compile time. The choice of the chunk size and composition can affect both serial performance, due to the effect on memory access patterns, and parallel performance, as both MPI and OpenMP parallelism act at the chunk level. For example, smaller chunks imply more chunks and more exploitable parallelism. Optimal load balancing comes at the cost of increased interprocessor communication, remapping data from the domain decomposition used in the dynamics, but approximate load balancing that avoids interprocessor communication is also an option. For more details on chunk definition, see the CAM1 Performance tuning and benchmarking web page.

Three optimization exercises were undertaken on each platform;

  • Experiments were run to determine the optimal chunk size, both with and without optimal load balancing.
  • For each processor count, performance was measured both with and without optimal load balancing.
  • For each processor count, performance was measured for MPI-only, two threads per MPI process, four threads per MPI process, and eight threads per MPI process, where (number of MPI processes)*(number of threads per MPI process) equals the number of processors.

Platform Comparisons

The following graphs describe the baseline performance of the Community Atmospheric Model on the target platforms. They compare performance between platforms for the spectral Eulerian and spectral Semi-lagrangian dynamical cores, respectively. Note that when using more than 64 processors, CAM is necessarily run using at most 64 MPI processes and some number of OpenMP threads per process.

The primary difference in the platform comparisons between the spectral Eulerian and the spectral Semi-Lagrangian is the relative frequency of I/O. The timestep for SLD is 3 times larger than for EUL, but history updates still occur once per simulation day. Thus the ratio of the number of history updates to the number of timesteps is 3 times higher for SLD than for EUL. The history update consists of a gather to a single processor, which then does the write. This has a larger impact on the AlphaServer SC than on the IBM Power4 system. In particular, the load balanced physics decomposition increases the cost of the gather phase. On the IBM, the load balanced decomposition is still better overall. In contrast, on the AlphasServer SC the unbalanced decomposition has a smaller runtime, indicating that the gather is a significant performance bottleneck on this system. This performance difference (and performance degradation) can be decreased by moving to a mutliple writer implementation of the history update.

Dycore Comparisons

The following graphs compare the performance of the dynamical cores on each platform separately, both with and without I/O. For the AlphaServer SC system, timings without I/O but with physics load balancing are included for 128 and 256 processors, to demonstrate the impact of I/O on the choice of optimal algorithm. For all other processor counts, the load balanced physics decomposition is optimal on this system.

From these data it can be seen that the single writer implementation of history and restart file writes is a performance bottleneck for the higher processor counts on all systems but the Origin. This will only become worse once a 2D parallel implementation of the dynamics improves the dynamics performance scaling. While difficult to predict, running at higher resolutions increases the amount of data being gathered and written, and performance scalability will likely become even more sensitive to I/O.

The other noteworthy result from these experiments is that SLD is 40% - 100% faster than EUL on these systems, even with full I/O.




   
  ORNL | Directorate | CSM | NCCS | ORNL Disclaimer | Search
Staff only: CSM computers | who, what, where? | news
 
URL: http://www.csm.ornl.gov/evaluation/CAM/benchmark.html
Updated: Friday, 20-Dec-2002 11:44:26 EST

webmaster