home  |  about us  |  contact  
 

 CSM Home  
 CSM Home

   

Evaluation of Early Systems


Evolution of Performance in the Community Atmospheric Model (CAM)

Introduction

The following graphs describe the performance evolution of the Community Atmospheric Model when run with the spectral Semi-Lagrangian dynamical core. Over the past few years, a number of modifications have been made to the CAM with the goal of improving performance. With two exceptions, described below, these changes were introduced as options that can be disabled. This allows us to examine the impact of these changes and the evolution of CAM performance. The data presented here were collected by Patrick H. Worley on the the IBM p690 cluster at Oak Ridge National Laboratory during Novermber of 2002. Version 2.0 of CAM was used for all but two of the experiments. The problem size was T42L26, corresponding to a 128 by 64 by 26 computational grid, using a 60 minute timestep. Experiments were run without writing the history and restart files. Benchmark timings that include standard production I/O are described in the CAM2 Performance Benchmarks.

CAM has two primary computational phases, the dynamics and the physics. The physics has the higher serial complexity for the target problem resolutions, but it is also easily parallelized. Parallel algorithms for the dynamics require significantly more interprocessor communication than for the physics. In addition, the parallel implementation of the spectral Eulerian dynamics currently supports only a 1-D decomposition of the computational domain, limiting the number of usable processors to 64 for T42L26. By using hybrid MPI/OpenMP parallelism, more processors can be applied to the physics than to the dynamics, improving scalability.

The physics is "column-based", with computation on different columns (longitude-latitude coordinates) being independent. The work required on a column varies with location and with both diurnal and seasonal cycles, but is relatively insensitive to other factors and static load balancing is quite effective. The primary code modification examined here is a redefinition of the basic data structures used in the physics, allowing the basic unit of work, the number and choice of columns assigned to a "chunk", to be determined at runtime. The maximum number of columns in a given chunk is set at compile time. The choice of the chunk size and composition can affect both serial performance, due to the effect on memory access patterns, and parallel performance, as both MPI and OpenMP parallelism act at the chunk level. For example, smaller chunks imply more chunks and more exploitable parallelism. Optimal load balancing comes at the cost of increased interprocessor communication, remapping data from the domain decomposition used in the dynamics, but approximate load balancing that avoids interprocessor communication is also an option. For more details on chunk definition, see the CAM1 Performance tuning and benchmarking web page.

The experiments look at performance for a number of different algorithmic options:

  • Original latitude-slice decomposition using 128 column chunks, MPI-only
  • 16 column chunks, approximately load-balanced within the dynamics latitude-slice domain-decomposition, MPI-only
  • 16 column chunks, optimal static load balancing of chunks, MPI-only
  • 16 column chunks, optimal static load balancing of chunks, hybrid MPI/OpenMP
Note that hybrid MPI/OpenMP has always been part of the CAM design. However, pure MPI typically performs as efficiently as hybrid MPI/OpenMP when using a 1-D domain decomposition, 128-column chunks, and the same total number of processors. Using MPI/OpenMP and smaller chunks, we can use additional processors in the physics when running on clusters of SMP-nodes, leaving these additional processors idle during the dynamics. This is reflected in the higher processor counts for the hybrid cases in the graphs. The number of threads per MPI process in the hybrid experiments varies with processor count, and the hybrid experiments include the MPI-only option as a special case. All thread counts were tried, and only the best performance is reported here.

Two additional experiments were run. In CAM2.0, the CLM2 land model communicates with the atmosphere via a gather/scatter algorithm, gathering all necessary data into a single processor, remapping it to the appropriate data structure, then scattering back to the other processors. This has recently been replaced with an alltoall-based algorithm, eliminating the single processor bottleneck and decreasing the interprocessor communication overhead. This update replaced code in CAM2.0, and is not an option that can be enabled and disabled.

  • 16 column chunks, optimal static load balancing of chunks, hybrid MPI/OpenMP, updated CLM2/atmosphere interchange.
The 1-D decomposition used in the dynamics parallel algorithm is in the process of being replaced by a 2-D decomposition. As part of this effort, we cleaned up the current 1-D algorithm, eliminating some unnecessary interprocessor communication. This update is applied on top of the modification to the land/atmosphere interchange, and is also not an option that can be enabled and disabled.
  • 16 column chunks, optimal static load balancing of chunks, hybrid MPI/OpenMP, updated CLM2/atmosphere interchange, updated dynamics algorithm.

The following graph describes the performance from the above experiments when run on the p690 cluster. The p690 is a 32-way SMP node, with a limitation of 2 network cards per OS image. However, the 32-way system can be partitioned into 4 8-way "logical" SMP nodes (LPARs), each of which can be populated with 2 network cards. This increases the bandwidth for internode communication significantly. In the following graph we used a single 32-way SMP node for up to 32 processors, then ran on the 8-way logical SMP nodes for 64 or more processors.

As can be seen, the performance improvement due to the new data structure first becomes significant when using 32 processors and load balancing. Using MPI/OpenMP then makes it possible to use much larger processor counts. The improved dynamics algorithms and CLM2 interchange also have significant impacts, enabling reasonable scaling out to 256 processors.

Moving from a single 32-way SMP node to the 8-way LPARs, requiring communication over the switch rather than via shared-memory, prevented performance improvement for the MPI-only experiments when going to 64 processors. The hybrid algorithm continues to show performance improvement despite the transition, especially when using the updated dynamics and CLM2 interchange, which have significantly less interprocessor communication.




   
  ORNL | Directorate | CSM | NCCS | ORNL Disclaimer | Search
Staff only: CSM computers | who, what, where? | news
 
URL: http://www.csm.ornl.gov/evaluation/CAM/progress.sld.html
Updated: Monday, 24-Mar-2003 10:33:22 EST

webmaster