| home | about us | contact | ||||
![]() |
| |||
| CSM Home | |||||||||||||||||||||||||||||||||||||||
|
Evaluation of Early SystemsCommunity Atmospheric Model (CAM) Performance BenchmarksIntroductionThe following graphs describe the performance of the Community Atmospheric Model when run with the spectral Eulerian (EUL) and spectral Semi-Lagrangian (SLD) dynamical cores. Performance was optimized with respect to the
PlatformsResults are presented for four systems. All data were collected November and December of 2002 by Patrick H. Worley.
ExperimentExperiments were run using a modified version of CAM2_0_1.dev10 that includes the same performance improvements in the Semi-Lagrangian dynamics that were introduced in the Eulerian dynamics in CAM2_0_1.dev3. (Go here for more details.) These will be checked in to the CAM repository in the near future. Performance data were collected for a 30 day simulation with model resolution T42L26 and the default daily output to the atmosphere history file and monthly output to the land history file and atmosphere and land restart files. Runs with the Eulerian dynamical core used 1200 second timesteps. Runs with the Semi-Lagrangian dynamical core used 3600 second timesteps. Multiple runs were made, and the minimum timings are reported. It is noted in the accompanying text when the average is much higher than the minimum. To isolate the effect of I/O, runs were also made without writing the history and restart files, but using the same settings (chunk size, load balancing, number of the threads per process) that minimzed the execution time when writing these files. Performance is cited in terms of the simulation years per day throughput metric, i.e. how many simulation years can be computed in a 24 hours of computing. Results are presented for the best performance for a range of processor counts for each system. Timing was not started until after the model initialization was complete, immediately before beginning the computation of the first timestep. TuningCAM has two primary computational phases, the dynamics and the physics. The physics has the higher serial complexity for the target problem resolutions, but it is also easily parallelized. Parallel algorithms for the dynamics require significantly more interprocessor communication than for the physics. In addition, the parallel implementation of the spectral Eulerian and Semi-Lagrangian dynamics currently support only a 1-D decomposition of the computational domain, limiting the number of usable processors to 64 for T42L26. By using hybrid MPI/OpenMP parallelism, more processors can be applied to the physics than to the dynamics, improving scalability. The physics is "column-based", with computation on different columns (longitude-latitude coordinates) being independent. The work required on a column varies with location and with both diurnal and seasonal cycles, but is relatively insensitive to other factors and static load balancing is quite effective. The basic computational unit in the physics in the "chunk", which is a subset of columns. The maximum number of columns in a given chunk is set at compile time. The choice of the chunk size and composition can affect both serial performance, due to the effect on memory access patterns, and parallel performance, as both MPI and OpenMP parallelism act at the chunk level. For example, smaller chunks imply more chunks and more exploitable parallelism. Optimal load balancing comes at the cost of increased interprocessor communication, remapping data from the domain decomposition used in the dynamics, but approximate load balancing that avoids interprocessor communication is also an option. For more details on chunk definition, see the CAM1 Performance tuning and benchmarking web page. Three optimization exercises were undertaken on each platform;
Platform ComparisonsThe following graphs describe the baseline performance of the Community Atmospheric Model on the target platforms. They compare performance between platforms for the spectral Eulerian and spectral Semi-lagrangian dynamical cores, respectively. Note that when using more than 64 processors, CAM is necessarily run using at most 64 MPI processes and some number of OpenMP threads per process.
The primary difference in the platform comparisons between the spectral Eulerian and the spectral Semi-Lagrangian is the relative frequency of I/O. The timestep for SLD is 3 times larger than for EUL, but history updates still occur once per simulation day. Thus the ratio of the number of history updates to the number of timesteps is 3 times higher for SLD than for EUL. The history update consists of a gather to a single processor, which then does the write. This has a larger impact on the AlphaServer SC than on the IBM Power4 system. In particular, the load balanced physics decomposition increases the cost of the gather phase. On the IBM, the load balanced decomposition is still better overall. In contrast, on the AlphasServer SC the unbalanced decomposition has a smaller runtime, indicating that the gather is a significant performance bottleneck on this system. This performance difference (and performance degradation) can be decreased by moving to a mutliple writer implementation of the history update. Dycore ComparisonsThe following graphs compare the performance of the dynamical cores on each platform separately, both with and without I/O. For the AlphaServer SC system, timings without I/O but with physics load balancing are included for 128 and 256 processors, to demonstrate the impact of I/O on the choice of optimal algorithm. For all other processor counts, the load balanced physics decomposition is optimal on this system.
From these data it can be seen that the single writer implementation of history and restart file writes is a performance bottleneck for the higher processor counts on all systems but the Origin. This will only become worse once a 2D parallel implementation of the dynamics improves the dynamics performance scaling. While difficult to predict, running at higher resolutions increases the amount of data being gathered and written, and performance scalability will likely become even more sensitive to I/O. The other noteworthy result from these experiments is that SLD is 40% - 100% faster than EUL on these systems, even with full I/O. |
||||||||||||||||||||||||||||||||||||||
|
ORNL
| Directorate
| CSM
| NCCS
| ORNL Disclaimer
| Search
Staff only: CSM computers | who, what, where? | news |
|||||||||||||||||||||||||||||||||||||||
URL: http://www.csm.ornl.gov/evaluation/CAM/benchmark.html Updated: Friday, 20-Dec-2002 11:44:26 EST webmaster |
|||||||||||||||||||||||||||||||||||||||