Evaluation of Early Systems

Community Atmospheric Model Performance Benchmarks

Spectral Eulerian Dynamics at T42L26

Introduction

The Community Atmospheric Model (CAM) has two primary computational phases, the dynamics and the physical parameterizations (physics). CAM currently supports three different approximations to the dynamics, or dynamical cores,
on each target platform. However, no attempt was made to find better compiler optimizations. The platforms, compiler flags, and tuning options are described in more detail below. The following should be considered baseline performance. Improvements are expected as CAM is optimized on each platform.

Platforms

Results are presented for four systems. All data were collected in November and December of 2002 by Patrick H. Worley.
A few experiments were run to determine the impact of higher levels of optimization. For example, -O3 on the SGI improved serial performance by approximately 10%, but at the cost of losing bit-for-bit reproducibility with respect to the number of processors. Similar results hold for the other platforms. Further compiler optimizations need to be applied on a subroutine-by-subroutine basis to be effective. Note that the SGI shows a mild degradation in performance when both MPI and OpenMP parallelism are enabled. For this reason, the -mp compiler switch was not used when running "MPI-only".

Experiment

Experiments were run using CAM2_0_1.dev10, which includes all of the performance improvements described in the CAM2 Performance Evolution web page. Performance data were collected for a 30-day simulation with model resolution T42L26, a 1200-second timestep, the default daily output to the atmosphere history file, and monthly output to the land history file and the atmosphere and land restart files. Multiple runs were made, and the minimum timings are reported. The accompanying text notes where the average is much higher than the minimum. Performance is cited in terms of the simulation years per day throughput metric, i.e., how many simulation years can be computed in 24 hours of computing. Results are presented for the best performance for a range of processor counts for each system. Timing was not started until after the model initialization was complete, immediately before beginning the computation of the first timestep.

Tuning

As described above, CAM has two primary computational phases, the dynamics and the physics. The physics has the higher serial complexity for the target problem resolutions, but it is also easily parallelized. Parallel algorithms for the dynamics require significantly more interprocessor communication than for the physics.
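The simulation-years-per-day metric used in the Experiment description can be made concrete with a short calculation. The following is a minimal sketch, not part of CAM. The function names are hypothetical, the 365-day model year is an assumption, and the wall-clock time in the example is an illustrative value rather than a measured result.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # assumption: fixed 365-day model year


def timesteps(sim_days: float, dt_seconds: float) -> int:
    """Number of model timesteps needed to cover sim_days."""
    return round(sim_days * SECONDS_PER_DAY / dt_seconds)


def sim_years_per_day(sim_days: float, wall_seconds: float) -> float:
    """Simulated years completed per 24 hours of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    sim_years = sim_days / DAYS_PER_YEAR
    wall_days = wall_seconds / SECONDS_PER_DAY
    return sim_years / wall_days


if __name__ == "__main__":
    # 30-day run with a 1200-second timestep, as in the benchmark description.
    print(timesteps(30, 1200))  # 2160
    # Hypothetical wall-clock time of 600 s, for illustration only.
    print(sim_years_per_day(30, 600.0))
```

Because timing starts only after initialization, the wall-clock value passed in would cover the timestepping loop alone.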
In addition, the parallel implementation of the spectral Eulerian dynamics currently supports only a 1-D decomposition of the computational domain, limiting the number of usable processors to 64 for T42L26. By using hybrid MPI/OpenMP parallelism, more processors can be applied to the physics than to the dynamics, improving scalability.

The physics is "column-based", with computation on different columns (longitude-latitude coordinates) being independent. The work required on a column varies with location and with both diurnal and seasonal cycles, but it is relatively insensitive to other factors, and static load balancing is quite effective. The basic computational unit in the physics is the "chunk", which is a subset of columns. The maximum number of columns in a given chunk is set at compile time. The choice of the chunk size and composition can affect both serial performance, due to the effect on memory access patterns, and parallel performance, as both MPI and OpenMP parallelism act at the chunk level. For example, smaller chunks imply more chunks and more exploitable parallelism. Global load balancing comes at the cost of increased interprocessor communication, since data must be remapped from the domain decomposition used in the dynamics, but local load balancing that avoids interprocessor communication is also an option. For more details on chunk definition, see the CAM1 Performance tuning and benchmarking web page.

Three optimization exercises were undertaken on each platform;
Platform Comparisons

The following graphs describe the baseline performance of the Community Atmospheric Model on the target platforms when using the spectral Eulerian dynamical core with problem size T42L26. Note that when using more than 64 processors, CAM is necessarily run using at most 64 MPI processes and some number of OpenMP threads per process.
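To illustrate the 64-process limit, the sketch below enumerates the MPI/OpenMP combinations that use a given processor count exactly. It is an illustration only, not CAM's configuration logic. The function name is hypothetical, and it assumes every MPI process runs the same number of OpenMP threads.

```python
MAX_MPI_PROCESSES = 64  # limit stated above for the 1-D decomposition at T42L26


def hybrid_layouts(total_processors: int, max_mpi: int = MAX_MPI_PROCESSES):
    """Return (mpi_processes, omp_threads) pairs that use exactly total_processors.

    Assumes a uniform thread count per MPI process (an illustrative simplification).
    """
    if total_processors < 1:
        raise ValueError("total_processors must be at least 1")
    layouts = []
    for mpi in range(1, min(total_processors, max_mpi) + 1):
        if total_processors % mpi == 0:
            layouts.append((mpi, total_processors // mpi))
    return layouts


if __name__ == "__main__":
    # With 128 processors every layout needs at least 2 threads per process.
    for mpi, threads in hybrid_layouts(128):
        print(f"{mpi:3d} MPI processes x {threads:3d} OpenMP threads")
```

For 64 or fewer processors the list includes an MPI-only layout with one thread per process. Above 64 it does not, which is why those runs are necessarily hybrid.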
In these results,
Thus the p690 cluster and AlphaServerSC systems demonstrate nearly identical performance for this configuration of CAM, with the Origin 3000 and SP system significantly slower.

Performance Diagnosis

The following figures describe the performance of CAM in more detail. Timings for the physics, spectral Eulerian dynamical core, land model, physics/dycore and physics/land interface routines, and I/O and timestep set-up are graphed separately. The interface routines are dominated by interprocessor communication. The I/O and set-up are dominated by I/O and serial bottlenecks. Note that these are log-log graphs, and differences are larger than they appear at first glance.
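The caution about log-log graphs can be quantified: on a logarithmic axis a fixed vertical gap corresponds to a fixed ratio, not a fixed difference. The sketch below uses hypothetical helper names, and the values in the example are illustrative, not taken from the figures.

```python
import math


def decades_apart(a: float, b: float) -> float:
    """Vertical separation of two positive values on a log10 axis, in decades."""
    if a <= 0 or b <= 0:
        raise ValueError("log axes require positive values")
    return abs(math.log10(a) - math.log10(b))


def ratio_from_decades(gap: float) -> float:
    """Multiplicative factor corresponding to a gap measured in decades."""
    return 10.0 ** gap


if __name__ == "__main__":
    # A factor of 2 spans about 0.3 of a decade whatever the magnitude.
    print(decades_apart(2.0, 1.0))
    print(decades_apart(200.0, 100.0))
    # Curves half a decade apart differ by a factor of about 3.16.
    print(ratio_from_decades(0.5))
```

Two curves that look close together on such a graph can therefore still differ by a large factor in run time.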
In summary, a number of additional performance enhancements have been identified, and we expect to implement these in the near future.
URL: http://www.csm.ornl.gov/evaluation/CAM/benchmark.eul.t42l26.html Updated: Monday, 24-Mar-2003 10:29:34 EST