| home | about us | contact | ||||
![]() |
| |||
| CSM Home | ||||||||||||||||||||||||||||||||||||||
|
Evaluation of Early SystemsEvolution of Performance in the Community Atmospheric Model (CAM)IntroductionThe following graphs describe the performance evolution of the Community Atmospheric Model when run with the spectral Eulerian dynamical core. Over the past few years, a number of modifications have been made to the CAM with the goal of improving performance. With two exceptions, described below, these changes were introduced as options that can be disabled. This allows us to examine the impact of these changes and the evolution of CAM performance. The data presented here were collected by Patrick H. Worley on the the IBM p690 cluster at Oak Ridge National Laboratory and on the HP/Compaq AlphaServer SC cluster (lemieux.psc.edu) at Pittsburgh Supercomputer Center during September and October of 2002. Version 2.0 of CAM was used for all but two of the experiments. The problem size was T42L26, corresponding to a 128 by 64 by 26 computational grid, using a 20 minute timestep. Experiments were run without writing the history and restart files. Benchmark timings that include standard production I/O are described in the CAM2 Performance Benchmarks.CAM has two primary computational phases, the dynamics and the physics. The physics has the higher serial complexity for the target problem resolutions, but it is also easily parallelized. Parallel algorithms for the dynamics require significantly more interprocessor communication than for the physics. In addition, the parallel implementation of the spectral Eulerian dynamics currently supports only a 1-D decomposition of the computational domain, limiting the number of usable processors to 64 for T42L26. By using hybrid MPI/OpenMP parallelism, more processors can be applied to the physics than to the dynamics, improving scalability. The physics is "column-based", with computation on different columns (longitude-latitude coordinates) being independent. The work required on a column varies with location and with both diurnal and seasonal cycles, but is relatively insensitive to other factors and static load balancing is quite effective. The primary code modification examined here is a redefinition of the basic data structures used in the physics, allowing the basic unit of work, the number and choice of columns assigned to a "chunk", to be determined at runtime. The maximum number of columns in a given chunk is set at compile time. The choice of the chunk size and composition can affect both serial performance, due to the effect on memory access patterns, and parallel performance, as both MPI and OpenMP parallelism act at the chunk level. For example, smaller chunks imply more chunks and more exploitable parallelism. Optimal load balancing comes at the cost of increased interprocessor communication, remapping data from the domain decomposition used in the dynamics, but approximate load balancing that avoids interprocessor communication is also an option. For more details on chunk definition, see the CAM1 Performance tuning and benchmarking web page.
The experiments look at performance for a number of different algorithmic options:
Two additional experiments were run. The 1-D decomposition used in the dynamics parallel algorithm is in the process of being replaced by a 2-D decomposition. As part of this effort, we cleaned up the current 1-D algorithm, eliminating some unnecessary interprocessor communication. This update replaced code in CAM2.0, and is not an option that can be enabled and disabled.
The first graph describes the performance from the above experiments when run on the AlphaServer SC.
As can be seen, the performance improvement due to the new data structure is not significant until using MPI/OpenMP and larger processor counts. At this point, it becomes significant. The improved dynamics algorithms and CLM2 interchange also have significant impacts, enabling reasonable scaling out to 256 processors. The AlphaServer SC is a cluster of 4-way SMP nodes, so supports at most 4 threads per MPI process, limiting the total number of processors to 256. The next graph describes performance for the same experiments on the p690 cluster. The p690 is a 32-way SMP node, with a limitation of 2 network cards per OS image. However, the 32-way system can be partitioned into 4 8-way "logical" SMP nodes (LPARs), each of which can be populated with 2 network cards. This increases the bandwidth for internode communication significantly. In the following graph we used a single 32-way SMP node for up to 32 processors, then ran on the 8-way logical SMP nodes for 64 or more processors.
Moving from a single 32-way SMP node to the 8-way LPARs, requiring communication over the switch rather than via shared-memory, prevented performance improvement for the MPI-only experiments when going to 64 processors. The hybrid algorithm continues to show performance improvement despite the transition, especially when using the updated dynamics and CLM2 interchange, which have significantly less interprocessor communication. The next graph compares "best" performance between the AlphaServer SC and the p690 cluster. The following two graphs compare the time spent in the physics and the dynamics on each platform as a function of processor count for the best parallel algorithm. The first of these graphs uses a log-linear scale, while the other uses a log-log scale.
As can be seen, the physics does not run efficiently on the p690 when compared to the AlphaServer SC, despite a much higher peak processor performance on the p690. The physics is dominated by sqrt, exp, log, pow, etc., operations that do not use the floating point multiply-add pipes in the POWER4 processor efficiently. The p690 does execute the physics faster than the AlphaServer SC when using 256 processors. In this instance, there is a single chunk per processor. Perhaps OpenMP is more efficient on the AlphaServer SC than on the p690 when there are more chunks than threads, but this advantage diminishes when there is only one thread per chunk. This is just conjecture at this time. For the dynamics, floating point multiply-add is much more common, and the math intrinsic functions much less common. This gives the advantage to the p690. For large processor counts, communication overhead and lack of parallelism beyond 64 MPI processes constrains performance. It is difficult to compare interprocessor communication overhead between these two systems for large processor counts. Both systems prefer one MPI task per SMP node when using 128 and 256 processors, using OpenMP parallelism within a node. For 256 processors, this forces the AlphaServer SC to use 64 MPI processes, which causes load imbalance in the dynamics. In contrast, the p690 can use 32 MPI processes, which does not incur the same performance problems. For a direct comparison, we can compare communication costs when using 16 MPI tasks and 4 OpenMP threads on both systems. (This comparison still has a problem in that two MPI tasks are assigned to a single SMP node on the p690, and only one is assigned on the AlphaServer SC. On the p690, communication between processes on the same node are faster, but the two processes contend for bandwidth when communicating with remote processes.) In this comparison, MPI_SENDRECV communication is approximately 50% faster on the AlphaServer SC than on the p690, and the collective communication calls MPI_SCATTER, MPI_GATHER, and MPI_ALLTOALLV are 20% faster. This indicates than when a 2D decomposition of the dynamics is implemented and the 64 MPI process performance bottleneck is removed, the AlphaServer SC may yet be the better performance for high processor counts. However, when the Federation switch becomes available for the p690 in 2003, the AlphaServer SC advantage in interprocessor communication performance is likely to disappear. When using 256 processors, the physics is approximately the same cost as the dynamics. The physics is continuing to scale well at this point, but the dynamics cost has stopped decreasing. Thus, implemeting a 2D decomposition in the dynamics will improve scalability and performance for large processor counts. However, it will not significantly improve the throughput for the moderate processor counts used in current coupled runs. Improving the serial performance of the physics is likely to have a greater impact on model performance in the near future. |
|||||||||||||||||||||||||||||||||||||
|
ORNL
| Directorate
| CSM
| NCCS
| ORNL Disclaimer
| Search
Staff only: CSM computers | who, what, where? | news |
||||||||||||||||||||||||||||||||||||||
URL: http://www.csm.ornl.gov/evaluation/CAM/progress.eul.html Updated: Monday, 24-Mar-2003 10:33:04 EST webmaster |
||||||||||||||||||||||||||||||||||||||