CCM/MP-2D Performance Studies

Performance Studies using CCM/MP-2D


CCM/MP-2D is one of the many versions of the NCAR Community Climate Model (CCM). For a brief description of why it exists, its relationship to the standard distributions of the CCM, and the numerical and parallel algorithms used in its implementation, use the following links:

Studies

CCM/MP-2D is not currently the production version of the CCM. However, something like CCM/MP-2D is necessary if more than 64 processors are to be used (with message passing). Exploiting shared memory, via OpenMP or similar compiler directives, may provide an alternative way to exploit parallelism on clusters of shared memory multiprocessors without requiring major modifications to the standard distribution. We are examining this approach with both CCM/MP-2D and the standard CCM distribution, and the results will be added to these web pages as they are produced.

In these studies, we examine the performance of CCM/MP-2D on platforms of interest to the U.S. climate research community, determining optimal tuning parameters and examining the sensitivity of performance to these parameters. CCM/MP-2D has numerous options that can be set to tune performance on a parallel platform. At the most primitive level is the choice of communication protocol: for example, which of the many MPI point-to-point communication routines to use, and whether to try to overlap communication with computation or hide latency. Different choices may be appropriate for different phases of the computation; for example, the best protocol for the physics load balancing algorithm may differ from the best protocol for the parallel semi-Lagrangian algorithm. At the next level is the choice of parallel algorithm used to implement the transpose, equivalent to the MPI_ALLTOALLV command, and the collective summation, equivalent to the MPI_ALLREDUCE command. These MPI collective communication commands are also supported options, but they are not always the best choice. At a higher level is the choice of distributed versus transpose parallel FFT algorithm.
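To make concrete what it means to build a collective summation from point-to-point exchanges, here is a minimal sketch of recursive doubling, one well-known way to obtain the result of MPI_ALLREDUCE. It is a serial simulation, not MPI code, and it is an illustration only: this page does not say which summation algorithms CCM/MP-2D actually offers. The function name and the power-of-two restriction are assumptions of the sketch.

```python
def allreduce_recursive_doubling(values):
    """Simulate a sum-allreduce in which values[r] is the contribution of rank r.

    Hypothetical helper for illustration; requires a power-of-two rank count.
    Returns the list of results held by each rank after the final exchange.
    """
    nprocs = len(values)
    if nprocs == 0 or nprocs & (nprocs - 1):
        raise ValueError("this sketch needs a power-of-two number of ranks")

    partial = list(values)
    step = 1
    while step < nprocs:
        # In this step, rank r swaps partial sums with rank (r XOR step)
        # and both add what they received to what they already hold.
        partial = [partial[rank] + partial[rank ^ step] for rank in range(nprocs)]
        step *= 2
    return partial


if __name__ == "__main__":
    # Four simulated ranks finish after log2(4) = 2 exchange steps.
    print(allreduce_recursive_doubling([1, 2, 3, 4]))
```

Each rank takes part in log2(P) pairwise exchanges, so the protocol chosen for those exchanges affects the cost of the whole summation, which is one reason the two levels of tuning interact.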

At the highest level is the aspect ratio of the logical processor grid and the mapping of the logical processor grid to the real machine. For example, 64 processors can be configured as a 64x1, 32x2, 16x4, etc. logical processor grid, where the first number is the number of processors assigned to compute the parallel FFT (these decompose the longitude direction) and the second is the number of processors assigned to compute the parallel Legendre transform. Different choices also imply differently shaped domain decomposition patches, which will affect the efficiency of the parallel semi-Lagrangian algorithm.
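The set of candidate aspect ratios for a given processor count can be enumerated mechanically. The sketch below lists every factorization of the processor count and the patch shape each one implies. The function names, the 128 x 64 grid size, the assumption that the second grid dimension decomposes latitude, and the simple block decomposition are all illustrative choices, not details taken from CCM/MP-2D.

```python
from math import ceil


def processor_grids(nprocs):
    """Return every (p_lon, p_lat) pair with p_lon * p_lat == nprocs.

    p_lon is the first number in the "AxB" notation, p_lat the second.
    """
    grids = []
    for p_lat in range(1, nprocs + 1):
        if nprocs % p_lat == 0:
            grids.append((nprocs // p_lat, p_lat))
    return grids


def patch_shape(nlon, nlat, p_lon, p_lat):
    """Largest patch (longitude points, latitude points) under a plain block split."""
    return ceil(nlon / p_lon), ceil(nlat / p_lat)


if __name__ == "__main__":
    NLON, NLAT = 128, 64  # assumed grid size, for illustration only
    for p_lon, p_lat in processor_grids(64):
        lon_pts, lat_pts = patch_shape(NLON, NLAT, p_lon, p_lat)
        print(f"{p_lon}x{p_lat}: patch of {lon_pts} x {lat_pts} grid points")
```

Under these assumptions a 64x1 grid gives tall, narrow patches and a 1x64 grid gives wide, flat ones, which illustrates why the aspect ratio changes the shape of the patches.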

CCM/MP-2D is a large code with a relatively expensive initialization phase, requiring the input of large static datasets. In a production run, the initialization cost is unimportant, as the code will run for days or weeks. However, in a short tuning evaluation run, the initialization phase dominates the runtime, limiting the number of tests that can be made with the full code. To aid in tuning the performance of CCM/MP-2D, we use the kernel codes COMMTEST, PCRM, and PSTSWM.

COMMTEST tests the performance of exchanging data between two or more processors. For these experiments we look first at the "peak achievable" rate for swapping data between two processors, using both unidirectional and bidirectional protocols. The distinguishing feature of this test code is that it uses the same communication primitive wrappers used in CCM/MP-2D, so that the results are relevant to what would be seen in the production code. All available message-passing protocols for exchanging data between two processes are examined, for a large range of message sizes.

CRM is a standalone version of the column radiation model used in the Community Climate Model, and represents one of the major computational tasks in the column physics computations. Because the physics computations are independent between columns and because the parallel implementation does not parallelize individual column calculations, CRM is a sequential benchmark. However, the parallel implementation can affect the performance of the sequential code by, for example, changing the number and memory layout of the columns assigned to a given processor. We have modified CRM slightly to examine these issues. In particular, we have reintroduced the array dimensioning on latitude that is found in CCM, added control logic to examine a variety of domain decompositions, and added support for running multiple instances of the serial code in parallel. To differentiate this modified code from the stock distribution from NCAR, we henceforth refer to the kernel code as PCRM.

PSTSWM is a parallel spectral transform shallow water model that is an accurate representation of the parallel algorithms used for the dry dynamics in CCM/MP-2D. In particular, the parallel algorithms were designed and evaluated in PSTSWM first, then ported to CCM/MP-2D. A series of test suites has been developed for PSTSWM that look at all possible communication protocols for each of the parallel algorithms used in the spectral transform method. From these data, we identify a small number of parallel algorithms and implementations to examine in the context of CCM/MP-2D. We do not currently have a kernel code for the parallel semi-Lagrangian or physics load-balancing algorithms; however, some of the PSTSWM options and the COMMTEST results are relevant to these, and provide data sufficient to make intelligent decisions.

However, COMMTEST and PSTSWM cannot determine how the different parallel algorithm options interact in the full code, nor can they show the effect of the different aspect ratios. Using the full code, we test all possible aspect ratios for each of the interesting parallel algorithms and for each total number of processors.

Worley's Performance Studies Page


Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 09:58:47 EDT.