The Parallel Climate Transitional Model (PCTM) is the next generation of the Parallel Climate Model. It consists of atmosphere, ocean, land surface, and sea ice component models, plus a coupler that exchanges fluxes between them. The atmospheric model is a recent version of the Community Climate Model, developed at the National Center for Atmospheric Research (NCAR). The ocean model is POP (Parallel Ocean Program), developed at Los Alamos National Laboratory (LANL), the Naval Postgraduate School (NPS), and NCAR. PCTM is used in production on the IBM SPs at both ORNL and NERSC.

For our configuration study, we ran a series of experiments using a benchmark problem size specified by Tom Bettge. First, runs were made using 2, 4, 8, 16, and 32 processors on a 32-processor p690. Timings were made using
- the default mapping of processes to processors
- binding processes to consecutive processors, starting with processor 0 (block mapping)
- binding processes to every other processor, starting with processor 0 (alternating mapping)
- running within an 8-processor LPAR, which forces a block mapping when using 8 processes

The alternating mapping represents the performance that would be seen on an "HPC" version of the p690, where the Multichip Module (MCM) has only 4 processors and processors do not share L2 caches.
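The block and alternating mappings described above can be sketched as simple index calculations. This is only an illustration of which processor each process lands on; the function names and the `n_cpus` parameter are hypothetical, and the actual mechanism used to bind processes on the p690 is not specified here.

```python
def block_mapping(n_procs: int, n_cpus: int = 32) -> list[int]:
    """Bind process i to processor i (consecutive processors from 0)."""
    if n_procs > n_cpus:
        raise ValueError("more processes than processors")
    return list(range(n_procs))


def alternating_mapping(n_procs: int, n_cpus: int = 32) -> list[int]:
    """Bind process i to processor 2*i (every other processor from 0)."""
    # The last process would land on processor 2*(n_procs - 1),
    # which must still exist on the node.
    if 2 * (n_procs - 1) >= n_cpus:
        raise ValueError("alternating mapping does not fit on this node")
    return [2 * i for i in range(n_procs)]


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, block_mapping(n), alternating_mapping(n))
```

Under this sketch, the alternating mapping uses at most half of the processors, so on a 32-processor node it applies only to runs of up to 16 processes.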
![]()
These results show that using the alternating mapping instead of the block mapping improves PCTM performance by at most 15%, and that scaling is reasonable out to 32 processes on the p690. While not shown here, there is no performance difference between running on 32GB and 128GB p690 nodes.
The second set of experiments examines the performance variation when work is assigned to all processors in a p690. For example, when running 8-process PCTM jobs, 4 jobs would be assigned to the same node. The "Forecast Years per Day" metric now describes the total node output, i.e., the sum of the number of forecast years per day over all jobs assigned to the node. Timings were made using
- the default mapping of processes to processors
- binding processes for a given job to consecutive processors (block mapping)
- interleaving processes from the different jobs. For example, for four 8-process jobs, processor 0 is assigned process 0 from job 0, processor 1 is assigned process 0 from job 1, processor 2 is assigned process 0 from job 2, etc.
- running an 8-process job in each of the four 8-processor LPARs on a 32-way node
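The per-job block mapping, the interleaved mapping, and the node-level metric can be sketched as follows. The function names are hypothetical, and the continuation of the interleaving pattern beyond the processors listed above (round-robin over jobs) is an assumption based on the "etc." in the description.

```python
def block_mapping_jobs(n_jobs: int, procs_per_job: int) -> list[list[int]]:
    """mapping[job][process] -> processor; each job gets consecutive processors."""
    return [
        [job * procs_per_job + proc for proc in range(procs_per_job)]
        for job in range(n_jobs)
    ]


def interleaved_mapping_jobs(n_jobs: int, procs_per_job: int) -> list[list[int]]:
    """mapping[job][process] -> processor; jobs are dealt out round-robin.

    Assumption: after process 0 of every job has been placed, process 1 of
    job 0 goes on the next processor, and so on.
    """
    return [
        [proc * n_jobs + job for proc in range(procs_per_job)]
        for job in range(n_jobs)
    ]


def node_forecast_years_per_day(per_job_rates: list[float]) -> float:
    """Total node output: sum of forecast years per day over all jobs."""
    return sum(per_job_rates)


if __name__ == "__main__":
    for job, cpus in enumerate(interleaved_mapping_jobs(4, 8)):
        print(f"job {job}: processors {cpus}")
```

Both mappings occupy all 32 processors for four 8-process jobs; they differ only in whether the processes of one job are neighbors or are spread across the node.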
![]()
The above graph is a close-up of the performance of these four options. The block mapping is best, and performance improves further when the LPARs are used to run the 8-process jobs. The differences are small, however, as can be seen from the following graph, which also plots the standard scaling experiment.
![]()
The conclusion to be drawn from these results is that, for this application, LPARs work fine when sharing a node, and node sharing will not significantly impact PCTM throughput. Note that these experiments did not include significant I/O. Once the I/O subsystem is configured for production runs, they will need to be repeated.