PCRM is a slight modification of the NCAR Column Radiation Model
(CRM). It is given a distinct name simply to denote that it differs
from the stock distribution available from NCAR and is likely to lag
behind updates to CRM.
CRM is a standalone version of the column radiation model
used in the Community Climate Model (CCM); the radiation calculation is
one of the major computational tasks in the column physics computations.
To quote the CRM documentation:
"... the CRM is a
physical process model which isolates the energetics of radiative
transfer from the rest of the CCM3. The CRM is built from the
radiation routines from CCM3, along with a simple text interface for
the user to input information needed by the radiation calculation."
Because the physics computations are independent between columns
and because the parallel implementation used in CCM/MP-2D does not
parallelize individual column calculations, we use (P)CRM to determine
single processor performance.
However, the parallel implementation can affect the performance of
CRM by, for example, changing the number of columns and the memory
layout of the column data assigned to a given processor.
To examine the impact of the parallel implementation on the
performance of the column radiation calculations, we modified
CRM in the following ways.
Completed support for plon > 1. When plon > 1,
CRM computes a set of columns indexed by longitude, which is the first
index in the data and results arrays. All columns are computed
"simultaneously".
Added support for plat > 1;
in particular, added a latitude index to all arrays in CRM that have
a latitude index in CCM. When plat > 1, CRM
computes a set of plon columns plat times. Beyond increasing the
amount of work, varying the plat index (which is the last index in the
arrays) changes how CRM accesses memory, which may affect performance.
Added timing logic.
Added initialization logic, to support timing from a cold start
and after some number of warm-up runs.
Removed netCDF dependencies.
Added MPI logic, to allow multiple instances to be run.
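As a rough illustration of why the plon and plat extents change how memory is accessed, the sketch below computes where an element lands in a Fortran-style column-major array. The three-index shape (plon, plev, plat), the name plev, and the helper functions are assumptions made for illustration; the text above states only that longitude is the first index and latitude the last.

```python
def column_major_offset(i, k, j, plon, plev):
    """Zero-based linear offset of element (i, k, j) in a column-major
    array of assumed shape (plon, plev, plat).

    i: longitude index, k: vertical level index, j: latitude index.
    Hypothetical helper, not part of CRM.
    """
    return i + plon * (k + plev * j)


def strides(plon, plev):
    """Element strides along (longitude, level, latitude) for that layout."""
    return (1, plon, plon * plev)


if __name__ == "__main__":
    # Illustrative sizes only: 18 levels with two different plon values.
    for plon in (1, 64):
        print(plon, strides(plon, 18))
```

Under these assumptions, plon = 1 gives a level stride of 1, so the levels of a single column are contiguous in memory, while a large plon places consecutive levels of one column plon elements apart.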
Up to three different experiments are run.
Index layout sensitivity.
For a fixed number of columns, the number of longitudes
is varied, requiring a corresponding "inverse" variation in the
number of latitudes. This represents different domain
decompositions, e.g. varying how a fixed number of processors is
divided between the latitude and longitude directions.
The issue to be examined is how sensitive performance is to computing
a small number of columns at a time (plon small) versus a large number
at a time (plon large). Good vectorization requires plon to be large.
Cache-based processor architectures may prefer plon small. Setting
plon=1 is equivalent to setting the vertical level as the first index.
Square decomposition scaling.
The total number of columns is varied, assigning
approximately the same number to both longitude and latitude.
This corresponds to a "square" domain decomposition as the number of
processors is varied.
1D decomposition scaling.
The number of latitudes is varied for a fixed number of
longitudes. This corresponds to the scaling for a one-dimensional
(latitude) domain decomposition such as that used in the
production version of CCM.
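The three experiments differ only in how the (plon, plat) pairs are chosen. A minimal sketch of generating such pairs follows; the function names, the divisor-based rule for "approximately square", and the sizes in the usage lines are assumptions for illustration, not the settings used in the actual runs.

```python
import math


def layout_sweep(ncols):
    """Index layout sensitivity: every (plon, plat) with plon * plat == ncols."""
    return [(plon, ncols // plon)
            for plon in range(1, ncols + 1) if ncols % plon == 0]


def near_square(ncols):
    """Square decomposition: the most nearly square (plon, plat) factoring
    of ncols, with plon <= plat (one possible reading of 'approximately')."""
    plon = math.isqrt(ncols)
    while ncols % plon != 0:
        plon -= 1
    return plon, ncols // plon


def one_d_sweep(plon, plats):
    """1D decomposition: fixed number of longitudes, varying latitudes."""
    return [(plon, plat) for plat in plats]


if __name__ == "__main__":
    print(layout_sweep(8))
    print([near_square(n) for n in (16, 32, 64)])
    print(one_d_sweep(4, [1, 2, 4]))
```

In the layout sweep the product plon * plat stays constant, so the amount of work is fixed and only the memory layout changes; in the other two sweeps the total number of columns changes as well.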
For each experiment, three additional issues may also be investigated.
Compiler option sensitivity.
A range of compiler options is examined for each experiment
to determine which options are best and how sensitive
performance is to the choice of options.
Note that only the standard and aggressive optimizations are examined.
Instruction and data cache state sensitivity.
Experiments are run with timing beginning (1) immediately,
(2) after computing a single column, and (3) after running the whole
experiment once without timing. This examines the sensitivity
to the "first time" perturbation and other instruction and
data caching issues.
Multiple instance sensitivity in a shared memory node.
The serial code is run on multiple processors simultaneously.
This examines the effect on performance of multiple processors
contending for memory in a shared memory node.
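The three cache-state variants can be sketched as a small timing harness. This is only an illustration of the idea: the mode names and the callables run_column and run_experiment are hypothetical placeholders, not routines from PCRM.

```python
import time


def time_experiment(run_column, run_experiment, mode):
    """Time one experiment under a chosen warm-up mode.

    mode "cold":        start timing immediately
    mode "one_column":  compute a single column first, untimed
    mode "full_warmup": run the whole experiment once first, untimed
    Returns the elapsed wall-clock time in seconds.
    """
    if mode == "one_column":
        run_column()
    elif mode == "full_warmup":
        run_experiment()
    elif mode != "cold":
        raise ValueError(f"unknown mode: {mode!r}")

    start = time.perf_counter()
    run_experiment()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workloads, used only to exercise the harness.
    def one_column():
        return sum(x * x for x in range(1000))

    def experiment():
        return [one_column() for _ in range(100)]

    for mode in ("cold", "one_column", "full_warmup"):
        print(mode, time_experiment(one_column, experiment, mode))
```

Comparing the three timings for the same experiment gives a rough measure of the "first time" cost relative to steady-state performance.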
All results are presented in terms of MFlops/second, where the
floating point operations were counted using the
Speed Shop tools "ssrun -ideal" and "prof -archinfo". Compiler optimization
was set at "-64 -O3". The results for a single 18-level column using
the "cloudy day" input provided with CRM are as follows.