Oak Ridge National Laboratory received 24 IBM p690 32-way
"Turbo" shared-memory nodes in December 2001 and 3 more nodes in early 2002.
As of May 2002, the final configuration and production system software
are just being completed.
As part of the functionality testing, extensive performance evaluation
studies were undertaken on a single p690 node.
The following results represent achievable lower bounds, in that the
MPI libraries have not yet been tuned for the p690, and the compiler
and operating system continue to evolve.
System Configuration Studies
- OS Upgrade Tests. Comparing the
performance when using AIX 5.1C, AIX 5.1D, and AIX 5.1D with large pages.
- Configuration Tests. Comparing the
performance of the different node types in the ORNL p690 system (using AIX 5.1C).
- Fortran Compiler Tests.
Evaluating the performance sensitivity of two serial benchmarks to
changes in compiler options (using AIX 5.1C).
Benchmark Studies
- Tom Dunigan's
results, including
- EVH1 benchmark. The Enhanced Virginia Hydrodynamics #1
(EVH1) application represents an important kernel in the
"TeraScale Simulations of Neutrino-Driven Supernovae and Their
Nucleosynthesis" SciDAC project.
EVH1 is an MPI code, and the following scaling results are for a fixed
size problem.
The POWER4 system shows excellent scalability up to 32 processors
for this problem size (as do the POWER3 II, EV67, and EV68 systems),
outperforming the POWER3 system by more than a factor of 3 when using
32 processors. For this particular problem size, no more than 128
processors can be used, and communication costs affect scaling on all
platforms when using more than 64 processors.
- AORSA3D benchmark. The AORSA-3D code solves for the wave electric
field and heating in a 3-D stellarator plasma heated by radio
frequency waves using an all-orders spectral algorithm. It represents
an important kernel in the
"Numerical Computation of
Wave-Plasma Interactions in Multi-dimensional Systems"
SciDAC project. AORSA3D is an MPI code that uses SCALAPACK
to solve linear systems arising from the spectral discretization.
AORSA3D is typically run in a scale-up mode, where the number of
modes retained by the model is increased as the number of
processors is increased, keeping the memory size per processor
approximately constant. The following experimental results
describe the performance in terms of the ratio of the
number of modes to the execution time. The scaling behavior
as a function of the number of processors is not important.
If N is the number of modes, then the total memory requirement is
O(N**2), while the computational complexity contains both
O(N**2) and O(N**3) terms.
Thus the ratio necessarily decreases for increasing N once the
O(N**3) term becomes dominant.
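The scale-up rule described above can be sketched in a few lines. This is a minimal illustration, assuming only what the text states: total memory grows as N**2, so holding memory per processor constant means the number of modes grows with the square root of the processor count. The function names and the baseline of 1000 modes on 32 processors are hypothetical and are not taken from the AORSA3D runs.

```python
import math


def modes_for_processors(procs, base_modes, base_procs):
    """Modes N that keep memory per processor constant, assuming memory ~ N**2.

    base_modes and base_procs describe a hypothetical reference run.
    Constant factors and lower-order memory terms are ignored.
    """
    return base_modes * math.sqrt(procs / base_procs)


def modes_per_second(modes, seconds):
    """The reported metric: number of modes divided by execution time."""
    return modes / seconds


if __name__ == "__main__":
    # Hypothetical baseline: 1000 modes on 32 processors.
    for procs in (32, 128, 512):
        n = modes_for_processors(procs, base_modes=1000, base_procs=32)
        print(procs, round(n))
```

Under this assumption, quadrupling the processor count doubles the number of modes that fit in the same per-processor memory.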
- PCTM benchmark runs. The Parallel Climate Transitional Model (PCTM)
is the next generation
of the Parallel Climate Model.
It is made up of atmosphere, ocean, land surface, and sea ice
component models, and a coupler to exchange fluxes between the
component models. The atmospheric model is a recent version of
the Community Climate Model, developed at the National Center for
Atmospheric Research (NCAR). The ocean model is POP (Parallel Ocean Program),
developed at Los Alamos National Laboratory (LANL), the National Physical
Laboratory (NPL), and NCAR. PCTM is used in production on the IBM
SPs at both ORNL and NERSC. Detailed performance results from last year
are available here.
The following graph plots the number of years
that can be simulated in one day as a function of the number of
processors.
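The throughput metric used for this graph (simulated years per wall-clock day) can be computed from a timed run as sketched below. The function name and arguments are illustrative and are not part of PCTM.

```python
SECONDS_PER_DAY = 86400.0


def simulated_years_per_day(simulated_years, wallclock_seconds):
    """Convert a timed run into simulated years per wall-clock day."""
    if wallclock_seconds <= 0:
        raise ValueError("wallclock_seconds must be positive")
    return simulated_years * SECONDS_PER_DAY / wallclock_seconds


if __name__ == "__main__":
    # Hypothetical run: one simulated year completed in 12 hours of wall-clock time.
    print(simulated_years_per_day(1.0, 12 * 3600.0))
```

Because the metric is proportional to inverse runtime, doubling it at a given problem size corresponds to halving the time to solution.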
- CCM/MP-2D benchmark. CCM/MP-2D is the massively parallel implementation of
version 3.6.6 of the Community Climate Model (CCM). It was developed originally
to determine how best to parallelize the CCM, and the results from this
research are being used in the parallelization of the
Community Atmospheric Model (CAM). CCM/MP-2D is currently used for
benchmarking parallel systems.
The first two graphs describe the observed computational rate as a
function of processor count for two fixed problem sizes. The flop count
was determined using SpeedShop on an SGI Origin. As this count
gives equal weight to all floating point operations (including, for example,
square root), its meaning is somewhat suspect. Each graph is
best interpreted as a consistently weighted plot of inverse runtime.
The second two graphs describe the sources of the performance loss on the
p690 system when increasing the number of processors.
The sources are graphed cumulatively, so the top curve represents the
total loss in efficiency compared to a single processor run. For 32
processors, this loss is 49% for problem T42L18. Approximately 16% of
the loss is due to communication costs. While other sources
(load imbalance, copy costs, duplicate computation) contribute to
the parallel overhead, the primary source of performance degradation
for larger processor counts is in the "other" category, which we attribute
to a decrease in computational rate. This could be due to locality issues
or memory contention. It could also be due to the effect of the change
in granularity from the domain decomposition (loops getting shorter,
resulting in less efficient streaming?). It is probably a function of
all three, and it will be interesting to see how this changes with later
versions of the system software. It will also be interesting to
measure performance when using more than 32 processors. Up to 32
processors, all processors are on the same node, so the memory
demands actually go up even though the per process memory granularity
decreases as a function of the number of processors. Involving
two nodes (64 processors) will increase MPI communication costs, but will
also approximately halve the memory requirements.
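The efficiency-loss plots can be read with the help of the sketch below. It assumes the usual definition of parallel efficiency, T1 / (P * TP), which the text does not spell out, and it stacks per-source losses so that the last running total is the top curve. The function names and all timings and loss fractions in the example are hypothetical.

```python
from itertools import accumulate


def parallel_efficiency(t1, tp, procs):
    """Efficiency relative to a single-processor run: T1 / (P * TP)."""
    return t1 / (procs * tp)


def cumulative_losses(losses):
    """Stack per-source efficiency losses.

    losses is a list of (source_name, loss_fraction) pairs. The result pairs
    each source with the running total, so the final entry is the total loss.
    """
    names = [name for name, _ in losses]
    totals = accumulate(fraction for _, fraction in losses)
    return list(zip(names, totals))


if __name__ == "__main__":
    # Hypothetical timings: 64 s on 1 processor, 4 s on 32 processors.
    eff = parallel_efficiency(64.0, 4.0, 32)
    print("efficiency:", eff, "loss:", 1.0 - eff)
    # Hypothetical breakdown of that loss into two sources.
    print(cumulative_losses([("communication", 0.25), ("other", 0.25)]))
```

Under this definition, the 49% loss reported for T42L18 on 32 processors corresponds to an efficiency of 0.51, or a speedup of roughly 16 over the single-processor run.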
- PSTSWM processor benchmark. PSTSWM
represents an important computational kernel in spectral global atmospheric
models. As 99% of floating point operations are multiply or add,
it runs well on systems optimized for these operations.
It also exercises the memory subsystem as the problem size is
scaled and can be used to evaluate the impact of memory contention
in SMP nodes. In the following plot, the effect of increasing problem size
on single processor performance is examined.
Each problem size requires approximately 4 times as much memory as the
next smaller problem size, with T5 needing approx. 60KB of data space and
T170 needing approx. 76MB.
Details can be viewed
here and
here.
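As a consistency check on the memory figures quoted above, the sketch below estimates how many factor-of-4 steps separate the smallest and largest problem sizes. It assumes binary units (1 KB = 1024 bytes) and treats the quoted sizes as exact, although the text gives them only approximately. The function name is illustrative.

```python
import math

KB = 1024
MB = 1024 * KB


def quadruplings_between(small_bytes, large_bytes):
    """Number of factor-of-4 steps needed to grow from small_bytes to large_bytes."""
    return math.log(large_bytes / small_bytes, 4)


if __name__ == "__main__":
    # Sizes quoted in the text: T5 at about 60KB, T170 at about 76MB.
    steps = quadruplings_between(60 * KB, 76 * MB)
    print(f"{steps:.2f} quadruplings between T5 and T170")
```

The result is a little over five quadruplings, which is consistent with about six problem sizes between T5 and T170 if the factor of 4 holds throughout. The exact number of sizes is not stated in the text.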
- PCRM processor benchmark. PCRM
is a slightly modified version of the column radiation model
used in version 3.6.6 of the NCAR Community Climate Model (CCM).
It (also) represents an important computational kernel in atmospheric
models. It differs significantly from PSTSWM in that over 6% of
the floating point operations are divide or square root. This makes vastly
different demands on the processor and compiler.
The data structures used by this model
are of the form (longitude, vertical, latitude).
The experiment being described has a fixed problem size: 512 columns, each
with 18 vertical levels. What is being varied is the length
of the longitude and latitude dimensions, keeping their product fixed at 512.
From these results, the POWER4 processor and compiler have the same
qualitative performance characteristics as the POWER3 II, but with
significantly improved quantitative performance. All the same, if
higher-order transformations cannot be tolerated for numerical
accuracy reasons, then performance is seriously degraded.
More details on the experiment can be viewed
here.
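The PCRM experiment design (512 columns, 18 levels, varying longitude and latitude extents) can be enumerated as sketched below. This lists every integer split whose product is 512. Which of these splits were actually run is not stated in the text, so the list is an upper bound on the configurations, and the function name is illustrative.

```python
LEVELS = 18          # vertical levels per column (from the text)
TOTAL_COLUMNS = 512  # fixed number of columns (from the text)


def dimension_splits(total_columns):
    """All (longitude, latitude) extents whose product equals total_columns."""
    return [
        (lon, total_columns // lon)
        for lon in range(1, total_columns + 1)
        if total_columns % lon == 0
    ]


# Array shapes in the model's (longitude, vertical, latitude) layout.
shapes = [(lon, LEVELS, lat) for lon, lat in dimension_splits(TOTAL_COLUMNS)]

if __name__ == "__main__":
    for shape in shapes:
        print(shape)
```

Every shape holds the same number of values, so differences in measured performance across the splits reflect how the data layout and loop lengths interact with the processor and compiler rather than a change in the amount of work.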