PSTSWM Serial Performance

Performance Studies using PSTSWM


Serial Performance

The following studies describe the serial (1 processor) performance of the code PSTSWM for a number of different problem sizes. PSTSWM computes the solution by timestepping, advancing the approximation to a new timelevel (in simulation time) by using the approximations at the two previous timelevels. In the following, we will refer to the process of advancing the approximation to a new timelevel as a step.

The computational complexity and code executed for a step in PSTSWM are identical for all steps except the first, and all steps "should" require the same amount of execution time. The first step is actually a double step, each advancing the simulated time half the distance of a normal step, and is counted as two steps in the following discussion and presentation of results.

Performance is reported in number of seconds of execution time per step and in MFlop/sec. The time reported is the average over all steps (total time/number of steps). The code is run multiple times for each problem size, and the fastest time is used. We also measure the minimum and maximum execution times for individual steps for a given run, and this information is used to determine whether extraneous effects (rare system interrupts or other users) have contaminated the timing unacceptably.
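The reporting rule above can be sketched in a few lines. This is a minimal Python illustration, not PSTSWM code: the function names, the sample timings, and the 1.5x spread threshold used to flag a contaminated run are all assumptions made for the example.

```python
def summarize_run(step_times):
    """Return (average, minimum, maximum) seconds per step for one run."""
    average = sum(step_times) / len(step_times)
    return average, min(step_times), max(step_times)


def best_run(runs, max_spread=1.5):
    """Return the fastest average time per step over the acceptable runs.

    A run whose slowest step exceeds max_spread times its fastest step is
    treated as contaminated and skipped. The threshold is hypothetical.
    """
    best = None
    for step_times in runs:
        average, fastest, slowest = summarize_run(step_times)
        if slowest > max_spread * fastest:
            continue  # probably disturbed by interrupts or other users
        if best is None or average < best:
            best = average
    return best


# Made-up per-step timings (seconds) for three runs of one problem size.
runs = [
    [0.50, 0.50, 0.50, 0.50],
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 2.00, 0.25],  # one step hit by an extraneous delay
]
reported = best_run(runs)
```

Here the third run is rejected because one step is far slower than the others, and the second run supplies the reported time per step.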

The goal of these studies is to determine the "peak achievable" serial performance that would be attained in long (production) simulations, to establish a baseline for determining parallel scalability. To better approximate the performance achieved in long simulations when timing only a relatively short run, we calculate one or more steps, then reinitialize, before beginning timing. This guarantees that the code and data memory have all been "touched" before timing begins, eliminating some start-up performance artifacts. It also eliminates the time for loading and initializing the program.

The MFlop rate is approximated using the floating point operation count measured by the Speed Shop tools "ssrun -ideal" and "prof -archinfo" on an SGI Origin 2000 at Los Alamos National Laboratory. Compiler optimization was set at "-64 -O3" with PSTSWM version 6.5. Multiple runs were performed with differing compiler optimization options, and this particular optimization level produced the minimum number of floating point operations. The number of steps computed was also varied in the measurements, so that operations corresponding to initialization and other start-up overhead could be removed, and so that the operation counts used correspond to the timings. This MFlop rate metric is not the actual MFlop rate on any platform other than the Origin, but the ratings are consistent between problem sizes and across platforms, and are easier to use for comparisons than the raw timings.
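One way to read the procedure above: if the total operation count of a run is modeled as a fixed start-up cost plus a per-step cost, two runs with different step counts are enough to cancel the start-up term. The Python sketch below illustrates that arithmetic under this linear model. The function names, the start-up count, and the time per step are hypothetical; only the T42 per-step count comes from the table of problem sizes on this page.

```python
def flops_per_step(total_flops_a, steps_a, total_flops_b, steps_b):
    """Estimate per-step flops from two runs with different step counts.

    Assumes total = startup + steps * per_step, so the start-up term
    cancels when the two totals are subtracted.
    """
    return (total_flops_b - total_flops_a) / (steps_b - steps_a)


def mflop_rate(flop_count_per_step, seconds_per_step):
    """Convert a per-step flop count and time per step to MFlop/sec."""
    return flop_count_per_step / seconds_per_step / 1.0e6


# Hypothetical totals: 1,000,000 start-up flops plus 4040627 flops per step.
STARTUP = 1_000_000
PER_STEP_T42 = 4_040_627
total_10 = STARTUP + 10 * PER_STEP_T42
total_20 = STARTUP + 20 * PER_STEP_T42

per_step = flops_per_step(total_10, 10, total_20, 20)
rate = mflop_rate(per_step, 0.0625)  # made-up time of 1/16 second per step
```

Because the same operation count is used for every platform, the resulting rate is a rescaled timing rather than a measured hardware rate, which is the point made in the paragraph above.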

Note that the timings on the different platforms do not all use the same version of PSTSWM, but most of the code differences are in the parallel implementations. In particular, the computational complexity is identical for a given problem size across all platforms.

We use the standard benchmark problem for the shallow water equations - global steady state nonlinear zonal geostrophic flow - as described in

Six problem size classes are used: T5, T10, T21, T42, T85, and T170, characterized by the following computational grids and complexity:

Problem size   Physical grid   Fourier grid   Spectral coefficients   Flops per step
T5             16 X 8          8 X 8          21                      31417
T10            32 X 16         16 X 16        66                      148237
T21            64 X 32         32 X 32        253                     764917
T42            128 X 64        64 X 64        946                     4040627
T85            256 X 128       128 X 128      3741                    23792575
T170           512 X 256       256 X 256      14706                   150508184
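The tabulated sizes follow simple patterns. The spectral coefficient counts match (M+1)(M+2)/2 for truncation TM, the count for a triangular truncation, and the physical grids match a longitude count equal to the smallest power of two that is at least 3M+1, with half as many latitudes. The Python sketch below checks both relations against the table. These rules are inferred from the tabulated values and are an assumption about how the grids were chosen, not a statement of what PSTSWM does internally; the function names are hypothetical.

```python
def spectral_coefficients(m):
    """Coefficient count for triangular truncation at wavenumber m."""
    return (m + 1) * (m + 2) // 2


def physical_grid(m):
    """Return (longitudes, latitudes) under the inferred sizing rule."""
    nlon = 1
    while nlon < 3 * m + 1:  # smallest power of two that is >= 3m + 1
        nlon *= 2
    return nlon, nlon // 2


# truncation: (longitudes, latitudes, spectral coefficients) from the table
TABLE = {
    5: (16, 8, 21),
    10: (32, 16, 66),
    21: (64, 32, 253),
    42: (128, 64, 946),
    85: (256, 128, 3741),
    170: (512, 256, 14706),
}

mismatches = [
    m
    for m, (nlon, nlat, ncoef) in TABLE.items()
    if physical_grid(m) != (nlon, nlat) or spectral_coefficients(m) != ncoef
]
```

An empty `mismatches` list means every row of the table is reproduced by the two rules.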

There is also a vertical component to the problem size. For example, T42L16 is a T42 horizontal grid with 16 vertical levels. The complexity of solving the problem is linear in the number of vertical levels. (In contrast, the complexities with respect to longitude and latitude have n*log(n) and n*n dependencies, respectively.)

The final two problem size parameters are the number of steps computed and the precision used in the computation. The complexity is linear in the number of steps. The number of steps is included with the descriptions of the results. 64-bit precision floating point computation is used in all experiments. The Speed Shop data also indicates that over 99% of the floating point operations are either floating point adds or floating point multiplies.

The vertical level is the second index in the arrays: (longitude, vertical, latitude, field), and all levels are operated on "simultaneously" during a timestep. To be more precise, loop orderings are consistent with the array index orderings, to the extent possible, as this improves performance given this index ordering. In consequence, each stage of the timestep calculation (calculation of nonlinear products and forcing terms, Fourier transforms, Legendre transforms, and update for new timelevel) is completed for all levels before advancing to the next stage. As the number of levels increases, the dependence on data from main memory (rather than registers or cache) increases, and serial PSTSWM benchmark runs have characteristics in common with "STREAMS"-like benchmarks, with performance increasingly sensitive to the bandwidth to and from memory. (For the most part, memory accesses are "stride 1", so the sensitivity is primarily to bandwidth, and not latency.)
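To make the "stride 1" remark concrete, the Python sketch below computes the linear memory offset of an element of an array indexed (longitude, vertical, latitude, field). It assumes column-major storage, in which the first index varies fastest in memory; that storage order, the function name, and the choice of a T42L16 array are assumptions for illustration, since the text above gives only the index ordering.

```python
def offset(lon, lev, lat, field, nlon, nlev, nlat):
    """Linear offset of element (lon, lev, lat, field), 0-based indices.

    Assumes column-major storage: longitude varies fastest, field slowest.
    """
    return lon + nlon * (lev + nlev * (lat + nlat * field))


# T42L16 physical grid: 128 longitudes, 16 levels, 64 latitudes.
NLON, NLEV, NLAT = 128, 16, 64

base = offset(0, 0, 0, 0, NLON, NLEV, NLAT)
stride_lon = offset(1, 0, 0, 0, NLON, NLEV, NLAT) - base
stride_lev = offset(0, 1, 0, 0, NLON, NLEV, NLAT) - base
stride_lat = offset(0, 0, 1, 0, NLON, NLEV, NLAT) - base
stride_field = offset(0, 0, 0, 1, NLON, NLEV, NLAT) - base
```

Under this assumption an innermost loop over longitude walks memory with unit stride, while moving to the next latitude skips over all levels of the current one, so the data touched per latitude grows with the number of levels.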

Performance can be improved on cache-based architectures for large numbers of vertical levels by permuting the index ordering to (longitude, latitude, vertical, field) and completing all computation on a given level for the current timestep before beginning the computation for the next level. We will refer to this as the "layer" approach, as distinct from the "level" approach. The level approach is consistent with the full atmospheric code CCM/MP-2D, and was chosen for its excellent vectorization properties and the efficiency of this index ordering for the column physics computations in the full code. However, large discrepancies between the "level" and "layer" PSTSWM performance for problem granularities of interest may force this decision to be revisited in future versions of the CCM.
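The difference between the two approaches is only the order in which (stage, level) pairs of work are visited within a timestep. The Python sketch below lists that order for each approach. It is a schematic of the scheduling described above, not PSTSWM code; the stage names are taken from the text, and the function names are hypothetical.

```python
STAGES = [
    "nonlinear products",
    "Fourier transforms",
    "Legendre transforms",
    "update",
]


def level_order(nlev):
    """Level approach: finish each stage for all levels before the next stage."""
    return [(stage, lev) for stage in STAGES for lev in range(nlev)]


def layer_order(nlev):
    """Layer approach: finish all stages for one level before the next level."""
    return [(stage, lev) for lev in range(nlev) for stage in STAGES]


level_schedule = level_order(2)
layer_schedule = layer_order(2)
```

Both schedules contain the same work items. In the layer schedule the data for one level is reused across consecutive stages, which is why it can be friendlier to cache when the number of levels is large.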

Up to three different types of results are presented. The first type is tabular data for 1, 2, 3, and 16 levels for a number of different compiler options, both with and without math libraries. The second type is graphical data for all levels for which data exists, for interesting or representative compiler options. The third type is graphical data of the serial time when running on one processor and when running multiple instances of the serial code. This can show the effect of contention for memory and shared system resources on SMP nodes. For example, for a system with 4-way SMP nodes, we run on 1 processor (with the other 3 processors in the node idle), on 2 processors, and on 4 processors, as well as on 4 processors spread across 4 nodes. (Timings for this last case should be identical to the 1 processor timings.) The data is graphed in two different ways: as a function of the problem size (number of vertical levels) per processor, and as a function of the problem size per node. The per processor graphs have the same per processor problem granularity, and so the same cache hit/miss characteristics, but significantly different total memory size required per node. The per node graphs have similar per node memory requirements across the different experiments, but very different per processor problem granularities. Only the overall best compiler options and choices of math libraries are used for the multiple instance experiments.
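The two graph types can be related with a small calculation. The Python sketch below takes the number of serial instances running on a node and the number of vertical levels in each instance, and returns the horizontal coordinate used in each kind of graph. It reflects one reading of the description above, in which the per node problem size is the sum of the levels over all instances on the node; the function name and the level counts are hypothetical.

```python
def graph_coordinates(instances_per_node, levels_per_instance):
    """Return (levels per processor, levels per node) for one experiment."""
    per_processor = levels_per_instance
    per_node = instances_per_node * levels_per_instance
    return per_processor, per_node


# Hypothetical experiments on one 4-way SMP node with 1, 2, and 4 instances.
# Same per processor size (16 levels each): per node size grows.
same_per_processor = [graph_coordinates(n, 16) for n in (1, 2, 4)]
# Same per node size (16 levels in total): per processor size shrinks.
same_per_node = [graph_coordinates(n, 16 // n) for n in (1, 2, 4)]
```

The first list holds the per processor coordinate fixed while the per node memory footprint grows, and the second holds the per node footprint fixed while the per processor granularity shrinks, matching the two comparisons described above.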

RESULTS:

 
Compaq
AlphaSC-500
AlphaSC-667
AlphaSC-1000
 
Cray Research
T3D
T3E-600
T3E-900
 
HP/Convex SPP
SPP-1200
SPP-2000
 
IBM
SP2-66 (using ESSL math library)
SP2-120
SP3-200 Winterhawk I
SP3-222 Nighthawk I
SP3-375 Winterhawk II
p690 (1.3 GHz Regatta H)
 
Intel
Paragon
Pentium II cluster (266MHz; Linux)
 
SGI
Origin2000-195
Origin2000-250
 
Sun
UltraSPARC-III (750 MHz)
 
Platform Comparisons
Platform Comparisons (no math library)

ADDENDUM

Some of the MFlop rates reported here for the older experiments are based on floating point operation counts returned by the hardware performance monitor for a single processor run on a Cray C90. As on the Origin, multiple runs were performed with differing compiler optimization options, and we used the minimum number of floating point operations measured. The number of steps computed was also varied in the measurements, so that operations corresponding to initialization and other start-up overhead could be removed, and so that the operation counts used correspond to the timings. The results are as follows:

Problem size   Physical grid   Fourier grid   Spectral coefficients   Flops per step
T42            128 X 64        64 X 64        946                     4129859
T85            256 X 128       128 X 128      3741                    24235477
T170           512 X 256       256 X 256      14706                   153014243

These counts are slightly larger than those measured on the Origin, but are similar enough that the MFlop ratings calculated using these counts are still comparable estimates.



Patrick H. Worley (worleyph@ornl.gov)
Last Modified Monday, 15-Jul-2002 10:00:01 EDT.