Performance of PSTSWM on the NEC SX-6

NEC SX6 Performance Evaluation

PSTSWM was originally designed as a testbed for parallel spectral algorithms on the sphere. Its serial performance, especially its memory access pattern, is similar to that of the spectral dynamical core used in the NCAR global atmospheric models. By scaling the problem size (horizontal and vertical resolutions), the effect of the memory subsystem on performance can be ascertained. Running multiple copies of the serial code simultaneously allows us to examine the performance impact of memory contention as well. PSTSWM is dominated by floating point multiply/add operations, and can also take advantage of math library FFT routines.

The first three graphs plot the change in single processor performance (MFlops/second/processor) as a function of horizontal resolution for a fixed number of vertical levels, either 1, 16, or 18. The indices for the computational arrays are (longitude, vertical, latitude), while most of the data dependencies are in the longitude and latitude directions. In consequence, the code spends more time going to memory for larger numbers of vertical levels.
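One way to see why more vertical levels mean more memory traffic is to look at element strides. The sketch below assumes a contiguous Fortran-style (column-major) array dimensioned (longitude, vertical, latitude); the function name and the 128 x 64 grid are illustrative assumptions and are not taken from the PSTSWM source.

```python
def column_major_strides(nlon, nver, nlat):
    """Element strides for a column-major array shaped (nlon, nver, nlat).

    The first index (longitude) is contiguous; stepping one latitude
    skips a full longitude-by-vertical slab. nlat does not affect the
    strides and is accepted only to mirror the array shape.
    """
    return {"longitude": 1, "vertical": nlon, "latitude": nlon * nver}


# Hypothetical grid of 128 longitudes x 64 latitudes (assumed for illustration).
for nver in (1, 16, 18):
    strides = column_major_strides(128, nver, 64)
    print(f"{nver:2d} levels: latitude stride = {strides['latitude']} elements")
```

Under these assumptions the distance between neighboring latitudes grows in proportion to the number of vertical levels, so latitude-direction dependencies touch memory that is spread more widely.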

Six horizontal resolutions were used:

Each horizontal resolution requires slightly more than 4 times as much memory as the next smaller problem size. For example, T21 needs 0.8MB of data space and T85 needs 16MB for a single vertical level. To determine the total memory requirements, multiply the space needed for a single vertical level by the number of vertical levels.
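The memory arithmetic can be sketched as follows. Only the T21 and T85 per-level figures come from the text; the function names are hypothetical, and the growth check assumes that T85 is two resolution steps above T21 (with T42 in between).

```python
# Per-level data space in MB, as quoted in the text.
PER_LEVEL_MB = {"T21": 0.8, "T85": 16.0}


def total_memory_mb(resolution, levels):
    """Total data space: per-level space times the number of vertical levels."""
    return PER_LEVEL_MB[resolution] * levels


def growth_per_step(small_mb, large_mb, steps):
    """Average memory growth factor per resolution step."""
    return (large_mb / small_mb) ** (1.0 / steps)


print(f"T85L18 needs about {total_memory_mb('T85', 18):.0f} MB")
# Assumption: T21 -> T42 -> T85 is two steps.
factor = growth_per_step(PER_LEVEL_MB["T21"], PER_LEVEL_MB["T85"], steps=2)
print(f"average growth per step: {factor:.2f}x")
```

The per-step factor comes out somewhat above 4, consistent with the "slightly more than 4 times" statement.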

The experiments described here examine the performance of a version of PSTSWM that was ported to the SX-5 in May 2001. Profiling was used to identify the routines that were most important to optimize, and numerous experiments involving a variety of index and loop orderings and explicit loop fusion were used to improve the performance of these routines. All changes were local to these routines, and the global data structures were not altered. After these optimizations, the same routines were still the most important, but performance had improved by a factor of 10 to 40.

Four of the five curves in the following graphs represent the single processor performance when a single processor is computing. The curves describe performance when running with

The final curve depicts the single processor performance (using -C hopt optimization) when all 8 processors in the SMP node are solving the same problem simultaneously. As each processor is running a separate process, they all have separate address spaces and the only interaction "should be" contention for memory.

The vertical dimension is a vectorization direction, and performance is significantly better for the problems with 16 and 18 levels. Performance is also better for larger horizontal resolutions, up to a point. There is a performance degradation for T85 with larger numbers of vertical levels. Performance for the original version of the code was very poor - on par with the "-O ssafe" optimization results. The modifications for vectorization were restricted to a few subroutines, but performance improved by over a factor of 10 in some cases.

The following table describes the performance of the four most time consuming subroutines when running T42L18. (Data was collected using ftrace.)
PROG.UNIT   EXCLUSIVE TIME [sec] (%)    MFLOPS   V.OP RATIO   AVER. V.LEN
pdzpuv          23.551 (35.0)           1537.4      99.78        195.1
rsdpiv          17.470 (25.9)           2033.0      99.80        237.7
cfftf            9.144 (13.6)           3953.3      99.91        256.0
cfftb            5.241 ( 7.8)           4817.2      99.92        250.4

pdzpuv and rsdpiv are the inverse and forward Legendre transforms, respectively. cfftf and cfftb are the forward and inverse complex fast Fourier transforms. These routines achieve between 20% and 60% of peak. All routines are Fortran code. (The FFT routines from the MathKeisan library ran significantly slower than the Fortran equivalents in this application.) Additional code restructuring might improve performance further, but would require modifications to global data structures.

The following table contains the same data for T85L18.
PROG.UNIT   EXCLUSIVE TIME [sec] (%)    MFLOPS   V.OP RATIO   AVER. V.LEN
pdzpuv          39.779 (39.0)           1455.8      99.84        231.3
rsdpiv          36.855 (36.1)           1556.7      99.82        249.7
cfftf            9.391 ( 9.2)           3704.0      99.90        256.0
cfftb            5.324 ( 5.2)           4481.4      99.90        256.0

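The major decrease in performance between T42L18 and T85L18 is therefore in the forward Legendre transform (rsdpiv), which drops from 2033.0 to 1556.7 MFLOPS, while the other three routines each lose well under 10%.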
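The comparison between the two tables can be reproduced with a short script. The MFLOPS values are copied from the ftrace tables; the peak rate of 8000 MFLOPS per processor is an assumption that is not stated in the text, and the helper names are hypothetical.

```python
# MFLOPS per routine, copied from the two ftrace tables.
MFLOPS = {
    "pdzpuv": {"T42L18": 1537.4, "T85L18": 1455.8},
    "rsdpiv": {"T42L18": 2033.0, "T85L18": 1556.7},
    "cfftf": {"T42L18": 3953.3, "T85L18": 3704.0},
    "cfftb": {"T42L18": 4817.2, "T85L18": 4481.4},
}

# Assumption (not from the text): peak of 8000 MFLOPS per SX-6 processor.
ASSUMED_PEAK_MFLOPS = 8000.0


def percent_change(routine):
    """Relative change in MFLOPS going from T42L18 to T85L18, in percent."""
    before = MFLOPS[routine]["T42L18"]
    after = MFLOPS[routine]["T85L18"]
    return 100.0 * (after - before) / before


def percent_of_peak(routine, case):
    """Achieved rate as a percentage of the assumed peak."""
    return 100.0 * MFLOPS[routine][case] / ASSUMED_PEAK_MFLOPS


for name in MFLOPS:
    print(f"{name:7s} change {percent_change(name):6.1f}%  "
          f"T42L18 at {percent_of_peak(name, 'T42L18'):4.1f}% of assumed peak")

# The routine with the most negative change has the largest slowdown.
largest_drop = min(MFLOPS, key=percent_change)
print("largest slowdown:", largest_drop)
```

With the assumed peak, the T42L18 rates fall at roughly 20% to 60% of peak, in line with the range quoted above.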

The following graph examines in more detail the performance degradation that arises when running on multiple processors. The "-O hopt" experiments are repeated for 1, 2, 4, 6, and 8 processors.

From this data, the performance degradation is a relatively smooth function of the number of processors, which is consistent with the memory contention hypothesis.

The next four graphs compare SX6 performance with that of a number of different platforms. The first two look at performance when using a single processor within an SMP node. The second two graphs examine single processor performance when all processors in the SMP node are solving the same problem.

From these data, the SX6 can maintain good performance (for problems with reasonable vector lengths) when all processors are computing, while some of the other systems cannot. The SX6 is less sensitive to vector length than the SX5, doing better for short vectors and worse for long vectors.

The next two graphs describe single processor performance for a fixed horizontal resolution (T42) when varying the number of vertical levels. The first of the two describes performance when only a single processor is computing, while the second describes performance when all processors are computing the same problem simultaneously.

Increasing the number of vertical levels increases the vector length (on vector systems), but also increases dependence on memory performance. As such, the SX-6 shows a general performance improvement with an increasing number of vertical levels, while the non-vector systems show a performance degradation.

The final graph restates the previous graph in terms of the total computational rate for a node. The comparison is between the 32 processor p690 and the 8 processor SX-6, as these two systems are more likely to have comparable costs.

From these results, the IBM p690 node has a throughput similar to that of the NEC SX-6 for moderate numbers of vertical levels. The p690 falls behind as the number of vertical levels increases.
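The node-level comparison is simple arithmetic: the total rate for a node is the per-processor rate (measured with all processors busy) times the number of processors. The sketch below uses only the processor counts given above; the function names are hypothetical and no measured rates are assumed.

```python
P690_PROCESSORS = 32  # processors per IBM p690 node, from the text
SX6_PROCESSORS = 8    # processors per NEC SX-6 node, from the text


def node_rate(per_processor_rate, processors):
    """Total node rate when every processor sustains per_processor_rate."""
    return per_processor_rate * processors


def breakeven_fraction():
    """Fraction of the SX-6 per-processor rate that a p690 processor
    must sustain for the two nodes to have equal throughput."""
    return SX6_PROCESSORS / P690_PROCESSORS


print(f"break-even per-processor ratio: {breakeven_fraction():.2f}")
```

In other words, a p690 node matches an SX-6 node as long as each p690 processor sustains at least one quarter of the SX-6 per-processor rate.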

The PSTSWM experiments described here represent an extreme case in many ways, with strong emphasis on the performance of the memory hierarchy. PSTSWM is, however, representative of the dynamics in spectral atmospheric models, and shows that vector systems are appropriate platforms for such models when they are written in a vector-friendly way. In contrast, the scalar performance of the SX5 and SX6 is much worse than the scalar performance of the Compaq and IBM systems. In summary, performance is likely to be very dependent on the application code.


URL http://www.csm.ornl.gov/evaluation/SX6/PSTSWM.SX6.html
Updated: Friday, 30-May-2003 14:07:51 EDT