
IBM p690 PSTSWM Configuration Experiments

IBM p690 Configuration Tests

PSTSWM was originally designed as a testbed for parallel spectral algorithms on the sphere. Its serial performance, especially its memory access pattern, is similar to that of the spectral dynamical core used in NCAR global atmospheric models. By scaling the problem size (horizontal and vertical resolutions), the effect of the memory subsystem on performance can be ascertained. Running multiple copies of the serial code simultaneously allows us to examine the performance impact of memory contention as well. PSTSWM is dominated by floating point multiply/add operations, and can also take advantage of math library FFT routines. In the results that follow, FFT routines from the ESSL library were used.

The first two graphs plot the change in single-processor performance (MFlops/second/processor) as a function of horizontal resolution for a fixed number of vertical levels, either 1 or 16. The indices for the computational arrays are (longitude, vertical, latitude), while most of the data dependencies are in the longitude and latitude directions. Consequently, the code spends more time going to memory as the number of vertical levels increases.
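The effect of the index order can be illustrated with a small stride calculation. This is a minimal sketch, not taken from PSTSWM itself: it assumes column-major (Fortran-style) storage with 8-byte reals, and the grid dimensions (128 longitudes, 64 latitudes) and the helper name fortran_strides are hypothetical values chosen only for illustration.

```python
def fortran_strides(shape, itemsize=8):
    """Byte strides of a column-major array: the first index varies fastest."""
    strides = []
    step = itemsize
    for extent in shape:
        strides.append(step)
        step *= extent
    return tuple(strides)


# Hypothetical grid size, used only to make the strides concrete.
NLON, NLAT = 128, 64

for nver in (1, 16):
    # Index order follows the text: (longitude, vertical, latitude).
    s_lon, s_ver, s_lat = fortran_strides((NLON, nver, NLAT))
    print(f"{nver:2d} levels: longitude stride {s_lon} B, latitude stride {s_lat} B")
```

Under these assumptions, neighbouring latitudes are 1024 bytes apart with 1 vertical level and 16384 bytes apart with 16 levels, so latitude-direction dependencies touch memory that is spread further apart as levels are added.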

Six horizontal resolutions were used:

Each horizontal resolution requires slightly more than 4 times as much memory as the next smaller problem size. For example, for a single vertical level, T21 needs 0.8 MB of data space and T85 needs 16 MB. To determine the total memory requirement, multiply the space needed for a single vertical level by the number of vertical levels.
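The memory arithmetic above can be checked with a few lines of code. This is a sketch only: the per-level sizes for T21 and T85 come from the text, while the function names are hypothetical and the growth-factor check assumes that exactly one resolution lies between T21 and T85.

```python
def total_memory_mb(per_level_mb, levels):
    """Total data space: single-level footprint times number of vertical levels."""
    return per_level_mb * levels


def per_step_growth(small_mb, large_mb, steps):
    """Average memory growth factor per resolution step."""
    return (large_mb / small_mb) ** (1.0 / steps)


# Per-level sizes quoted in the text.
T21_MB = 0.8
T85_MB = 16.0

print(total_memory_mb(T85_MB, 16))          # T85 with 16 vertical levels
print(per_step_growth(T21_MB, T85_MB, 2))   # assumes two steps from T21 to T85
```

With 16 vertical levels, T85 needs 256 MB in total. The per-step factor comes out at about 4.47, which is consistent with "slightly more than 4 times" under the two-step assumption.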

The three curves in the following graphs show the single-processor performance for each type of node when only one processor in that node is computing. (We have omitted the 64GB node type in these experiments.) Note that for the LPAR results, the processors in the other 3 LPAR nodes sharing the same physical p690 node were idle.

These results show no practical difference in single-processor performance between the 32GB and 128GB nodes. The LPAR node has a small performance advantage that increases as the problem size (and memory dependence) increases.

The next two graphs repeat the above experiment, but simultaneously using all processors in a p690 node. In particular, all processors in all 4 LPAR nodes sharing a p690 node are solving the same problem independently. As each processor is running a separate process, they all have separate address spaces and the only interaction is contention for memory.

As before, there is no practical difference between the performance of the 32GB and 128GB nodes for this benchmark. The LPAR node again shows a performance advantage, but otherwise responds to memory contention in a nearly identical way.

While experiments using all 32 processors represent the most realistic test of LPAR performance, the full story is somewhat more complicated. The following graphs look at performance using 8-processor LPAR nodes on the same physical p690 node in a number of different ways.

The best single-processor performance occurs when only one processor is computing in the entire p690 node. There is a small (< 5%) loss of single-processor performance when one processor in each LPAR is computing (for a total of 4 processors in the p690). When all 8 processors in a single LPAR are computing, the contention within the LPAR can decrease performance almost to the same level as when all 32 processors are computing simultaneously. In contrast, the performance when using 16 processors distributed across 4 LPARs (4 processors active in each LPAR) is significantly better than that of the 16 (or 8) processor runs involving only 2 (or 1) LPARs. Note that the 16 processor/4 LPAR ("alternating") performance results used the "bind" command provided by IBM to assign processes to every other processor, eliminating L2 cache sharing. So, either the restricted memory each LPAR has access to increases the potential for memory contention, or forcing processors to share L2 caches is a major determinant of performance. This issue is examined in more detail in later graphs.

Similar experiments were also run using a single (non-LPAR) 32-processor/32GB node. The first two graphs describe performance when using 1, 8, 16, 24, 30, and 32 processors, letting the OS schedule the processes. The third and fourth graphs describe performance when using 1, 16, and 32 processors, comparing performance between the default assignment, assigning processes to consecutive processors ("consecutive"), and assigning processes to every other processor ("alternating"). The fifth and sixth graphs describe performance when using 1, 8, and 32 processors.
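The difference between the "consecutive" and "alternating" assignments can be sketched as follows. This is an illustration only and does not reproduce IBM's "bind" command: it assumes that logical processors 2k and 2k+1 share one L2 cache, and the function names placement and shared_l2_count are hypothetical.

```python
from collections import Counter


def placement(n_procs, scheme):
    """Map process index -> logical processor ID for the two explicit schemes."""
    if scheme == "consecutive":
        return list(range(n_procs))
    if scheme == "alternating":
        return [2 * i for i in range(n_procs)]
    raise ValueError(f"unknown scheme: {scheme}")


def shared_l2_count(cpus, cores_per_l2=2):
    """Number of L2 caches used by more than one process.

    Assumption: processors are numbered so that each group of
    cores_per_l2 consecutive IDs shares a single L2 cache.
    """
    per_cache = Counter(cpu // cores_per_l2 for cpu in cpus)
    return sum(1 for count in per_cache.values() if count > 1)


for scheme in ("consecutive", "alternating"):
    cpus = placement(16, scheme)
    print(scheme, shared_l2_count(cpus))
```

For 16 processes on a 32-processor node, this numbering gives 8 shared L2 caches under the consecutive placement and none under the alternating placement.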

Once memory contention becomes an issue, there is a relatively smooth performance degradation (in single processor performance) as a function of the number of active processors. If the processes are placed to eliminate L2 cache sharing, this can improve performance for the "mid-range" problem sizes. Forcing processes to share caches decreases performance in the same regime. However, for the largest problem sizes, process placement is not as important as the number of processes.

The final set of graphs compares LPAR and non-LPAR performance when using 16 processors (in a 32-processor p690).

These results confirm the earlier observations. All things being equal, the performance when using LPAR nodes is similar to or better than when using non-LPAR nodes. However, if using LPARs increases the probability of L2 cache sharing, LPAR performance will be worse.


URL http://www.csm.ornl.gov/evaluation/CHEETAH/PSTSWM.Config.html
Updated: Thursday, 18-Apr-2002 09:09:17 EDT