
Evaluation of Early Systems


IBM p690 Results

Oak Ridge National Laboratory received 24 IBM 32-way p690 "Turbo" shared-memory nodes in December 2001 and 3 more nodes in early 2002. As of May 2002, the final configuration and production system software are just being completed. As part of the functionality testing, extensive performance evaluation studies were undertaken on a single p690 node. The following results represent achievable lower bounds, in that the MPI libraries have not yet been tuned for the p690, and the compiler and operating system continue to evolve.

System Configuration Studies

  • OS Upgrade Tests. Comparing the performance when using AIX 5.1C, AIX 5.1D, and AIX 5.1D with large pages.

  • Configuration Tests. Comparing the performance of the different node types in the ORNL p690 system (using AIX 5.1C).

  • Fortran Compiler Tests. Evaluating the performance sensitivity of two serial benchmarks to changes in compiler options (using AIX 5.1C).

Benchmark Studies

  • Tom Dunigan's results, including
  • EVH1 benchmark. The Enhanced Virginia Hydrodynamics #1 (EVH1) application represents an important kernel in the "TeraScale Simulations of Neutrino-Driven Supernovae and Their Nucleosynthesis" SciDAC project. EVH1 is an MPI code, and the following scaling results are for a fixed size problem.

    The POWER4 system shows excellent scalability up to 32 processors for this problem size (as do the POWER3 II, EV67, and EV68 systems), outperforming the POWER3 system by more than a factor of 3 when using 32 processors. For this particular problem size, no more than 128 processors can be used, and communication costs affect scaling on all platforms when using more than 64 processors.

  • AORSA3D benchmark. The AORSA-3D code solves for the wave electric field and heating in a 3-D stellarator plasma heated by radio frequency waves using an all-orders spectral algorithm. It represents an important kernel in the "Numerical Computation of Wave-Plasma Interactions in Multi-dimensional Systems" SciDAC project. AORSA3D is an MPI code that uses ScaLAPACK to solve linear systems arising from the spectral discretization.

    AORSA3D is typically run in a scale-up mode, where the number of modes retained by the model is increased as the number of processors is increased, keeping the memory size per processor approximately constant. The following experimental results describe the performance in terms of the ratio of the number of modes to the execution time. The scaling behavior as a function of the number of processors is not important. If N is the number of modes, then the total memory requirement is O(N**2), while the computational complexity contains both O(N**2) and O(N**3) terms. Thus the ratio necessarily decreases for increasing N once the O(N**3) term becomes dominant.
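    The scale-up rule described above can be sketched in a few lines. This is only an illustration, assuming total memory grows exactly as N**2 and is spread evenly over the processors; the function names and the baseline numbers in the usage note are hypothetical and are not taken from the AORSA3D runs.

```python
def processors_for_modes(n_modes, base_modes, base_procs):
    """Processor count that keeps memory per processor constant.

    Assumes total memory scales as N**2, so the processor count
    must scale as N**2 as well.
    """
    scale = (n_modes / base_modes) ** 2
    return base_procs * scale


def modes_per_second(n_modes, runtime_seconds):
    """Figure of merit used in the text: number of modes divided by execution time."""
    return n_modes / runtime_seconds
```

    Under these assumptions, doubling the number of modes from a hypothetical baseline of 100 modes on 16 processors calls for 64 processors, since `processors_for_modes(200, 100, 16)` evaluates to 64.0.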

  • PCTM benchmark runs. The Parallel Climate Transitional Model (PCTM) is the next generation of the Parallel Climate Model. It is made up of atmosphere, ocean, land surface, and sea ice component models, and a coupler to exchange fluxes between the component models. The atmospheric model is a recent version of the Community Climate Model, developed at the National Center for Atmospheric Research (NCAR). The ocean model is POP (Parallel Ocean Program), developed at Los Alamos National Laboratory (LANL), the Naval Postgraduate School (NPS), and NCAR. PCTM is used in production on the IBM SPs at both ORNL and NERSC. Detailed performance results from last year are available here.

    The following graph plots the number of years that can be simulated in one day as a function of the number of processors.

  • CCM/MP-2D benchmark. CCM/MP-2D is the massively parallel implementation of version 3.6.6 of the Community Climate Model (CCM). It was developed originally to determine how best to parallelize the CCM, and the results from this research are being used in the parallelization of the Community Atmospheric Model (CAM). CCM/MP-2D is currently used for benchmarking parallel systems.

    The first two graphs describe the observed computational rate as a function of processor count for two fixed problem sizes. The flop count was determined using SpeedShop on an SGI Origin. As this count gives equal weight to all floating point operations (including, for example, square root), its meaning is somewhat suspect. The graphs are best interpreted as consistently weighted plots of inverse runtime.

    The second two graphs describe the sources of the performance loss on the p690 system when increasing the number of processors.

    The sources are graphed cumulatively, so the top curve represents the total loss in efficiency compared to a single processor run. For 32 processors, this loss is 49% for problem T42L18. Approximately 16% of the loss is due to communication costs. While other sources (load imbalance, copy costs, duplicate computation) contribute to the parallel overhead, the primary source of performance degradation for larger processor counts is in the "other" category, which we attribute to a decrease in computational rate. This could be due to locality issues or memory contention. It could also be due to the effect of the change in granularity from the domain decomposition (loops getting shorter, resulting in less efficient streaming?). It is probably a function of all three, and it will be interesting to see how this changes with later versions of the system software. It will also be interesting to measure performance when using more than 32 processors. Up to 32 processors, all processors are on the same node, so the memory demands on that node actually go up even though the per-process memory granularity decreases as a function of the number of processors. Involving two nodes (64 processors) will increase MPI communication costs, but will also approximately halve the per-node memory requirements.
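    The efficiency bookkeeping behind these graphs can be sketched as follows. This is a minimal illustration using the usual definition of parallel efficiency (single-processor time divided by processor count times parallel time); the function names and the per-source numbers in the usage note are hypothetical, and the actual instrumentation used for the CCM/MP-2D measurements is not described here.

```python
def parallel_efficiency(t_serial, t_parallel, n_procs):
    """Efficiency relative to a single-processor run (1.0 means perfect scaling)."""
    return t_serial / (n_procs * t_parallel)


def effective_speedup(total_loss, n_procs):
    """Speedup implied by a total efficiency loss, given as a fraction in [0, 1]."""
    return n_procs * (1.0 - total_loss)


def cumulative_losses(losses):
    """Stack per-source losses so that the last entry is the total loss.

    `losses` maps a source name to its loss fraction; insertion order
    decides the stacking order, as in a cumulative plot.
    """
    running = 0.0
    stacked = []
    for name, fraction in losses.items():
        running += fraction
        stacked.append((name, running))
    return stacked
```

    With the 49% total loss quoted for T42L18 on 32 processors, `effective_speedup(0.49, 32)` is roughly 16.3, that is, about half of the ideal speedup of 32.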

  • PSTSWM processor benchmark. PSTSWM represents an important computational kernel in spectral global atmospheric models. As 99% of floating point operations are multiply or add, it runs well on systems optimized for these operations. It also exercises the memory subsystem as the problem size is scaled and can be used to evaluate the impact of memory contention in SMP nodes. In the following plot, the effect of increasing problem size on single processor performance is examined. Each problem size requires approximately 4 times as much memory as the next smaller problem size, with T5 needing approx. 60KB of data space and T170 needing approx. 76MB.

    Details can be viewed here and here.
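    The memory growth quoted above can be checked with a short calculation. This is a rough sketch: the list of intermediate problem sizes (T10 through T85) is an assumption, since only T5 and T170 are named in the text, and the helper names are hypothetical.

```python
# Assumed sequence of problem sizes; only T5 and T170 appear in the text.
RESOLUTIONS = ["T5", "T10", "T21", "T42", "T85", "T170"]


def estimated_memory_kb(base_kb=60.0, growth=4.0):
    """Memory per problem size, assuming a constant growth factor per step."""
    return {name: base_kb * growth ** i for i, name in enumerate(RESOLUTIONS)}


def implied_growth(smallest_kb=60.0, largest_kb=76.0 * 1024):
    """Per-step growth factor implied by the quoted end points (1 MB = 1024 KB)."""
    steps = len(RESOLUTIONS) - 1
    return (largest_kb / smallest_kb) ** (1.0 / steps)
```

    Under these assumptions a strict factor of 4 per step gives 61440 KB (60 MB) for T170, and the quoted end points imply a factor of roughly 4.2 per step, which is consistent with "approximately 4 times".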

  • PCRM processor benchmark. PCRM is a slightly modified version of the column radiation model used in version 3.6.6 of the NCAR Community Climate Model (CCM). It (also) represents an important computational kernel in atmospheric models. It differs significantly from PSTSWM in that over 6% of the floating point operations are divide or square root. This makes vastly different demands on the processor and compiler.

    The data structures used by this model are of the form (longitude, vertical, latitude). The experiment being described has a fixed problem size: 512 columns, each with 18 vertical levels. What is being varied is the length of the longitude and latitude dimensions, keeping their product fixed at 512.

    From these results, the POWER4 processor and compiler have the same qualitative performance characteristics as the POWER3 II, but with significantly improved quantitative performance. All the same, if higher order transformations cannot be tolerated for numerical accuracy reasons, then performance is seriously degraded. More details on the experiment can be viewed here.
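    The set of array shapes covered by the PCRM experiment can be enumerated directly. This is an illustrative sketch only: it assumes every integer factorization of 512 is a candidate, whereas the experiment may have used only a subset, and the function names are hypothetical.

```python
def decompositions(n_columns=512):
    """All (longitude, latitude) lengths whose product equals n_columns."""
    return [
        (lon, n_columns // lon)
        for lon in range(1, n_columns + 1)
        if n_columns % lon == 0
    ]


def array_shapes(n_columns=512, n_levels=18):
    """Shapes in the (longitude, vertical, latitude) layout described in the text."""
    return [(lon, n_levels, lat) for lon, lat in decompositions(n_columns)]
```

    Because 512 is a power of two, this yields ten shapes, from (1, 18, 512) to (512, 18, 1), each holding the same 512 columns of 18 levels.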




   
URL: http://www.csm.ornl.gov/evaluation/CHEETAH/index.html
Updated: Wednesday, 18-Sep-2002 11:21:13 EDT