
Performance of AORSA3D (Jan02) on the NEC SX-6

NEC SX6 Performance Evaluation

The AORSA-3D code solves for the wave electric field and heating in a 3-D stellarator plasma heated by radio-frequency waves, using an all-orders spectral algorithm. It represents an important kernel in the "Numerical Computation of Wave-Plasma Interactions in Multi-dimensional Systems" SciDAC project.

AORSA3D is an MPI code that uses SCALAPACK to solve linear systems arising from the spectral discretization. AORSA3D has three major computational phases:

  1. matrix generation
  2. complex linear system solution
  3. current calculation

The January 2002 version of the code differs from the previous version of the benchmark in that some unnecessary work in the matrix generation and current calculation phases was eliminated.

In porting AORSA3D, the vendor-installed version of SCALAPACK was used, which, as shown later, achieved excellent performance. The code was compiled using "-C hopt" optimization for all but one file. The problematic file was compiled with "-C vopt" optimization.

AORSA3D is typically run in a scale-up mode, where the number of modes retained by the model is increased as the number of processors is increased, keeping the memory size per processor approximately constant. The following experimental results describe the performance in terms of the ratio of the number of modes to the execution time. The scaling behavior as a function of the number of processors is not important. If N is the number of modes, then the total memory requirement is O(N**2), while the computational complexity contains both O(N**2) and O(N**3) terms. Thus the ratio necessarily decreases for increasing N once the O(N**3) term becomes dominant.
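The scale-up rule and the growing weight of the cubic term can be illustrated with a small model. This is a sketch only: the function names, the baseline of 1000 modes on one processor, and the cost coefficients `c2` and `c3` are hypothetical and are not measured AORSA3D values. It also assumes that the dominant storage is a dense N x N matrix of double-precision complex entries spread evenly over the processors.

```python
import math

BYTES_PER_COMPLEX = 16  # assumption: double-precision complex matrix entries


def modes_for_processors(p, n0, p0):
    """Mode count N that keeps N**2 / P equal to the baseline n0**2 / p0."""
    return n0 * math.sqrt(p / p0)


def matrix_bytes_per_processor(n, p):
    """Per-processor share of a dense N x N complex matrix, evenly distributed."""
    return BYTES_PER_COMPLEX * n * n / p


def cubic_fraction(n, c2, c3):
    """Fraction of the modeled work c2*N**2 + c3*N**3 that comes from the cubic term."""
    cubic = c3 * n**3
    return cubic / (c2 * n**2 + cubic)


if __name__ == "__main__":
    n0, p0 = 1000, 1          # hypothetical baseline run
    c2, c3 = 1.0, 1.0e-3      # hypothetical cost coefficients
    for p in (1, 4, 16, 64):
        n = modes_for_processors(p, n0, p0)
        mem_mb = matrix_bytes_per_processor(n, p) / 1.0e6
        frac = cubic_fraction(n, c2, c3)
        print(f"P={p:3d}  N={n:8.0f}  MB/proc={mem_mb:6.1f}  cubic share={frac:.2f}")
```

In this model the memory per processor stays fixed because N grows as the square root of the processor count, while the share of work in the O(N**3) term rises steadily with N and passes one half at N = c2/c3.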

From these results, AORSA3D does not perform well on the SX-6 for small problem sizes, but has an advantage for large problem sizes. The following three graphs compare the runtime of each computational phase on the SX-6 with that on the IBM p690. In these plots, larger values signify worse performance.

From these results, the two phases that have not been modified for vectorization and that do not call math library routines do not perform well on the SX-6, running 3.5X to 4X slower than on the p690. In contrast, the linear system solve achieves very high performance on the SX-6, approximately 3X faster than on the p690. The following graph describes the performance of the linear system solution in terms of computational rate (so larger values signify better performance).

Thus the linear system solution is achieving 75% of peak on the SX-6/8.
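A fraction-of-peak figure of this kind is the achieved computational rate divided by the aggregate peak rate. The sketch below shows the arithmetic under stated assumptions: a peak of 8 GFlop/s per SX-6 processor (64 GFlop/s for 8 processors) and the standard count of 8/3 N**3 real floating-point operations for a complex LU factorization. The function names, the problem size, and the solve time in the usage example are hypothetical and are not measurements from this evaluation.

```python
SX6_PEAK_GFLOPS_PER_CPU = 8.0  # assumption: vendor peak rate per SX-6 processor


def complex_lu_flops(n):
    """Approximate real floating-point operations for LU of an n x n complex matrix."""
    return (8.0 / 3.0) * n**3


def achieved_gflops(n, seconds):
    """Computational rate in GFlop/s for a complex solve of order n taking `seconds`."""
    return complex_lu_flops(n) / seconds / 1.0e9


def fraction_of_peak(gflops, cpus, peak_per_cpu=SX6_PEAK_GFLOPS_PER_CPU):
    """Achieved rate as a fraction of the aggregate peak of `cpus` processors."""
    return gflops / (cpus * peak_per_cpu)


if __name__ == "__main__":
    # Hypothetical example: an order-30000 system solved in 1500 seconds on 8 CPUs.
    rate = achieved_gflops(30000, 1500.0)
    print(f"rate = {rate:.1f} GFlop/s, fraction of peak = {fraction_of_peak(rate, 8):.2f}")
```

Under these assumptions, 75% of peak on 8 processors corresponds to a sustained rate of about 48 GFlop/s.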


URL http://www.csm.ornl.gov/evaluation/SX6/aorsa3d.jan.sx6.html
Updated: Wednesday, 03-Jul-2002 18:32:14 EDT