Preliminary NEC SX-6 Results
Cray invited researchers at Oak Ridge National Laboratory to evaluate the performance of a NEC SX-6/8 demonstration system. This system was sited at Chippewa Falls when the evaluation began, but was moved to the Arctic Region Supercomputing Center in mid-June 2002. The results presented here are very preliminary and subject to numerous qualifiers. Our first step was to run codes that we had ported to an SX-5 system in May 2001. Note that we do not claim that these ports represent the best possible optimizations of the codes for the vector architecture.
- PSTSWM SX-6 Experiments (May 2002)
- While subject to some interpretation, the PSTSWM results indicate the following.
- Running the code without modifying it for vectorization gave very poor performance (never more than 500 MFlops/sec, and typically less than 250 MFlops/sec).
- Modest modifications were sufficient to achieve 2.7 GFlops/sec in the best case. The Fourier transform (coded in Fortran) achieved as much as 50% of peak. The Legendre transform (coded in Fortran) achieved as much as 25% of peak. These are the performance critical operations.
- Performance increases with problem size, up to a point. Performance begins to decrease for the largest horizontal resolutions. This may indicate that different code modifications are required for different problem sizes. The code currently specifies all loop lengths at runtime. Compile-time specification of problem size may also have a performance impact.
- Running instances of the code on all processors of the SMP node simultaneously showed some performance degradation, but much less than is seen on the IBM systems.
- System comparisons using a climate-size problem resolution (T42L18):
- a single processor in the SX-6/8 is 2.5 times faster than a single processor in an IBM p690
- an 8 processor SX-6 SMP node has a somewhat greater throughput (30%) than a 32 processor p690 when making simultaneous serial runs
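The two system comparisons above can be combined into a rough consistency check. The sketch below is a back-of-the-envelope calculation, not part of the original measurements. It assumes that the 2.5x single-processor ratio was measured on otherwise idle nodes, and that node throughput equals processor count times single-processor rate times a per-node efficiency factor. The function name and the efficiency factor are illustrative.

```python
def implied_efficiency_ratio(sx6_procs, p690_procs,
                             single_proc_speedup, node_throughput_ratio):
    """Ratio of loaded-node efficiencies (SX-6 over p690) implied by the figures.

    Assumed model (not taken from the benchmark report):
        node_throughput = procs * single_proc_rate * efficiency
    """
    # Throughput ratio we would see if neither node lost any performance
    # when all of its processors were busy.
    ideal_ratio = (sx6_procs * single_proc_speedup) / p690_procs
    # Whatever is left over must come from the efficiency factors.
    return node_throughput_ratio / ideal_ratio


# 8 SX-6 processors vs 32 p690 processors, 2.5x per processor, 30% more throughput.
ratio = implied_efficiency_ratio(8, 32, 2.5, 1.3)
print(f"SX-6 / p690 loaded-node efficiency ratio: {ratio:.2f}")
```

Under these assumptions the SX-6 node retains roughly twice as large a fraction of its single-processor rate as the p690 does, which agrees with the observation that simultaneous runs degrade much less on the SX-6 than on the IBM systems.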
- PCRM SX-6 Experiments (June 2002)
- The PCRM results indicate the following.
- As with PSTSWM, both sufficient vector length and appropriate compiler options were required to achieve good vector performance. The dynamic performance range is over a factor of 40, and the best observed performance is over 40% of peak for a single processor run.
- PCRM was designed to be run on a vector system. However, the programming style is not inappropriate to non-vector processors, and this code is a "fair" benchmark. Note that the column radiation model in the most recent NCAR global atmospheric model is no longer vector friendly, and at least one of the new science algorithms appears to be difficult to vectorize efficiently. Reintroducing vectorization may require "modifying" the science, which may not be acceptable.
- Memory contention is not a significant performance problem for this benchmark (code and problem size) on either the SX-6 or the other platforms for which we have data.
- System comparisons using a climate-size problem resolution (18 vertical levels):
- For long vectors, the SX-6 is 6 times faster than the IBM POWER4.
- As vectors shorten (which will occur as parallelism is exploited), the nonvector systems become competitive with or better than the vector systems.
The second step of the evaluation was to port and benchmark codes that have not been run on NEC vector systems before, but for which we have recent benchmark data on other systems. Initially, we did not attempt any optimization other than experimentation with compiler options. After collecting this baseline data, we used profiling to identify the performance critical routines and attempted to improve the vectorization of these. At most a few days were spent on the code optimizations, and more performance could be obtained with more extensive code modifications.
- AORSA3D SX-6 Experiments (June 2002) - August, 2001 version
- Two different versions of AORSA3D were used for benchmarking, both written by Fred Jaeger at Oak Ridge National Laboratory. AORSA3D is an important kernel in the "Numerical Computation of Wave-Plasma Interactions in Multi-dimensional Systems" SciDAC project.
The first version of the code we used was released in August, 2001. Benchmark data is available for a number of different platforms using this version. The performance results indicate the following.
- As in the PSTSWM and PCRM results, code that has not been explicitly modified for vectorization performs poorly, while code that has been modified, or which calls vendor-supplied math library subroutines, performs very well. The runtime of AORSA3D is dominated by the solution of the linear system on the nonvector platforms. On the SX-6/8 the opposite is true: the linear system solution is very efficient, and the other phases of the code dominate the runtime.
- As no attempt has been made to restructure the code that is currently not vectorizing well, we do not know what performance is possible.
- In its current form, AORSA3D is approximately a factor of two slower on the SX-6/8 than on the IBM p690 for all problem sizes we examined. For larger problem sizes, the SX-6/8 performance will improve relative to that of the other platforms as the computational complexity of the linear system solution increases relative to that of the other computational phases.
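The argument in the last point can be illustrated with a simple cost model. The sketch below is not derived from the benchmark data. It assumes that the linear system solution costs on the order of n^3 operations and that all other phases together cost on the order of n^2, where n is the matrix dimension. The relative rates and cost coefficients are hypothetical values chosen only to show the trend.

```python
def modeled_runtime(n, solve_rate, other_rate,
                    solve_coeff=1.0, other_coeff=1000.0):
    """Modeled runtime: an O(n^3) solve plus O(n^2) for everything else.

    All coefficients and rates are hypothetical, in arbitrary units.
    """
    return solve_coeff * n**3 / solve_rate + other_coeff * n**2 / other_rate


def vector_to_nonvector_ratio(n):
    """Modeled runtime on the vector system divided by the nonvector runtime."""
    # Hypothetical relative rates: the vector system is 3x faster in the
    # solve but 10x slower in the poorly vectorized remaining phases.
    t_vector = modeled_runtime(n, solve_rate=3.0, other_rate=0.1)
    t_nonvector = modeled_runtime(n, solve_rate=1.0, other_rate=1.0)
    return t_vector / t_nonvector


for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7}: vector/nonvector runtime = {vector_to_nonvector_ratio(n):.2f}")
```

In this model the ratio starts above 1 for small n, where the poorly vectorized phases dominate, and falls toward the ratio of the solve rates as n grows. Where the crossover actually occurs depends entirely on the real coefficients, which this sketch does not attempt to estimate.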
- AORSA3D SX-6 Experiments (July 2002) - January, 2002 version
- The second version of AORSA3D was released in January, 2002. It has slightly improved science, and unnecessary work in the matrix generation and current calculation phases was eliminated. As the linear system solution dominates the runtime on nonvector systems, we had not previously updated the benchmark to this version. However, the modifications are in the vector-unfriendly parts of the code, and this version performs much better on the SX-6. So far we have only been able to collect performance data for this version of AORSA3D on the IBM p690 and NEC SX-6/8.
- As before, the code that has not been explicitly modified for vectorization performs poorly. This code is less important in the Jan02 version of AORSA3D. Performance on the SX-6 is less than that of the IBM p690 for the small test cases, but is 50% better for the largest benchmark problem. The linear system solution is a factor of 3 faster on the SX-6 than on the p690 on a per processor basis, and larger problem sizes will be dominated by the linear system solution.
- Profiling and performance counters were used to identify code that performs poorly on the SX-6. Attempts to improve the performance of these routines were not successful. Talking with the developers, it is clear that the matrix creation and current calculation phases could be made more vectorizable, but this would require a significant reworking of the code.
- EVH1 SX-6 Experiments (July 2002)
- EVH1 is a hydrodynamics code originally written by John Blondin at North Carolina State University and is an important kernel in the "TeraScale Simulations of Neutrino-Driven Supernovae and Their Nucleosynthesis" SciDAC project. These performance results indicate the following.
- Two of the EVH1 routines do not run well on the SX-6, but these were easily modified. Remaining performance problems appear to be a function of vector length. Moving the outer loops over coordinate direction into the subroutines would likely correct this problem, but would require essentially rewriting the code.
- The modified version of the code running on the SX-6 is twice as fast as the original code running on the IBM p690, achieving approximately 15% of peak on the SX-6.
The third step of the evaluation was to port and benchmark codes designed to evaluate the performance of specific subsystems.
- COMMTEST SX-6 Experiments (July 2002)
- COMMTEST is a suite of codes for evaluating interprocessor communication performance. When using a single node of the SX-6/8, we are evaluating the shared memory implementation of MPI. The performance results indicate the following.
- Latency (as measured by these test codes) was approximately 6 microseconds, while maximum observed bidirectional bandwidth was over 14 GBytes/sec. The largest message size used in these experiments was 2MB, and higher bandwidth would have been achieved if larger message sizes had been used.
- All processors simultaneously sending and receiving data degraded the single process bandwidth when swapping messages larger than 32KB, but never by more than 30% in these experiments. The maximum observed node bandwidth was approximately 48 GBytes/sec. Using message sizes larger than 2MB would have achieved even higher performance.
- The SX-6/8 and IBM p690 have almost identical performance for message sizes of 2KB or smaller when a single pair of processors is communicating. For the largest messages, SX-6/8 performance is seven times better than p690 performance.
- When all processors are communicating in the node, the SX-6/8 always achieves better performance than the p690, as much as 20 times better when swapping 2MB messages.
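The observation that larger messages would have achieved higher bandwidth follows from the usual linear cost model for message passing, in which transfer time is a fixed latency plus message size divided by an asymptotic bandwidth. The sketch below uses the measured latency of about 6 microseconds. The asymptotic bandwidth of 16 GBytes/sec is a hypothetical value chosen for illustration, not a measured figure, and real MPI implementations may change protocols with message size, so the model is only an idealization.

```python
LATENCY_S = 6e-6                  # approximate measured latency from the text
ASYMPTOTIC_BW = 16e9              # bytes/sec; hypothetical, for illustration only


def transfer_time(message_bytes, latency_s=LATENCY_S, peak_bw=ASYMPTOTIC_BW):
    """Linear cost model: fixed startup latency plus size over bandwidth."""
    return latency_s + message_bytes / peak_bw


def effective_bandwidth(message_bytes, latency_s=LATENCY_S, peak_bw=ASYMPTOTIC_BW):
    """Bandwidth actually achieved for one message of the given size."""
    return message_bytes / transfer_time(message_bytes, latency_s, peak_bw)


def half_bandwidth_size(latency_s=LATENCY_S, peak_bw=ASYMPTOTIC_BW):
    """Message size at which half of the asymptotic bandwidth is achieved."""
    return latency_s * peak_bw


for size in (2 * 1024, 32 * 1024, 2 * 1024**2, 32 * 1024**2):
    gbytes = effective_bandwidth(size) / 1e9
    print(f"{size:>10} bytes: {gbytes:.2f} GBytes/sec")
```

In this model the effective bandwidth rises monotonically with message size and approaches, but never reaches, the asymptotic value, so the bandwidth observed at 2MB is a lower bound on what the hardware can deliver.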