PSTSWM on the Cray X1

Platform Comparisons

The following four graphs compare X1 performance with that of a number of different platforms. The first two look at performance when using a single processor within an SMP node, with 1 and 18 vertical levels, respectively.
These data show the typical vector performance signature for the Cray X1. For small problems, the vector hardware is not used efficiently, and the nonvector systems perform better. For larger problem sizes, the Cray X1 performs very well compared to the nonvector systems, both because the vector hardware is used more efficiently and because performance on the nonvector systems degrades as it becomes more dependent on the memory subsystem.

Specifying the number of vertical levels at compile time is very important on the Cray X1 when using only one vertical level. We have not yet examined this approach on the other systems, but we do not expect it to change the performance on either the SX-6 or the nonvector systems. (In the SX-6 port, we compress the loops over vertical levels with other loops, so compile-time specification of the number of vertical levels should not change the performance of this approach.)

Note that we are not confident that the SX-6 performance reported here cannot be improved. We worked very hard to optimize performance, trying many different implementations of PSTSWM on the SX-6, and report the best observed results here. This implementation is as different from the X1 port as it is from the original version of the code. As indicated earlier, the SX-6 version of the code is not efficient on the X1, and the X1 port is not efficient on the SX-6. The SX-6 performance is similar to that of the Cray X1 for the small problem sizes, but it does not scale well with horizontal problem size. We have asked NEC to comment on our SX-6-specific optimizations.

Also note that SSP performance on the X1 does not significantly exceed the performance of the nonvector systems for this code. With an SSP peak performance of 3.2 GFlop/s, as compared to a peak of 5.2 GFlop/s on the IBM p690, this is not unexpected. However, as argued earlier, the MSP is a more natural choice of processor for the Cray X1. The Cray compiler effectively uses all four SSPs on a given problem without resorting to explicit user-level parallelism via, for example, OpenMP or MPI.

The next two graphs examine single-processor performance when all processors in the SMP node are solving the same problem. Thus, 32 instances of the given problem size are solved simultaneously on the IBM p690, while only 4 instances are solved simultaneously on the Cray X1. This comparison is fair: 32 instances on the Cray X1 would use 8 SMP nodes, the performance on one SMP node does not affect the performance on other SMP nodes (in this experiment), and the per-processor performance is identical for the 32-processor and 4-processor experiments on the Cray X1. (This has been verified, but the results are not shown here.)
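
The arithmetic behind the peak-rate and node-count remarks above can be checked with a few lines of Python. This is only a sketch: the constants are the figures quoted in the text, while the function names and the assumption that an MSP's peak is simply the sum of its four SSP peaks are ours.

```python
# Figures quoted in the text above.
SSP_PEAK_GFLOPS = 3.2      # peak of one Cray X1 SSP
P690_PEAK_GFLOPS = 5.2     # peak quoted for the IBM p690
SSPS_PER_MSP = 4           # "all four SSPs" make up one MSP
X1_MSPS_PER_NODE = 4       # 4 simultaneous instances per X1 SMP node
P690_PROCS_PER_NODE = 32   # 32 simultaneous instances per p690 SMP node


def nodes_needed(instances: int, procs_per_node: int) -> int:
    """SMP nodes needed to run one problem instance per processor."""
    return -(-instances // procs_per_node)  # ceiling division


def msp_peak_gflops() -> float:
    """Assumption: the MSP peak is the sum of its SSP peaks."""
    return SSPS_PER_MSP * SSP_PEAK_GFLOPS


if __name__ == "__main__":
    print("X1 nodes for 32 instances:", nodes_needed(32, X1_MSPS_PER_NODE))
    print("p690 nodes for 32 instances:", nodes_needed(32, P690_PROCS_PER_NODE))
    print("SSP peak / p690 peak:", SSP_PEAK_GFLOPS / P690_PEAK_GFLOPS)
    print("MSP peak / p690 peak:", msp_peak_gflops() / P690_PEAK_GFLOPS)
```

Under that additive assumption, a single SSP has a lower peak than the p690 figure while an MSP has a higher one, which is consistent with the observation that SSP performance does not significantly exceed that of the nonvector systems.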
As noted earlier, the Cray X1 does not show performance degradation from memory contention in this experiment. The X1 is the only system among those examined with this behavior. Even the SX-6 shows some performance degradation when running on all processors simultaneously. This performance degradation is primarily a function of memory contention. Experiments were also run with all but one processor in the SMP node, all but two, and so on, and the performance degradation on these systems (other than the X1) is roughly linear in the number of simultaneous problem instances. This behavior further shifts the performance comparison in favor of the Cray X1 relative to the other platforms.

The next four graphs describe single-processor performance for fixed horizontal resolutions (T42 and T85) when varying the number of vertical levels. The first two describe performance when only a single processor is computing, while the second two describe performance when all processors are computing the same problem simultaneously. Note that data on the performance of compile-time specification of the number of vertical levels were collected for 1, 18, and 66 levels only.
These views of the performance data demonstrate that performance on the Cray X1 is essentially an increasing function of the number of vertical levels when that number is small, eventually approaching an asymptote. A significant percentage of the maximum performance is reached by 18 vertical levels. In contrast, on the nonvector systems, performance is a decreasing function of the number of vertical levels, also appearing to approach an asymptote for more than 18 vertical levels. Running simultaneous instances of the code does not change this performance behavior on the Cray X1; on the nonvector systems it lowers the performance without changing the trends.
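
A rising curve that saturates, as described above for the Cray X1, is often summarized for vector machines with Hockney's two-parameter model, in which the sustained rate at problem size n is r_inf * n / (n + n_half). The sketch below is illustrative only: using the number of vertical levels as the size parameter is our assumption, and the parameter values are hypothetical and not fitted to the PSTSWM measurements.

```python
def hockney_rate(n: float, r_inf: float, n_half: float) -> float:
    """Sustained rate at size n under Hockney's model: r_inf * n / (n + n_half)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return r_inf * n / (n + n_half)


# Hypothetical parameters, NOT fitted to any data in this report.
R_INF = 1.0    # asymptotic rate, normalized to 1
N_HALF = 4.0   # size at which half of the asymptotic rate is reached

if __name__ == "__main__":
    # The level counts are those for which compile-time data were collected.
    for levels in (1, 18, 66):
        fraction = hockney_rate(levels, R_INF, N_HALF)
        print(f"{levels:3d} levels: {fraction:.2f} of asymptotic rate")
```

In this model the rate increases monotonically with n and never exceeds r_inf, which matches the qualitative shape reported for the X1. It says nothing about the decreasing trend on the nonvector systems, which the text attributes to the memory subsystem.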
The PSTSWM experiments described here represent an extreme case in
many ways, with strong emphasis on the performance of the memory
hierarchy.
PSTSWM is, however, representative of the dynamics in spectral
atmospheric models, and shows that vector systems are appropriate
platforms for such models when they are written in a
vector-friendly way. In contrast, the scalar
performance of the Cray X1 and the SX-6 is much worse than the scalar
performance of the Compaq and IBM systems. In summary, performance
is likely to be very dependent on the application code.
URL: http://www.csm.ornl.gov/evaluation/PHOENIX/PSTSWM-platform.CRAYX1.html Updated: Thursday, 19-Jun-2003 13:07:48 EDT