PSTSWM on the Cray X1


Code Version Comparisons

These experiments compare the single-processor (MSP) performance of four implementations of PSTSWM:
  • original version
  • port to NEC SX-6
  • port to Cray X1
  • port to Cray X1 with vertical dimension declared at compile time.
The numerical algorithms and global data structures are identical in all of these versions. The only changes are to the compiler directives, loop structures, and local array definitions in computation-intensive "leaf" routines. One option we examined is fixing the vertical problem dimension at compile time. In each computational phase of PSTSWM, two of the three problem dimensions (longitude, latitude, vertical) are available for vectorization and streaming, and one of the two is always the vertical. Specifying the vertical dimension at compile time gives the compiler additional information for optimization. While this restricts the choice of domain decomposition and parallel algorithms in a parallel run, it is not an unreasonable restriction in many situations.

The following graphs compare the performance of the four implementations when compiled with agg2 optimization and run with 64 MByte pages. The first three figures plot performance as a function of horizontal resolution for 1, 18, and 66 vertical levels. The other two figures plot performance as a function of the number of vertical levels for fixed horizontal resolutions of T42 and T85.

These data show that the original code does not perform well on the X1, apparently running primarily on the scalar unit. The version ported and tuned for the SX-6 also performs worse on the X1 than the version ported and tuned specifically for the X1, most obviously at large horizontal resolutions. Finally, fixing the number of vertical levels at compile time improves performance significantly for small numbers of vertical levels. The current hypothesis is that the compiler attempts to stream over the vertical level loop, which leaves hardware idle if there are fewer than 4 levels. When loop lengths are specified at compile time, the compiler can make more appropriate decisions. This interpretation is consistent with the observation that, in the compile-time cases, performance does not increase much as the number of vertical levels increases. In contrast, performance increases significantly from 1 to 18 vertical levels when the number of vertical levels is specified at runtime.
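The idle-hardware hypothesis can be illustrated with a toy occupancy calculation. The sketch below is not taken from PSTSWM and is not a performance measurement. It assumes that streaming hands out one vertical level per SSP per pass across the four SSPs of an MSP, and it ignores vector length, memory traffic, and anything else the compiler does. The function name `streaming_utilization` is hypothetical.

```python
import math

SSPS_PER_MSP = 4  # a Cray X1 MSP is built from four SSPs


def streaming_utilization(nlev, units=SSPS_PER_MSP):
    """Fraction of streaming units kept busy when nlev loop iterations
    are dealt out one per unit per pass (a deliberate simplification)."""
    if nlev < 1:
        raise ValueError("nlev must be at least 1")
    passes = math.ceil(nlev / units)   # passes needed to cover all levels
    return nlev / (passes * units)     # busy slots / available slots


if __name__ == "__main__":
    # Vertical level counts used in the experiments above.
    for nlev in (1, 18, 66):
        print(f"{nlev:2d} levels: {streaming_utilization(nlev):.2f}")
```

Under these assumptions a single level occupies one quarter of the streaming units, 18 levels occupy 0.90, and 66 levels occupy about 0.97. That pattern matches the direction of the runtime-specified results, where performance rises sharply between 1 and 18 levels, but it says nothing about the absolute rates.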




   
 
URL: http://www.csm.ornl.gov/evaluation/PHOENIX/PSTSWM-code.CRAYX1.html
Updated: Saturday, 01-Nov-2003 20:02:08 EST
