PSTSWM on the Cray X1

Introduction

PSTSWM was originally designed as a testbed for parallel spectral algorithms on the sphere. Its serial performance, especially its memory access pattern, is similar to that of the spectral dynamical core used in the NCAR global atmospheric models. By scaling the problem size (horizontal and vertical resolutions), the effect of the memory subsystem on performance can be ascertained. Running multiple versions of the serial code simultaneously allows us to examine the performance impact of memory contention as well. The X1 performance data were collected by Patrick H. Worley in April, May, and June of 2003.

PSTSWM is dominated by floating point multiply/add operations, and can also take advantage of math library FFT routines. (Fortran implementations of the FFT routines were used for the Cray X1 experiments described below.) PSTSWM also exhibits little reuse of operands as it sweeps through the field arrays, making significant demands on the memory subsystem. Given these characteristics, the code should run well on vector systems. However, in the standard version of PSTSWM, all array dimensions and loop bounds are determined at runtime. This makes it difficult for the compiler to identify which loops to vectorize or parallelize.

Architecture Summary

The Cray X1 at ORNL is hierarchical in both processor and memory design. The basic building block is the multi-streaming processor (MSP), which is capable of 12.8 GFlops/sec for 64-bit operations. Each MSP comprises four single-streaming processors (SSPs), each with two 32-stage 64-bit floating point vector units and one 2-way super-scalar unit. The SSP uses two clock frequencies, 800 MHz for the vector units and 400 MHz for the scalar unit. Each SSP is capable of 3.2 GFlops/sec for 64-bit operations.
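As a quick cross-check of the peak rates quoted above, the following sketch rebuilds them from the vector clock rate and the unit counts. The factor of two flops per vector unit per cycle (one multiply plus one add) is an assumption introduced here to reconcile the figures; it is not stated in the text. The constant and function names are illustrative only.

```python
# Peak-rate arithmetic for one SSP and one MSP of the Cray X1.
VECTOR_CLOCK_MHZ = 800          # from the text: vector units run at 800 MHz
VECTOR_UNITS_PER_SSP = 2        # from the text: two vector units per SSP
SSPS_PER_MSP = 4                # from the text: four SSPs per MSP
FLOPS_PER_UNIT_PER_CYCLE = 2    # ASSUMPTION: one multiply + one add per cycle


def ssp_peak_mflops():
    """Peak 64-bit rate of one SSP, in MFlops/sec."""
    return VECTOR_UNITS_PER_SSP * FLOPS_PER_UNIT_PER_CYCLE * VECTOR_CLOCK_MHZ


def msp_peak_mflops():
    """Peak 64-bit rate of one MSP (four SSPs), in MFlops/sec."""
    return SSPS_PER_MSP * ssp_peak_mflops()


if __name__ == "__main__":
    print("SSP peak:", ssp_peak_mflops() / 1000, "GFlops/sec")
    print("MSP peak:", msp_peak_mflops() / 1000, "GFlops/sec")
```

Under that assumption the sketch reproduces the 3.2 GFlops/sec per SSP and 12.8 GFlops/sec per MSP given in the text.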
The four SSPs share a 2 MB "Ecache". The primary strategy for utilizing the eight vector units of a single MSP is parallelization ("streaming") across outer loops and vectorization within inner loops. However, Cray does support the option of treating each SSP as a separate processor. The Ecache has sufficient single-stride bandwidth to saturate the vector units of the MSP. The Ecache is needed because the bandwidth to main memory is not enough to saturate the vector units without data reuse: memory bandwidth is roughly half the saturation bandwidth.

Four MSPs and a flat, shared memory of 16 GB form a Cray X1 node. The memory banks of a node provide 200 GB/s of bandwidth, enough to saturate the paths to the local MSPs and service requests from remote MSPs. Each bank of shared memory is connected to a number of banks on remote nodes, with an aggregate bandwidth of roughly 50 GB/s between nodes. Measured against the node's peak computation rate (4 x 12.8 = 51.2 GFlops/sec), this is roughly one byte of interconnect bandwidth per flop, compared to 0.25 bytes per flop on the Earth Simulator and less than 0.1 bytes per flop expected on an IBM p690 with the maximum number of Federation connections. The collected nodes of an X1 have a single system image.

Experiment Overview

Results are presented in graphs that plot the change in processor performance (MFlops/second/processor) as a function of horizontal resolution for a fixed number of vertical levels (1, 18, or 66), or as a function of vertical resolution for a fixed horizontal resolution (T42 or T85). The indices for the computational arrays are (longitude, vertical, latitude), while most of the data dependencies are in the longitude and latitude directions. In consequence, the code spends more time going to memory for larger numbers of vertical levels.

Eight horizontal resolutions were used:
Our experiments describe the performance of a
version of PSTSWM that was ported to the Cray X1 in April 2003.
Profiling was used to identify the routines that were most important
to optimize, and numerous experiments involving a variety of index and
loop orderings and compiler directives were used to improve the
performance of these routines. All changes were local to these
routines, and the global data structures were not altered.
After these optimizations, the same routines were still the most
important, but performance was improved significantly.
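
The effect of the (longitude, vertical, latitude) index order described in this section can be illustrated with a small stride calculation. This is a sketch under stated assumptions, not code from PSTSWM: it assumes Fortran column-major storage with the first index contiguous, 8-byte reals, arrays dimensioned exactly to the grid with no padding, and the conventional 128 x 64 (longitude x latitude) grid for T42. The function names are hypothetical.

```python
# Memory distances in a column-major array dimensioned (nlon, nver, nlat).
WORD_BYTES = 8                   # ASSUMPTION: 64-bit reals
NLON_T42, NLAT_T42 = 128, 64     # ASSUMPTION: conventional T42 grid size
ECACHE_BYTES = 2 * 1024 * 1024   # from the text: 2 MB Ecache per MSP


def neighbour_strides(nlon, nver, nlat):
    """Byte distance between adjacent elements along each index."""
    return {
        "longitude": WORD_BYTES,                 # first index is contiguous
        "vertical": WORD_BYTES * nlon,           # skip one full longitude row
        "latitude": WORD_BYTES * nlon * nver,    # skip one longitude-vertical slab
    }


def field_bytes(nlon, nver, nlat):
    """Total size of one field array, in bytes."""
    return WORD_BYTES * nlon * nver * nlat


if __name__ == "__main__":
    for nver in (1, 18, 66):
        strides = neighbour_strides(NLON_T42, nver, NLAT_T42)
        size = field_bytes(NLON_T42, nver, NLAT_T42)
        print(f"{nver:2d} levels: latitude stride {strides['latitude']:6d} bytes, "
              f"field {size:8d} bytes, fits in Ecache: {size <= ECACHE_BYTES}")
```

Under these assumptions, the distance between latitude neighbours grows in proportion to the number of vertical levels, from 1024 bytes at 1 level to 67584 bytes at 66 levels, and a single T42 field at 66 levels is already larger than the 2 MB Ecache. This is consistent with the observation that the code spends more time going to memory as the number of vertical levels increases.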
URL: http://www.csm.ornl.gov/evaluation/PHOENIX/PSTSWM-overview.CRAYX1.html
Updated: Friday, 19-Sep-2003 13:07:55 EDT