Oak Ridge National Laboratory installed a 32-processor Cray X1 on
March 18, 2003. The system grew to 256 processors in October, 2003,
and to 512 processors in May, 2004.
The following results should be read as lower bounds on achievable
performance, because the operating system, compilers, communication
libraries, and math libraries are all undergoing active development. (The
impact of these changes is demonstrated in a number of the following
performance studies.) Please note that the performance of a number of
applications is significantly better in the more recent talks and
papers than in the older ones.
ORNL Papers
Cray X1 Evaluation Status Report
(PDF),
P. A. Agarwal, et al.
in Proceedings of the 46th Cray User Group Conference,
Knoxville, TN, May 17-20, 2004.
Adventures in Vectorizing the Community Land Model
(PDF),
F. M. Hoffman, M. Vertenstein, H. Kitabata, J. B. White III, P. H. Worley, J. B. Drake, M. Cordery,
in Proceedings of the 46th Cray User Group Conference,
Knoxville, TN, May 17-21, 2004. (186380 bytes)
Experience with the Full CCSM
(PDF),
J. B. Drake, P. H. Worley, I. Carpenter, M. Cordery,
in Proceedings of the 46th Cray User Group Conference,
Knoxville, TN, May 17-21, 2004. (192691 bytes)
GYRO: Analyzing New Physics in Record Time
(PDF),
M. R. Fahey, J. Candy,
in Proceedings of the 46th Cray User Group Conference,
Knoxville, TN, May 17-20, 2004.
The Performance Evolution of the Parallel Ocean Program on the Cray X1
(PDF),
P. H. Worley, J. Levesque,
in Proceedings of the 46th Cray User Group Conference,
Knoxville, TN, May 17-20, 2004.
Cray X1 Evaluation Status Report
(PDF),
P. A. Agarwal, R. A. Alexander, E. Apra, S. Balay, A. S. Bland, J. Colgan, E. F. D'Azevedo, J. J. Dongarra, T. H. Dunigan, Jr.,
M. R. Fahey, R. A. Fahey, A. Geist, M. Gordon, R. J. Harrison, D. Kaushik, M. Krishnakumar, P. Luszczek, B. Messer, A. Mezzacappa,
J. A. Nichols, J. Nieplocha, L. Oliker, T. Packwood, M. S. Pindzola, T. C. Schulthess, J. S. Vetter, J. B. White, III, T. L. Windus,
P. H. Worley, T. Zacharia,
ORNL Technical Report ORNL/TM-2004/13,
January, 2004.
Early Evaluation of the Cray X1
(PDF),
T. H. Dunigan, Jr., M. R. Fahey, J. B. White III, P. H. Worley,
in Proceedings of the ACM/IEEE Conference on High Performance
Networking and Computing (SC03),
Phoenix, AZ, November 15-21, 2003.
Early Operations Experience with the Cray X1 at the Oak Ridge National Laboratory Center for Computational Sciences
(PDF),
A. S. Bland, R. Alexander, S. M. Carter, K. D. Matney,
in Proceedings of the 45th Cray User Group Conference,
Columbus, OH, May 12-16, 2003.
DOE Ultrascale Evaluation Plan of the Cray X1
(PDF),
M. R. Fahey, J. B. White III,
in Proceedings of the 45th Cray User Group Conference,
Columbus, OH, May 12-16, 2003.
An Optimization Experiment with the Community Land Model on
the Cray X1
(PDF),
J. B. White III,
in Proceedings of the 45th Cray User Group Conference,
Columbus, OH, May 12-16, 2003.
Early Evaluation of the Cray X1 at Oak Ridge National Laboratory
(PDF),
P. H. Worley, T. H. Dunigan,
in Proceedings of the 45th Cray User Group Conference,
Columbus, OH, May 12-16, 2003.
Cray X1 Evaluation
(PDF),
A. S. Bland, et al.,
ORNL Technical Report ORNL/TM-2003/67,
March, 2003.
ORNL Presentations
GYRO: Analyzing New Physics in Record Time
(PDF),
M. R. Fahey, J. Candy,
46th Cray User Group Conference,
Knoxville Marriott,
Knoxville, Tennessee,
May 20, 2004.
A Progress Report on the Cray X1 Evaluation by CCS at ORNL
(PDF),
J. S. Vetter,
46th Cray User Group Conference,
Knoxville Marriott,
Knoxville, Tennessee,
May 18, 2004.
The Performance Evolution of the Parallel Ocean Program on the Cray X1
(HTML, PDF),
P. H. Worley, J. Levesque,
46th Cray User Group Conference,
Knoxville Marriott,
Knoxville, Tennessee,
May 18, 2004.
Cray X1 Optimization: A Customer's Perspective
(HTML, PDF),
P. H. Worley,
46th Cray User Group Conference,
Knoxville Marriott,
Knoxville, Tennessee,
May 18, 2004.
Cray X1 Evaluation: Overview and Scalability Analysis
(HTML),
P. H. Worley,
SIAM Conference on Parallel Processing for Scientific Computing 2004,
Hyatt at Fisherman's Wharf,
San Francisco, California,
February 26, 2004.
(Minor variations of this talk were also given at the
ANL/ORNL Site Visit of the National Research Council/National Academies
Computer Science and Telecommunications Board,
Argonne National Laboratory,
Argonne, Illinois,
March 2, 2004
and at the
Cray X1 Review,
Oak Ridge National Laboratory,
Oak Ridge, Tennessee,
February 10, 2004.)
Early Evaluation of the Cray X1
(HTML),
P. H. Worley,
SC2003,
Phoenix Convention Center,
Phoenix, Arizona,
November 19, 2003.
Scalable Supercomputer Solving Superconductivity
(PDF),
J. B. White III,
SC2003,
Cray Exhibit Booth,
Phoenix Convention Center,
Phoenix, Arizona,
November 19, 2003.
Early Evaluation of the Cray X1 - Part 1.5
(HTML),
P. H. Worley,
SC2003,
Cray Exhibit Booth,
Phoenix Convention Center,
Phoenix, Arizona,
November 18, 2003.
Latest Performance Results from ORNL: Cray X1 and SGI Altix
(HTML),
P. H. Worley,
System and Application Performance Workshop,
2003 LACSI Symposium,
Eldorado Hotel,
Santa Fe, New Mexico,
October 27, 2003.
(A minor variation of this talk was also given at the
Computer Science Department, University of Tennessee,
Knoxville, Tennessee,
January 9, 2004.)
CCSM Component Performance Benchmarking and Status of the Cray X1 at ORNL
(HTML),
P. H. Worley,
Computing in the Atmospheric Sciences Workshop 2003,
L'Imperial Palace Hotel,
Annecy, France,
September 10, 2003.
Early Operations Experience with the Cray X1 at the Oak Ridge National Laboratory Center for Computational Sciences
(PDF),
A. S. Bland, R. Alexander, S. M. Carter, K. D. Matney,
45th Cray User Group Conference,
Hyatt on Capital Square,
Columbus, Ohio,
May 12-16, 2003.
DOE Ultrascale Evaluation Plan of the Cray X1
(PDF),
M. R. Fahey, J. B. White III,
45th Cray User Group Conference,
Hyatt on Capital Square,
Columbus, Ohio,
May 12-16, 2003.
An Optimization Experiment with the Community Land Model on
the Cray X1
(PDF),
J. B. White III,
45th Cray User Group Conference,
Hyatt on Capital Square,
Columbus, Ohio,
May 12-16, 2003.
Early Evaluation of the Cray X1 at Oak Ridge National Laboratory
(HTML),
P. H. Worley, T. H. Dunigan,
45th Cray User Group Conference,
Hyatt on Capital Square,
Columbus, Ohio,
May 13, 2003.
Other Papers, Presentations, and Data on Cray X1 Performance
While subject to some interpretation, the PSTSWM results indicate
the following.
Running the code without modifying it for vectorization gave
poor performance (never more than 400 MFlops/sec, and typically
less than 250 MFlops/sec).
Modest modifications were sufficient to achieve 4.0 GFlops/sec
in the best case. If the vertical dimension is specified at compile
time, the best performance increases to 6.0 GFlops/sec. The Fourier
transform (coded in Fortran) achieved only 25% of peak. The performance
of the Legendre transform (coded in Fortran) increases with problem size,
reaching better than 50% of peak for the largest problem sizes when
the vertical dimension is specified at compile time. These two transforms
are the performance-critical operations.
Performance increases with problem size, in both horizontal and
vertical resolution.
Running instances of the code on all processors of
the SMP node simultaneously showed almost no performance degradation,
unlike all other systems for which we have data.
Assigning processes to SSP processors directly instead of
allowing the compiler to assign work within an MSP has some
performance advantage in some cases, but the compiler does a
reasonable job overall. The PSTSWM data does not indicate that using
SSPs directly is a useful optimization strategy.
System comparisons using a small climate-size problem resolution (T42L18):
A single MSP processor in the Cray X1 is 2.8 times faster
than a single processor in an IBM p690 and 10% faster than the SX-6.
A 4-processor X1 SMP node has 27% less throughput
than a 32-processor p690 when making simultaneous serial runs.
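As a rough cross-check of the PSTSWM rates quoted above, the following minimal Python sketch expresses each rate as a percentage of peak. It rests on two assumptions that are not stated in this text: that the quoted rates are for a single MSP, and that the theoretical peak of one Cray X1 MSP is 12.8 GFlops/sec. The names ASSUMED_MSP_PEAK_GFLOPS, quoted_rates, and percent_of_peak are illustrative only.

```python
# Assumption (not stated in the text above): one Cray X1 MSP has a
# theoretical peak of 12.8 GFlops/sec, and the quoted rates are per MSP.
ASSUMED_MSP_PEAK_GFLOPS = 12.8

# Rates quoted in the PSTSWM summary, converted to GFlops/sec.
quoted_rates = {
    "unmodified code, upper bound": 0.400,
    "unmodified code, typical upper bound": 0.250,
    "modified code, best case": 4.0,
    "modified code, vertical dimension set at compile time": 6.0,
}


def percent_of_peak(rate_gflops, peak_gflops=ASSUMED_MSP_PEAK_GFLOPS):
    """Return a rate as a percentage of the assumed peak rate."""
    return 100.0 * rate_gflops / peak_gflops


for label, rate in quoted_rates.items():
    pct = percent_of_peak(rate)
    print(f"{label}: {rate:.3f} GFlops/sec = {pct:.1f}% of assumed peak")
```

Under these assumptions, the whole-code best case lands a little below the "better than 50% of peak" reported for the Legendre transform alone, and well above the 25% reported for the Fourier transform.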
Running the POP code without modifications achieved
performance similar to that on the IBM p690 cluster, but this was over 5 times
slower than the optimized code on the Cray X1.
The optimizations used to port POP to the Earth Simulator work
reasonably well on the X1. Using Co-Array Fortran to decrease the
latency in latency-sensitive algorithms improves performance
further. Modifying the Earth Simulator optimizations to take into
account the Cray X1 architecture should also improve performance.
The latter work is ongoing.
System comparisons using the one degree benchmark problem that
corresponds to how POP is used in coupled climate simulations:
POP on the Cray X1 is more than 5 times faster than the fastest
of the nonvector systems examined (the IBM p690 cluster) for the
same number of processors.
POP on the Cray X1 is 7% slower than POP on the Earth Simulator
for the same number of processors. As optimization is ongoing,
this is likely to change over time.