ORNL Opteron Evaluation


.... this is work in progress.... last revised Tuesday, 01-Feb-2005 11:58:28 EST

Revised for 1.8 GHz Opteron 4/9/04. Results for older opteron 1.6GHz

The results presented here are from standard benchmarks and some custom benchmarks and, as such, represent only one part of the evaluation. An IBM SP4, IBM Winterhawk II (noted as SP3 in the following tables and graphs), Cray X1, SGI Altix (Itanium 2), and Compaq Alpha ES40 at ORNL were used for comparison with the Opteron. The results below are in the following categories: architecture, benchmarks, memory performance, low-level benchmarks, and shared-memory benchmarks.

ARCHITECTURE

The Opteron system has 4 CPUs (1.8 GHz) and 16 GB of memory. The X1 at ORNL has 8 nodes. Each node has 4 MSPs, each MSP has 4 SSPs, and each SSP has two vector units. The Cray "processor/CPU" in the results below is one MSP. All 4 MSPs on a node share memory. The Power4 consists of one node with 32 processors sharing memory. Both the Alpha and SP3 consist of four processors sharing memory on a single node. All memory is accessible on the Altix. The following table summarizes the main characteristics of the machines.

Specs:        Alpha SC   SP3        SP4        X1           Opteron    Altix
MHz           667        375        1300       800          1800       1500
memory/node   2GB        2GB        32GB       16GB         16GB       512GB
L1            64K        64K        32K        16K          64K        32K
L2            8MB        8MB        1.5MB      2MB          1MB        256K
L3                                  128MB                              6MB
peak Mflops   2*MHz      4*MHz      4*MHz      12.8 Gflops  2*MHz      4*MHz
peak mem BW   5.2 GB/s   1.6 GB/s   200+ GB/s  ?200+ GB/s   5.3 GB/s   6.4 GB/s

The Alpha has 2 buses @ 2.6 GB/s each. X1 memory bandwidth is 34 GB/s/CPU.

For the Alpha, nodes are interconnected with a Quadrics switch organized as a fat tree. The SP nodes are interconnected with cross-bar switches in an Omega-like network. The X1 uses a modified 2-D torus.

BENCHMARKS

We have used widely available benchmarks in combination with our own custom benchmarks to characterize the performance of the machines. Some of the older benchmarks may need to be modified for these newer, faster machines: repetitions must be increased to avoid 0 elapsed times, and problem sizes must be increased to test out-of-cache performance. Unless otherwise noted, the following compiler switches were used.

opteron: -O3 (pgf90 v5.1)
X1:      -Oaggress,stream2 (aprun -n xxx -p 64k:16m a.out)
Alpha:   -O4 -fast -arch ev6
SP:      -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000

Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP. We also compared performance with the vendor runtime libraries, sci (X1), cxml (Alpha), and essl (SP). The benchmarks used in our tests include stream, MAPS, tabletoy, hint, LMbench, EuroBen, and FFTW.

For both the Alpha and the SP, gettimeofday() provides microsecond wall-clock time (though one has to be sure the MICROTIME option is set in the Alpha OS kernel). Both have high-resolution cycle counters as well, but the Alpha cycle counter is only 32 bits, so it rolls over in less than 7 seconds. For distributed benchmarks (MPI), the IBM and Alpha systems provide a hardware-synchronized MPI_Wtime() with microsecond resolution. On the Alpha, MPI_Wtime is frequency synchronized, but initial offsets are only approximate. (On the Alpha, it appears MPI_Init tries to provide an initial zero offset to the Elan counters on each node when an MPI job starts. On the SP3, we discovered several nodes that were not synchronized; a patch was eventually provided.) Time is not synchronized on the X1.
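The timing remarks above can be illustrated with a small C sketch. It is not the code used for these measurements; the function names wall_seconds, rollover_seconds, and ticks_elapsed are hypothetical. It shows a gettimeofday() wall-clock timer, the arithmetic behind the "less than 7 seconds" rollover of a 32-bit counter at 667 MHz, and how unsigned subtraction tolerates a single wrap.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Wall-clock time in seconds, with the microsecond resolution
 * that gettimeofday() reports. */
double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

/* Seconds until a counter of `bits` bits (bits < 64) wraps around
 * when it ticks at clock_hz. For 32 bits at 667 MHz this is about
 * 6.4 seconds. */
double rollover_seconds(unsigned bits, double clock_hz)
{
    return (double)((uint64_t)1 << bits) / clock_hz;
}

/* Ticks elapsed between two reads of a 32-bit counter. Unsigned
 * arithmetic is modulo 2^32, so the result is still correct if the
 * counter wrapped once between the two reads. */
uint32_t ticks_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

A benchmark would read wall_seconds() before and after the timed region. Intervals longer than the rollover period cannot be measured with the 32-bit counter alone.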

MEMORY PERFORMANCE

The stream benchmark is a program that measures main memory throughput for several simple operations. The aggregate data rate for multiple threads is reported in the following table. Recall that the "peak" memory data rate for the X1 is 200 GB/s, for the Alpha is 5.2 GB/s, and for the SP3 is 1.6 GB/s. Data for the 16-way SP3 (375 MHz, Nighthawk II) is included too. Data for the Alpha ES45 (1 GHz) is obtained from the streams data base. Data for the p690/sp4 is with affinity enabled (6/1/02). The X1 uses (aprun -A). The Opteron is supposed to have greater than 5.3 GB/s/cpu memory bandwidth, but we do not see that yet.

MB/s        copy    scale   add     triad
opteron     1975    1747    1945    2018
 2 cpus     2539    2623    2892    3117
 4 cpus     4848    5000    5714    6316
altix       3214    3169    3800    3809
X1          22111   21634   23658   23752
alpha1      1339    1265    1273    1383
es45-1      1946    1941    1978    1978
SP3 1       523     561     581     583
SP3/16-1    486     494     601     601
SP4-1       1774    1860    2098    2119

From AMD's published spec benchmark and McCalpin's suggested conversion of 171.swim results to triad memory bandwidth, we get 2.7 GB/s memory bandwidth for one Opteron processor.
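For readers unfamiliar with how a stream-style figure is obtained, the following C sketch times the triad kernel and converts the result to MB/s. It is an illustration, not the stream source: the names now_seconds and triad_mbs are hypothetical. The 24 bytes per iteration follow the usual stream convention of counting two 8-byte loads and one 8-byte store.

```c
#include <stddef.h>
#include <sys/time.h>

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

/* Run the triad kernel a[i] = b[i] + q*c[i] `reps` times over n
 * doubles and return the best observed rate in MB/s (10^6 bytes/s).
 * Returns 0.0 if every repetition finished below the timer
 * resolution, which is the "0 elapsed time" problem on fast machines. */
double triad_mbs(double *a, const double *b, const double *c,
                 double q, size_t n, int reps)
{
    double best = 0.0;

    for (int r = 0; r < reps; r++) {
        double t0 = now_seconds();
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + q * c[i];
        double dt = now_seconds() - t0;

        if (dt > 0.0) {
            /* Two loads and one store of 8 bytes each per iteration. */
            double rate = 24.0 * (double)n / dt / 1.0e6;
            if (rate > best)
                best = rate;
        }
    }
    return best;
}
```

To measure main memory rather than cache, the arrays should be much larger than the largest cache listed in the architecture table.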

The MAPS benchmark also characterizes memory access performance. Plotted are load/store bandwidth for sequential (stride 1) and random access.

The tabletoy benchmark (C) makes random writes of 64-bit integers in a shared memory; parallelization is permitted with possibly non-coherent updates. The X1 number is for vectorizing the inner loop (multistreaming was an order of magnitude slower, 88 MB/s). The data rate in the following table is for a 268 MB table. We include multi-threaded altix, opteron, sp3 (NERSC), and sp4 data as well. Revised 2/9/04

MB/s (using wallclock time)
sp4-1    26    altix-1    42    X1-msp-1  1190    opteron-1   36    sp3-1     8
sp4-2    47    altix-2    45                      opteron-2   65    sp3-2    26
sp4-4    98    altix-4    62                      opteron-4  102    sp3-4    53
sp4-8   174    altix-8    86                                        sp3-8    90
sp4-16  266    altix-16   69                                        sp3-16  139
sp4-32  322    altix-32   77
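The access pattern that tabletoy stresses can be sketched in a few lines of C. This is not the tabletoy source. The xorshift generator, the XOR update, and the names next_random, random_updates, and update_mbs are all assumptions chosen for illustration, and the table length is assumed to be a power of two so that a mask can replace a modulo.

```c
#include <stddef.h>
#include <stdint.h>

/* One step of a xorshift64 pseudo-random generator.
 * The state must be nonzero. */
static uint64_t next_random(uint64_t x)
{
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return x;
}

/* Apply nupdates random 64-bit updates to a table whose length is a
 * power of two. Each update touches an unpredictable location, so
 * nearly every access misses in cache once the table is large.
 * Returns the final generator state. */
uint64_t random_updates(uint64_t *table, size_t table_len,
                        size_t nupdates, uint64_t seed)
{
    uint64_t x = seed ? seed : 1;

    for (size_t i = 0; i < nupdates; i++) {
        x = next_random(x);
        table[x & (table_len - 1)] ^= x;
    }
    return x;
}

/* Data rate in MB/s, counting 8 bytes per update. */
double update_mbs(size_t nupdates, double seconds)
{
    return 8.0 * (double)nupdates / seconds / 1.0e6;
}
```

Because XOR is its own inverse, replaying the same update sequence restores the table, which gives a cheap correctness check even when threads race on updates in a parallel version.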

The hint benchmark measures computation and memory efficiency as the problem size increases. (This is C hint version 1, 1994.) The following graph shows the performance of a single processor for the Opteron (176 MQUIPS), X1 (12.2 MQUIPS), Alpha (66.9 MQUIPS), Altix (88.2 MQUIPS), and SP4 (74.9 MQUIPS). The L1 and L2 cache boundaries are visible, as well as the Altix and SP4's L3.

Here are results from LMbench.

LOW LEVEL BENCHMARKS (single processor)

The following table compares the performance of the X1, Alpha, and SP for basic CPU operations. These numbers are from the first few kernels of EuroBen's mod1ac. The 14th kernel (9th degree poly) is a rough estimate of peak FORTRAN performance since it has a high re-use of operands. (Revised 2/9/04)

                alpha   sp3     sp4     X1      opteron Altix
broadcast       516     368     1946    2483    882     2553
copy            324     295     991     2101    591     1758
addition        285     186     942     1957    589     1271
subtraction     288     166     968     1946    589     1307
multiply        287     166     935     2041    590     1310
division        55      64      90      608     223     213
dotproduct      609     655     2059    3459    593     724
X=X+aY          526     497     1622    4134    884     2707
Z=X+aY          477     331     1938    3833    884     2632
y=x1x2+x3x4     433     371     2215    3713    1319    2407
1st ord rec.    110     107     215     48      143     142
2nd ord rec.    136     61      268     46      208     206
2nd diff        633     743     1780    4960    1313    2963
9th deg. poly   701     709     2729    10411   1197    5967

basic operations (Mflops) euroben mod1ac
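The reason the 9th-degree polynomial kernel approaches peak can be seen in a short C sketch. This is not the EuroBen source. Horner's rule and the names poly9 and poly9_mflops are assumptions made for illustration. Each element costs one load and one store but 18 floating-point operations (9 multiplies and 9 adds), so the loop is limited by arithmetic rather than memory.

```c
#include <stddef.h>

/* y[i] = c[0] + c[1]*x[i] + ... + c[9]*x[i]^9, evaluated by
 * Horner's rule. The ten coefficients stay in registers or L1,
 * so operands are reused heavily. */
void poly9(const double c[10], const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double r = c[9];
        for (int k = 8; k >= 0; k--)
            r = r * x[i] + c[k];
        y[i] = r;
    }
}

/* Mflops for n evaluations: 9 multiplies + 9 adds per element. */
double poly9_mflops(size_t n, double seconds)
{
    return 18.0 * (double)n / seconds / 1.0e6;
}
```

By contrast, the copy and addition kernels in the table perform at most one operation per element moved, which is why they track memory bandwidth instead.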

The following table compares the performance of various intrinsics (EuroBen mod1f). For the SP, it also shows the effect of -O4 optimization versus -O3. (Revised 2/9/04)

        alpha   sp3 -O4   sp3 -O3   sp4 -O4   X1      opteron   altix
x**y    8.3     1.8       1.6       7.1       49      6.0       13.2
sin     13      34.8      8.9       64.1      97.9    26.7      22.9
cos     12.8    21.4      7.1       39.6      71.4    17.7      22.9
sqrt    45.7    52.1      34.1      93.9      711     74.6      107
exp     15.8    30.7      5.7       64.3      355     21.8      137
log     15.1    30.8      5.2       59.8      185     31.8      88.5
tan     9.9     18.9      5.5       35.7      85.4    21.8      21.1
asin    13.3    10.4      10.2      26.6      107     13.4      29.2
sinh    10.7    2.3       2.3       19.5      82.6    14.8      19.1

intrinsics (Mcalls/s) euroben mod1f (N=10000)

The following table compares the performance (Mflops) of a simple FORTRAN matrix multiply (REAL*8, 400x400) with the performance of DGEMM from the vendor math library (-lcxml for the Alpha, -lsci for the X1, -lessl for the SP). Note that the SP4 -lessl (3.3) is tuned for the Power4. The Mflops for 1000x1000 Linpack are also reported; they are from netlib, except the sp4 number, which is from IBM. (Revised 2/9/04)

        alpha   sp3     sp4     X1      opteron   altix
ftn     72      45      220     7562    287       228
lib     1182    1321    3174    9482    2610      5222

linpack 1031 1236 2894 3955

The following plot compares the performance of the scientific library DGEMM. We also compare libgoto and AMD's -lacml library. (Revised 2/9/04).
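The "ftn" row in the matrix-multiply table corresponds to a plain triple loop. The C sketch below shows the same computation and the conventional 2n^3 operation count used to convert time to Mflops. It is not the benchmark code, which is FORTRAN. The row-major layout and the names matmul and matmul_mflops are assumptions.

```c
#include <stddef.h>

/* C = A * B for n x n matrices stored in row-major order.
 * No blocking or unrolling: this is the untuned baseline that a
 * library DGEMM is compared against. */
void matmul(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Mflops using the conventional count of 2*n^3 operations
 * (n^3 multiplies and n^3 adds). */
double matmul_mflops(size_t n, double seconds)
{
    double dn = (double)n;
    return 2.0 * dn * dn * dn / seconds / 1.0e6;
}
```

The inner loop strides through b with step n, which is the kind of cache-unfriendly access that tuned libraries avoid by blocking. That is one reason the "lib" row is so much higher than the "ftn" row on the cache-based machines.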

The following plot compares the DAXPY performance of the Opteron and Itanium (Altix).

The following table compares the single processor performance (Mflops) of the Alpha and IBMs for the Euroben mod2g, a 2-D Haar wavelet transform test. (Revised 2/9/04)

|--------------------------------------------------------------------------
|    Order    |   alpha   |   altix   |    SP4    |    X1    | opteron |
|  n1  |  n2  | (Mflop/s) | (Mflop/s) | (Mflop/s) | (Mflop/s)|(Mflop/s)|
|--------------------------------------------------------------------------
|  16  |  16  |  142.56   |   150.4   |  126.42   |   10.5   |   341   |
|  32  |  16  |  166.61   |   192.1   |  251.93   |   13.8   |   383   |
|  32  |  32  |  208.06   |   262.3   |  301.15   |   20.0   |   437   |
|  64  |  32  |  146.16   |   252.7   |  297.26   |   22.7   |   437   |
|  64  |  64  |  111.46   |   242.5   |  278.45   |   25.9   |   387   |
| 128  |  64  |  114.93   |   295.6   |  251.90   |   33.3   |   342   |
| 128  | 128  |  104.46   |   350.2   |  244.45   |   48.5   |   264   |
| 256  | 128  |  86.869   |   211.2   |  179.43   |   45.8   |   198   |
| 256  | 256  |  71.033   |   133.3   |  103.52   |   46.7   |   129   |
| 512  | 256  |  65.295   |   168.7   |  78.435   |   52.1   |    99   |
|--------------------------------------------------------------------------

The following plots the performance (Mflops) of Euroben mod2b, a dense linear system test, for both optimized FORTRAN and using the BLAS from the vendor library (cxml/essl).

The following plots the performance (Mflops) of Euroben mod2d, a dense eigenvalue test, for both optimized FORTRAN and using the BLAS from the vendor library. For the Alpha, -O4 optimization failed, so this data uses -O3.

The following plots the performance (iterations/second) of Euroben mod2e, a sparse eigenvalue test. (Revised 2/9/04)

The following figure shows the FORTRAN Mflops for one processor for various problem sizes for the EuroBen mod2f, a 1-D FFT. Data access is irregular, but cache effects are still apparent. (Revised 2/9/04).

The following compares a 1-D FFT using the FFTW benchmark.

The following graph plots 1-D FFT performance using the vendor library (-lacml, -lscs, -lsci, or -lessl); initialization time is not included. Revised 2/9/04

SHARED-MEMORY BENCHMARKS

Both the Alpha and IBMs consist of a cluster of shared-memory nodes, each node with four processors sharing a common memory (16 for the X1 and 32 for the sp4). The X1 is cache-coherent within a node, but the memory space is global across all nodes. The Opteron shares memory among 4 processors. We tested the performance of a shared-memory node with various C programs with explicit thread calls and with FORTRAN OpenMP codes.

The X1 pthreads model permits up to 16 SSP threads (-h ssp) or 4 MSP threads where each thread can also be multithreaded on each MSP. The following table shows the performance of thread/join in C as the master thread creates two, three, and four threads. The test repeatedly creates and joins threads. Revised 2/24/04.

threads   alpha   sp3    sp4    altix   x1      opteron
2         47.7    96     44     399     29695   41
3         165     152    68     842     55439   82
4         251     222    97     1241    79180   126

thread create/join time in microseconds (C)

Often, it is more efficient to create the threads once, and then provide them work as needed. I suspect this is what FORTRAN OpenMP is doing for "parallel do". The following table is the performance of parallel do.

threads   alpha   sp3    sp4    altix   x1
2         2.1     12.7   6.3    4.8     12.1
3         3.4     15.3   8.4    6.3     13.2
4         5.2     19.5   9.5    6.5     17.4

OpenMP parallel DO (us)
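The thread create/join test described above can be approximated with the following C sketch. It is not the program used for the table. The names empty_worker, now_us, and create_join_us are hypothetical, and the sketch assumes that "n threads" means n threads created by the master in addition to itself.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/time.h>

#define MAX_THREADS 64

/* Worker that does no work, so only create/join overhead is timed. */
static void *empty_worker(void *arg)
{
    return arg;
}

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return 1.0e6 * (double)tv.tv_sec + (double)tv.tv_usec;
}

/* Mean wall-clock time in microseconds for the master to create and
 * then join nthreads threads, averaged over reps repetitions.
 * Returns -1.0 on invalid arguments or if thread creation fails. */
double create_join_us(int nthreads, int reps)
{
    pthread_t tid[MAX_THREADS];

    if (nthreads < 1 || nthreads > MAX_THREADS || reps < 1)
        return -1.0;

    double t0 = now_us();
    for (int r = 0; r < reps; r++) {
        for (int i = 0; i < nthreads; i++)
            if (pthread_create(&tid[i], NULL, empty_worker, NULL) != 0)
                return -1.0;
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);
    }
    return (now_us() - t0) / (double)reps;
}
```

Averaging over many repetitions matters here because a single create/join can be shorter than the microsecond timer resolution on the faster machines.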

The following table shows the time required to lock-unlock using pthread_mutex_lock with various numbers of threads. For the IBMs we use setenv SPINLOOPTIME 5000.

threads   alpha   sp3    sp4    altix   x1     opteron
1         0.26    0.6    0.3    0.07    4.9    0.06
2         1.5     1.4    1.3    2.6     295    11.5
3         17.8    2.1    1.6    41.5    1317   17.5
4         29.6    2.9    3.8    73.2    1703   24.1

time for lock/unlock (us)

The accompanying graph shows the time to lock/unlock a System V semaphore when competing with other processors.
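A contended lock/unlock measurement of the kind tabulated above might look like the following C sketch. It is an illustration under stated assumptions, not the original test. The names locker and lock_unlock_us are hypothetical, and the reported figure is taken to be elapsed time divided by the iterations each thread performs.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/time.h>

#define MAX_THREADS 64

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_count;       /* protected by lock */
static long iters_per_thread;   /* set before threads start */

/* Each thread repeatedly takes the lock, bumps the counter, and
 * releases the lock, so all threads compete for one mutex. */
static void *locker(void *arg)
{
    (void)arg;
    for (long i = 0; i < iters_per_thread; i++) {
        pthread_mutex_lock(&lock);
        shared_count++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Run nthreads competing threads, each doing iters lock/unlock
 * pairs. Stores the final counter in *total and returns the elapsed
 * microseconds per iteration, or -1.0 on error. */
double lock_unlock_us(int nthreads, long iters, long *total)
{
    pthread_t tid[MAX_THREADS];
    struct timeval t0, t1;

    if (nthreads < 1 || nthreads > MAX_THREADS || iters < 1)
        return -1.0;

    shared_count = 0;
    iters_per_thread = iters;

    gettimeofday(&t0, NULL);
    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&tid[i], NULL, locker, NULL) != 0)
            return -1.0;
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    gettimeofday(&t1, NULL);

    *total = shared_count;
    double us = 1.0e6 * (double)(t1.tv_sec - t0.tv_sec)
              + (double)(t1.tv_usec - t0.tv_usec);
    return us / (double)iters;
}
```

The counter doubles as a correctness check: if the mutex works, the final count equals nthreads times iters regardless of how the threads interleave.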

The following table compares the performance of a simple C barrier program using a single lock and spinning on a shared variable along with pthread_yield.

threads   alpha   sp3    sp4    altix   x1     opteron
1         0.25    0.6    0.3    0.08    4.9    0.05
2         1.36    4.4    1.9    0.5     97.7   3.3
3         9.9     20.5   3.1    18.1    95.2   10.6
4         65      34.6   3.7    53.2    99.3   2.4

C barrier times (us)

The following table illustrates linear speedup for an embarrassingly parallel integration. A C code with explicit thread management is compared with FORTRAN OpenMP. Both just used -O optimization. The CRAY C pthread implementation does not scale well for SSPs or MSPs.

          fortran OpenMP                       |  C
threads   alpha   sp3   sp4    altix   x1      |  alpha   sp3   sp4    altix   x1     msp    opteron
1         252     102   251    891     676     |  166     52    216    558     669    2480   311
2         502     204   501    1775    1354    |  331     104   432    1114    1154   2957   621
3         748     306   752    2312    2026    |  496     157   648    1668    1513   2884   921
4         990     408   1002   3519    2695    |  657     206   864    2221    1732   2805   1214
8                       1999   6815    5336    |                1725   4410    1789
16                      3565   12039   10470   |                3429   8580    1254

rectangle rule (Mflops) -O optimization

The following table illustrates an explicit thread implementation of Cholesky factorization of a 1000x1000 double precision matrix in C (-O optimization).

threads   alpha   sp3   sp4    altix   x1     msp    opteron
1         150     125   350    196     525    476    285
2         269     238   631    341     848    733    481
3         369     353   1007   512     1096   942    552
4         435     390   1306   621     797    1087   722

cholp 1k matrix factor (mflops) -O optimization
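The embarrassingly parallel rectangle-rule integration mentioned above can be sketched in C with explicit threads as follows. This is not the benchmark source. The integrand 4/(1+x^2) on [0,1] (whose integral is pi), the midpoint form of the rule, and the names slice, integrate_slice, and integrate_pi are assumptions made for illustration.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64

/* Work description for one thread: rectangles [first, last) out of
 * n total, plus a private slot for the partial result. */
struct slice {
    long first, last;
    long n;
    double sum;
};

static void *integrate_slice(void *arg)
{
    struct slice *s = arg;
    double h = 1.0 / (double)s->n;
    double sum = 0.0;

    for (long i = s->first; i < s->last; i++) {
        double x = ((double)i + 0.5) * h;   /* rectangle midpoint */
        sum += 4.0 / (1.0 + x * x);
    }
    s->sum = sum * h;
    return NULL;
}

/* Integrate 4/(1+x^2) over [0,1] with n rectangles split evenly
 * across nthreads threads. Returns -1.0 on error. */
double integrate_pi(long n, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    struct slice work[MAX_THREADS];
    double total = 0.0;

    if (n < 1 || nthreads < 1 || nthreads > MAX_THREADS)
        return -1.0;

    for (int t = 0; t < nthreads; t++) {
        work[t].first = n * t / nthreads;
        work[t].last = n * (t + 1) / nthreads;
        work[t].n = n;
        work[t].sum = 0.0;
        if (pthread_create(&tid[t], NULL, integrate_slice, &work[t]) != 0)
            return -1.0;
    }
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
        total += work[t].sum;   /* master combines partial sums */
    }
    return total;
}
```

Each thread writes only its own slice, so no locking is needed until the master combines the partial sums after the joins. That independence is why near-linear speedup is expected for this test.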
LINKS

Opteron architecture
NWCHEM DFT performance
AMD opteron benchmarks
AMD's ACML library or here or Opteron library libgoto
Opteron bios and kernel developer's guide
papi
hpet high precision timers and rdtsc timers and dclock
processor for Sandia's Red Storm