ORNL Compaq Alpha / IBM SP evaluation


Most of this data was collected in the summer of 2000.
Last Modified Tuesday, 30-Oct-2001 20:45:05 EST

Oak Ridge National Laboratory (ORNL) is currently performing an in-depth evaluation of the Compaq AlphaServer SC parallel architecture as part of its Evaluation of Early Systems project.

The emphasis of the evaluation is on application-relevant studies for applications of importance to DOE. However, standard benchmarks are still important for comparisons with other systems. The results presented here are from standard benchmarks and some custom benchmarks and, as such, represent only one part of the evaluation. A large IBM SP3 at ORNL was used for comparison with the Alpha in the results presented below. The results below are in the following categories:

ARCHITECTURE

Both the Alpha and the SP are built from nodes of four processors sharing a common memory. The following table summarizes the main characteristics of the two systems.

                    Alpha SC     SP3
    MHz             667          375
    memory/node     2 GB         2 GB
    L1 cache        64 KB        64 KB
    L2 cache        8 MB         8 MB
    peak Mflops     2*MHz        4*MHz
    peak mem BW     5.2 GB/s     1.6 GB/s

(The Alpha's peak memory bandwidth comes from two buses at 2.6 GB/s each.) For the Alpha, nodes are interconnected with a Quadrics switch organized as a fat tree. The SP nodes are interconnected with crossbar switches in an Omega-like network.

BENCHMARKS

We have used widely available benchmarks in combination with our own custom benchmarks to characterize the performance of the Alpha SC cluster. Some of the older benchmarks needed to be modified for these newer, faster machines -- increasing repetition counts to avoid zero elapsed times, and increasing problem sizes to test out-of-cache performance. Unless otherwise noted, the following compiler switches were used on the Alpha and SP.

    Alpha: -O4 -fast -arch ev6
    SP:    -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000

Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP. We also compared performance with the vendor runtime libraries, cxml (Alpha) and essl (SP). The benchmark suites we used -- ParkBench, EuroBen, the NAS parallel benchmarks, hint, stream, and lmbench, plus our own custom tests -- are listed in the Links/References section.

For both the Alpha and the SP, gettimeofday() provides microsecond wall-clock time (though one has to be sure the MICROTIME option is set in the Alpha OS kernel). Both have high-resolution cycle counters as well, but the Alpha cycle counter is only 32 bits, so it rolls over in less than 7 seconds at 667 MHz. For distributed (MPI) benchmarks, both systems provide a hardware-synchronized MPI_Wtime() with microsecond resolution. On the Alpha, MPI_Wtime is frequency synchronized, but initial offsets are only approximate; it appears MPI_Init tries to provide an initial zero offset to the Elan counters on each node when an MPI job starts. On the SP, we discovered several nodes that were not synchronized; a patch was eventually provided.
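A minimal sketch of this kind of wall-clock timing in C (the repetition count and the dummy work loop are placeholders, not any particular benchmark's source):

    #include <stdio.h>
    #include <sys/time.h>

    /* wall-clock seconds from gettimeofday(); microsecond resolution on
       both systems (requires the MICROTIME option in the Alpha kernel) */
    static double walltime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    int main(void)
    {
        volatile double x = 0.0;
        int i, reps = 1000000;   /* enough repetitions to avoid 0 elapsed time */
        double t0, t1;

        t0 = walltime();
        for (i = 0; i < reps; i++)
            x += 1.0e-6;
        t1 = walltime();
        printf("%d reps in %g s (%g us each)\n", reps, t1 - t0,
               1.0e6 * (t1 - t0) / reps);
        return 0;
    }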

LOW LEVEL BENCHMARKS

The following table compares the performance of the Alpha and SP for basic CPU operations. These numbers are from the first 14 kernels of EuroBen's mod1ac. The 14th kernel is a rough estimate of peak FORTRAN performance since it has a high re-use of operands.

    operation            alpha     sp
    broadcast             516     368
    copy                  324     295
    addition              285     186
    subtraction           288     166
    multiply              287     166
    division               55      64
    dot product           609     655
    X = X + aY            526     497
    Z = X + aY            477     331
    y = x1*x2 + x3*x4     433     371
    1st order rec.        110     107
    2nd order rec.        136      61
    2nd difference        633     743
    9th degree poly.      701     709
    basic operations (Mflops), EuroBen mod1ac

The following table compares the performance of various intrinsics (EuroBen mod1f). For the SP, it also shows the effect of -O4 optimization versus -O3.

    intrinsic   alpha   sp -O4   sp -O3
    x**y          8.3     1.8      1.6
    sin          13      34.8      8.9
    cos          12.8    21.4      7.1
    sqrt         45.7    52.1     34.1
    exp          15.8    30.7      5.7
    log          15.1    30.8      5.2
    tan           9.9    18.9      5.5
    asin         13.3    10.4     10.2
    sinh         10.7     2.3      2.3
    intrinsics (Mcalls/s), EuroBen mod1f (N=10000)

The following table compares the performance (Mflops) of a simple FORTRAN matrix multiply (REAL*8, 400x400) with that of DGEMM from the vendor math library (-lcxml for the Alpha, -lessl for the SP). Mflops for the 1000x1000 Linpack benchmark, as reported at netlib, are also included.

              alpha      sp
    ftn        71.7      45.2
    lib      1181.5    1320.5
    linpack  1031      1236

In the following graph, the performance of the ATLAS DGEMM is compared with the vendor libraries.
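The gap between compiled FORTRAN loops and the tuned library is easy to reproduce with a sketch like the following C program. The BLAS symbol name is an assumption that varies by platform (dgemm_ with a trailing underscore under Tru64, plain dgemm under AIX), and the hidden string-length arguments of the Fortran interface are omitted; time each call with a helper like walltime() above.

    #define N 400

    /* Fortran BLAS binding for DGEMM (from -lcxml or -lessl); the
       trailing underscore is platform-dependent */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    static double a[N*N], b[N*N], c[N*N];

    /* naive triple loop, column-major so it matches the BLAS call */
    static void matmul_naive(void)
    {
        int i, j, k;
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                for (i = 0; i < N; i++)
                    c[i + j*N] += a[i + k*N] * b[k + j*N];
    }

    int main(void)
    {
        int n = N;
        double one = 1.0, zero = 0.0;
        matmul_naive();                                   /* ~2*N^3 flops */
        dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
        return 0;
    }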

The following table compares optimized FORTRAN performance for EuroBen mod2a, matrix-vector multiplication implemented with dot products (MxV-ddot) and with axpy updates (MxV-axpy).

    --------------------------------------------------------------
    | Problem size |  MxV-ddot (Mflop/s)  |  MxV-axpy (Mflop/s)  |
    |   m  |   n   |   alpha  |    sp     |   alpha  |    sp     |
    --------------------------------------------------------------
    |  100 |  100  |   411.7  |   423.9   |   101.9  |   401.9   |
    |  200 |  200  |   442.3  |   416.8   |   227.4  |   421.1   |
    |  500 |  500  |    66.1  |    18.7   |   205.4  |   411.9   |
    | 1000 | 1000  |    31.8  |    17.1   |   205.6  |   274.5   |
    | 2000 | 2000  |    27.5  |    16.1   |    66.9  |   207.9   |
    --------------------------------------------------------------

The following table compares the single-processor performance (Mflops) of the Alpha and SP for EuroBen mod2g, a 2-D Haar wavelet transform test.

    |-------------------------------------------|
    |    Order    |    alpha    |      SP       |
    |  n1  |  n2  |  (Mflop/s)  |   (Mflop/s)   |
    |-------------------------------------------|
    |  16  |  16  |   142.56    |    79.629     |
    |  32  |  16  |   166.61    |    96.690     |
    |  32  |  32  |   208.06    |   115.43      |
    |  64  |  32  |   146.16    |   108.74      |
    |  64  |  64  |   111.46    |   111.46      |
    | 128  |  64  |   114.93    |   101.49      |
    | 128  | 128  |   104.46    |    97.785     |
    | 256  | 128  |    86.869   |    64.246     |
    | 256  | 256  |    71.033   |    44.159     |
    | 512  | 256  |    65.295   |    41.964     |
    |-------------------------------------------|

The following plots the performance (Mflops) of EuroBen mod2b, a dense linear system test, for both optimized FORTRAN and using the BLAS from the vendor library (cxml/essl).

The following plots the performance (Mflops) of EuroBen mod2d, a dense eigenvalue test, for both optimized FORTRAN and using the BLAS from the vendor library. For the Alpha, -O4 optimization failed, so this data uses -O3.

The following plots the performance (iterations/second) of EuroBen mod2e, a sparse eigenvalue test.

MEMORY PERFORMANCE

Both the SP and the Alpha have 64 KB L1 caches and 8 MB L2 caches. The following figure shows the data rates of simple FORTRAN loops that load (y = y + x(i)), store (y(i) = 1), and copy (y(i) = x(i)) for different vector sizes. Data for four threads is also included.
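The three loops behind that figure are trivial; here is a C rendering (a sketch -- the actual test is FORTRAN, and the benchmark sweeps the vector length n across the cache sizes):

    /* bandwidth kernels corresponding to the figure (sketch):
       load:  y = y + x(i)  -- one 8-byte read per iteration
       store: y(i) = 1      -- one 8-byte write
       copy:  y(i) = x(i)   -- one read plus one write */
    void load_kernel(const double *x, double *sum, int n)
    {
        int i;
        double y = *sum;
        for (i = 0; i < n; i++) y += x[i];
        *sum = y;              /* write back so the loop isn't optimized away */
    }
    void store_kernel(double *y, int n)
    {
        int i;
        for (i = 0; i < n; i++) y[i] = 1.0;
    }
    void copy_kernel(double *y, const double *x, int n)
    {
        int i;
        for (i = 0; i < n; i++) y[i] = x[i];
    }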

At the tail end of the graph above, the program is fetching data from main memory. For load, a single Alpha thread reads data at 1.7 GB/s, the SP at 787 MB/s. With four threads, the per-CPU load rate drops to 811 MB/s on the Alpha and 322 MB/s on the SP; the aggregate rate for 4 CPUs is then 3.2 GB/s for the Alpha compared with 1.3 GB/s for the SP.

The stream benchmark measures main-memory throughput for several simple vector operations. The following table shows the memory data rates for a single processor.

    Function     alpha        sp3
    Copy:      1090.6601    598.9804
    Scale:      997.5083    576.2223
    Add:       1058.0155    770.8110
    Triad:     1133.4106    780.0816
    stream (C) memory throughput, 1 CPU, rate in MB/s

The aggregate data rate for multiple threads is reported in the following table (input arguments: threads*2000000,0,10). Recall that the "peak" data rate is 5.2 GB/s for the Alpha and 1.6 GB/s for the SP.

              copy   scale    add   triad   ddot    x+y
    alpha 1   1339    1265   1273    1383   1376   1115
    alpha 2   1768    1711   1839    1886   1852   1729
    alpha 3   2279    2280   2257    2308   2526   1931
    alpha 4   2375    2323   2370    2427   3098   2125
    SP 1       523     561    581     583   1080    729
    SP 2       686     797    813     909   1262    923
    SP 3       833     805    897     914   1282    942
    SP 4       824     799    889     914   1272    927
    stream (f90/OpenMP) multiple threads (aggregate MB/s)

IBM provided a modified parallel stream.f that allocates the memory a little differently and uses some pre_load IBM directives. This version gets improved performance, as indicated in the following table.

              copy   scale    add   triad
    SP 1       827     799    862     891
    SP 2       869     824    886     926
    SP 3       891     822    878     918
    SP 4       864     809    880     918
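For reference, the stream triad is a one-line loop; an OpenMP rendering in C (a sketch of the idea, not the stream source itself):

    /* stream-style triad, a(i) = b(i) + q*c(i); each iteration moves
       2 reads + 1 write = 24 bytes, split across the OpenMP threads */
    void triad(double *a, const double *b, const double *c, double q, long n)
    {
        long i;
    #pragma omp parallel for
        for (i = 0; i < n; i++)
            a[i] = b[i] + q * c[i];
    }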

The following figure shows the Mflops for one processor for various problem sizes for the EuroBen mod2f, a 1-D FFT. Data access is irregular, but cache boundaries are still apparent.

The hint benchmark measures computation and memory efficiency as the problem size increases. The following graph shows the performance of a single processor for the Alpha (66.9 MQUIPS) and SP (27.3 MQUIPS). The L1 and L2 cache boundaries are visible.

The lmbench benchmark measures various UNIX and system characteristics. Here are some preliminary numbers from runs on service and compute nodes of the Alpha and SP. The cache/memory latencies reported by lmbench are:

              alpha   sp3
    L1          4       5
    L2         27      32
    memory    210     300
    latency in nanoseconds

Open/close times are much slower on the Alpha, though file create/delete times are faster there. EuroBen's mod3a tests matrix computation with file I/O (out of core). The following two tables compare the Alpha and SP. No attempt was made to optimize I/O performance.

    Mod3a: out-of-core matrix-vector multiplication -- Alpha
    --------------------------------------------------------------------------
      Row   | Column | Exec. time | Mflop rate | Read rate | Write rate |
      (n)   |  (m)   |   (sec)    | (Mflop/s)  |  (MB/s)   |   (MB/s)   |
    --------------------------------------------------------------------------
     25000  |  20000 | 0.56200E-01|   17.793   |  153.63   |   33.945   |
     50000  |  20000 | 0.13700    |   14.598   |  117.32   |   35.905   |
    100000  | 100000 | 0.67409    |   29.668   |  141.19   |   35.884   |
    250000  | 100000 | 2.6982     |   18.531   |  117.61   |   35.770   |
    --------------------------------------------------------------------------

    Mod3a: out-of-core matrix-vector multiplication -- SP
    --------------------------------------------------------------------------
     25000  |  20000 | 0.81841    |    1.2219  |  244.76   |   0.27172  |
     50000  |  20000 | 1.6479     |    1.2136  |  244.61   |   0.26217  |
    100000  | 100000 | 1.4766     |   13.544   |  241.12   |   0.84673  |
    250000  | 100000 | 3.6024     |   13.879   |  239.51   |   1.1294   |
    --------------------------------------------------------------------------

Others have made more rigorous tests of the regular and parallel file systems.

SHARED-MEMORY BENCHMARKS

Both the Alpha and SP are clusters of shared-memory nodes, each node with four processors sharing a common memory. We tested the performance of a shared-memory node with C programs using explicit thread calls and with FORTRAN OpenMP codes.

The following table shows the performance of thread create/join in C as the master thread creates two, three, or four threads. The test repeatedly creates and joins threads.

    threads   alpha    SP
      2        47.7    96
      3       165     152
      4       251     222
    thread create/join time in microseconds (C)

Often it is more efficient to create the threads once and then hand them work as needed; presumably this is what FORTRAN OpenMP does for a "parallel do". The following table shows the performance of parallel do.

    threads   alpha    SP
      2        2.1    12.7
      3        3.4    15.3
      4        5.2    19.5
    OpenMP parallel DO (us)

Notice that the performance is much better than that of the explicit thread calls.
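The create/join test is essentially the following loop in C (a sketch; error checking omitted, and the function names are ours):

    #include <pthread.h>

    static void *worker(void *arg)
    {
        return arg;                 /* thread body does no work */
    }

    /* one create/join cycle for nthreads; the benchmark repeats this
       many times and reports the average time per cycle */
    void create_join_once(int nthreads)
    {
        pthread_t tid[4];
        int i;
        for (i = 0; i < nthreads; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);
    }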

The following table shows the time required to lock/unlock using pthread_mutex_lock with various numbers of threads.

    threads   alpha     sp
      1        0.26    0.57
      2        1.5     1.7
      3       17.8     7.6
      4       29.6    15.6
    time for lock/unlock (us)

The following table shows the performance of a simple C barrier built from a single lock and spinning on a shared variable along with pthread_yield (a sketch of such a barrier appears after the tables below). A version based on condition variables was an order of magnitude slower.

    threads   alpha     sp
      1        0.25    0.6
      2        1.36    4.4
      3        9.9    20.5
      4       65     353
    C barrier times (us)

The following table illustrates linear speedup for an embarrassingly parallel integration. A C code with explicit thread management is compared with FORTRAN OpenMP; both used just -O optimization.

              fortran           C
    threads  alpha   SP    alpha   SP
      1       252   102     166    52
      2       502   204     331   104
      3       748   306     496   157
      4       990   408     657   206
    rectangle rule (Mflops), -O optimization

The following table illustrates an explicit-thread implementation in C of Cholesky factorization of a 1000x1000 double precision matrix (-O optimization).

    threads   alpha    sp
      1        150    125
      2        269    238
      3        369    353
      4        435    390
    cholp 1000x1000 factorization (Mflops), -O optimization

The following table compares FORTRAN OpenMP on the Alpha and SP for a simple Jacobi iteration. Note that the SP slows down at 4 threads.

    problem size    10K           250K          1M
    threads       alpha    sp   alpha    sp   alpha   sp
      1            4308  3656     175   114     27   17
      2            8262  5707     342   284     42   27
      3           11603  7048     503   421     50   41
      4           14109  4690     655   324     61   41
    iterations per second
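As promised above, a sketch of the single-lock spin barrier (initialize the mutex, count = 0, phase = 0, and nthreads before use; sched_yield() is the POSIX spelling of pthread_yield()):

    #include <pthread.h>
    #include <sched.h>

    /* counter-based spin barrier with one lock, as in the test above;
       'phase' flips each generation so the barrier can be reused */
    typedef struct {
        pthread_mutex_t lock;
        volatile int count;
        volatile int phase;
        int nthreads;
    } spin_barrier_t;

    void barrier_wait(spin_barrier_t *b)
    {
        int my_phase = b->phase;
        pthread_mutex_lock(&b->lock);
        if (++b->count == b->nthreads) {   /* last arrival resets and flips */
            b->count = 0;
            b->phase = !my_phase;
        }
        pthread_mutex_unlock(&b->lock);
        while (b->phase == my_phase)       /* spin on the shared variable */
            sched_yield();                 /* keep spinners from starving others */
    }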

MESSAGE-PASSING BENCHMARKS

Internode communication can be accomplished with IP, PVM, or MPI. We report MPI performance over the Alpha Quadrics network and the IBM SP switch. Each node (4 CPUs) shares a single network interface; however, each CPU is a distinct MPI endpoint, so one can measure both inter-node and intra-node communication. The following tables summarize the measured communication characteristics of the Alpha and the SP.

                                  alpha    sp3
    latency (1-way, us)            5.4     16.3
    bandwidth (echo, MB/s)        199      139
    bandwidth (exchange, MB/s)    167      180
    MPI within a node (MB/s)      622      512

    latency (min, 1-way, us) and bandwidth (MB/s)

                          latency    bandwidth
    alpha node               5.5        198
    alpha cpu                5.8        623
    alpha IP-sw            123           77
    alpha IP-gigE/1500      76           44
    alpha IP-100E           70           11
    sp node                 16.3        139
    sp cpu                   8.1        512
    sp IP-sw                82           46
    sp IP-gigE/1500         91           47
    sp IP-gigE/9000        136           84
    sp IP-100E              93           12
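The latency and echo-bandwidth numbers come from a standard ping-pong between two MPI ranks; a minimal sketch (a real run sweeps message sizes and reports minima over many repetitions):

    #include <mpi.h>
    #include <stdio.h>

    static char buf[1 << 20];       /* large enough for bandwidth sizes */

    int main(int argc, char **argv)
    {
        int rank, i, reps = 1000, nbytes = 8;   /* 8 bytes: latency test */
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("round trip %g us, one-way latency ~ %g us\n",
                   1e6 * (t1 - t0) / reps, 0.5e6 * (t1 - t0) / reps);
        MPI_Finalize();
        return 0;
    }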

For comparison, the communication performance of the Alpha and SP3 is plotted in the following graph against data from Dongarra and Dunigan, "Message-Passing Performance of Various Computers" (1997). We also measured the shmem_put latency between two Alpha nodes to be 3.2 microseconds for 4 bytes.

The following graph shows the bandwidth for communication between two nodes using MPI. Data is from both EuroBen's mod1h and ParkBench comms1.

The following graph shows bandwidth for communication between two processors on the same node using MPI. The SP performs better for smaller messages.

In the following, we plot Alpha intra-node bandwidth against inter-node bandwidth. Notice that for small message sizes on the Alpha, it is faster to pass messages between nodes than between two CPUs on the same node.

We also measured the jitter in round-trip times on the Alpha and SP between neighboring nodes and distant nodes. The Alpha shows less variation in round-trip times than the SP. A test between node 1 and node 150 on the SP also shows slightly longer round-trip times and increased jitter. Click here to see a plot of the jitter. Jitter between node 1 and node 64 on our Alpha was not noticeably different from that between two adjacent nodes.

We measured the bidirectional bandwidth between nodes (and within a node) using MPI_Sendrecv; MPI_Irecv produced the same performance. The interconnect fabrics of the Alpha and the SP support full bidirectional bandwidth, but as illustrated in the following graph, software and the NICs limit measured performance. For the Alpha, the exchange bandwidth (167 MB/s) is less than the unidirectional bandwidth (200 MB/s). As noted before, the Alpha's internode performance surpasses its intranode performance for small messages. For large messages, the SP outperforms the Alpha for internode exchanges.
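The exchange test posts MPI_Sendrecv on both sides at once, so each step moves data in both directions; a sketch:

    #include <mpi.h>

    /* bidirectional exchange between a pair of ranks: both sides send and
       receive nbytes simultaneously, so the fabric carries 2*nbytes per step;
       returns the exchange bandwidth in MB/s */
    double exchange(void *sbuf, void *rbuf, int nbytes, int peer, int reps)
    {
        int i;
        MPI_Status status;
        double t0 = MPI_Wtime();
        for (i = 0; i < reps; i++)
            MPI_Sendrecv(sbuf, nbytes, MPI_BYTE, peer, 0,
                         rbuf, nbytes, MPI_BYTE, peer, 0,
                         MPI_COMM_WORLD, &status);
        return 2.0 * nbytes * reps / (MPI_Wtime() - t0) / 1.0e6;
    }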

We measured the bisection bandwidth between nodes by streaming one-megabyte messages from the lower half of the nodes to the upper half. The following table shows that the SP exhibits some contention as more nodes participate. (Caution: these results have not yet met our quality-assurance standards; there is a lot of variation in the SP numbers when other jobs are running.)

            per node pair        aggregate
    nodes   alpha      sp      alpha      sp
      2     195.8    138.8       196     139
      4     194.2    138.7       388     277
      8     188.2    138         753     552
     16     188.3    132        1560    1056
     32     179.8    130        2877    2080
     48     169.3    128        4063    3072
     56     157.1    112        4399    3136
     64     171.1    124        5475    3968
    average bisection bandwidth (MB/s), preliminary

Since all four processors on a node share one network interface, and since a single processor can saturate that interface, multiple processors sending concurrently off the node each get only a portion of the available bandwidth. Two processors sending to the other two processors on the same node get an aggregate throughput of 727 MB/s on the Alpha (914 MB/s on the SP).

The following tables show the performance of aggregate communication operations (barrier, broadcast, sum-reduction) using one processor per node (N) and all four processors on each node (n). Times are in microseconds.

    mpibarrier (average us)
    cpus   alpha-N   alpha-n   sp-N   sp-n
      2        7        11      22     10
      4        7        16      45     20
      8        8        18      69    157
     16        9        21      93    230
     32       11        28     118    329
     64        -        37     145    419

    mpibcast (8 bytes)
    cpus   alpha-N   alpha-n   sp-N   sp-n
      2      9.6      12.5      5.4    6.7
      4     10.4      20.3      9.4    9.4
      8     11.4      28.5     13.4   17.5
     16     12.5      32.9     17.0   20.9
     32     13.8      41.4     19.3   24.1
     64        -      48.7     23.6   30.8

    mpireduce (SUM, doubleword)
    cpus   alpha-N   alpha-n   sp-N   sp-n
      2        9        11       8      9
      4      190       207      29    133
      8      623       350     271    484
     16     1117       604     683   1132
     32     3176      1991    1613   2193
     64        -      5921    2841   3449
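Each collective was timed with a simple repetition loop; a sketch for the barrier case:

    #include <mpi.h>

    /* average MPI_Barrier time in microseconds over reps iterations */
    double barrier_time(int reps)
    {
        int i;
        double t0;
        MPI_Barrier(MPI_COMM_WORLD);        /* warm up and synchronize */
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        return 1.0e6 * (MPI_Wtime() - t0) / reps;
    }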

PARALLEL KERNEL BENCHMARKS

Both ParkBench and EuroBen (euroben-dm) have MPI-based parallel kernels. However, the euroben-dm communication model has each process do all of its sends before issuing any receives. On the SP, this model resulted in deadlock for the larger problem sizes; the EAGER_LIMIT can be adjusted to make some progress on the SP, but the deadlocks could not be completely eliminated. MPI buffering on the Alpha was adequate: the maximum MPI buffering on an SP node is 64 MBytes, on the Alpha, 191 MBytes.
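The unsafe pattern is easy to illustrate. With every rank sending before receiving, completion depends entirely on MPI buffering the messages; pairing the transfers with MPI_Sendrecv removes that dependence. A sketch (hypothetical helper names, not the euroben-dm source):

    #include <mpi.h>

    /* Unsafe: every rank sends before receiving.  Completes only while
       MPI can buffer the messages (EAGER_LIMIT / buffer space), and
       deadlocks on large problem sizes, as seen on the SP. */
    void unsafe_shift(double *sbuf, double *rbuf, int n, int left, int right)
    {
        MPI_Status status;
        MPI_Send(sbuf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
        MPI_Recv(rbuf, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &status);
    }

    /* Safe: MPI_Sendrecv lets the library pair up the transfers, so the
       code no longer depends on how much MPI is willing to buffer. */
    void safe_shift(double *sbuf, double *rbuf, int n, int left, int right)
    {
        MPI_Status status;
        MPI_Sendrecv(sbuf, n, MPI_DOUBLE, right, 0,
                     rbuf, n, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, &status);
    }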

The following table shows MPI parallel performance of the LU benchmark (64x64x64) for the Alpha and SP. The first column pair uses one processor per node; the second pair uses all four processors per node. These tests used standard FORTRAN (no vendor libraries).

            one CPU per node     four CPUs per node
    CPUs     alpha       sp        alpha       sp
      2      786.08     617.98     762.92     588.16
      4     1708.1     1387.03    1604.05    1188.02
      8     3384.03    2561.97    3265.83    2473.80
     16     6190.89    5593.18    5556.02    4771.66
    LU benchmark, aggregate Mflops

Results for the FT benchmark follow.

    CPUs     alpha       sp        alpha       sp
      4      633        465        580        307
      8     1198        925        849        553
     16     2221       1890       1019       1056
    FT benchmark, aggregate Mflops

Results for the NAS SP benchmark follow.

    CPUs     alpha       sp        alpha       sp
      4      877        632        734        416
      9     2310       1623       1837       1225
     16     4344       2920       3143       2252
    NAS SP benchmark, aggregate Mflops

The following plots the aggregate Mflop performance for ParkBench QR factorization of a 1000x1000 double precision matrix. One can compare the performance of optimized FORTRAN versus the vendor libraries (cxml/essl), and the difference in performance when using all four processors on a node versus just one processor (N) per node.

Links/References

ORNL's Pat Worley's Alpha evaluation, CRM performance on the SP and Alpha (lots of sqrt's), and PSTSWM performance
ORNL CCS pages for Alpha SC and SP3
UT student Jay Patel's results, July 2000
Compaq Alpha ES40 cluster info and EV6 chip paper and Alpha 21264 hardware ref. and compiler writer's guide
Compaq's AlphaServer performance info
Alpha's Quadrics switch or older Meiko fat-tree network
IBM large scale system info white papers
IBM papers on POWER3 and here
RS6000 switch performance and SP2 architecture paper and other sp2 articles
power3 tutorial and IBM SP scientific redbook
AIX/SP thread tuning and POE/AIX environment variables
peak performance for Power 3
power 3 high nodes
IBM's essl scientific library and pessl parallel essl and mass intrinsics and other optimization libraries mass, mpi, lapi, essl
ParkBench or EuroBen or NAS parallel benchmarks or hint
stream benchmark and splash and lmbench and mpio benchmarks
PDS: The Performance Database Server linpack and such
hpl high perf linpack for distributed memory and ATLAS
benchmarks papers
atlas
openmp and NASPB on OpenMP and PBN source
openmp microbenchmarks
SPEC
UT's papi performance counter API
Heller's rabbit
Monitoring Application Performance Using Hardware Counters
cpu timers japanese
ANL's MPICH performance

Research Sponsors

The acquisition of the Compaq systems and the evaluation research is funded by the Mathematical, Information, and Computational Sciences Division, within the Office of Advanced Scientific Computing Research of the Office of Science, Department of Energy. The application-specific evaluations are also supported by the sponsors of the individual applications research areas.

thd@ornl.gov