ORNL IBM Power4 (p690) evaluation



Recent results (also see our recent Cray X1 results and our SGI Altix results)
Last Modified Thursday, 21-Apr-2005 07:46:32 EDT

The following data were collected in October 2001.

NOTE: The tests on the IBM p690 use early versions of the compilers and libraries, and we expect performance to continue to improve with new releases. Large page support and page affinity will be provided soon and will also improve results.

Oak Ridge National Laboratory (ORNL) is currently performing an in-depth evaluation of the IBM Power 4 (p690) system as part of its evaluation of early systems project. The primary tasks of the evaluation are to

The emphasis of the evaluation is on application-relevant studies for applications of importance to DOE. However, standard benchmarks are still important for comparisons with other systems. The results presented here are from standard benchmarks and some custom benchmarks and, as such, represent only one part of the evaluation. A large IBM Winterhawk II (noted as SP3 in the following tables and graphs) and Compaq Alpha ES40 at ORNL were used for comparison with the IBM p690 (noted as SP4 in the tables and graphs). The results below are in the following categories:

ARCHITECTURE

The present Power4 consists of one node with 16 processors (2 MCMs) sharing memory. Both the Alpha and SP3 consist of four processors sharing memory on a single node. The following table summarizes the main characteristics of the machines.

Specs:
               Alpha SC   SP3      SP4
MHz            667        375      1300
memory/node    2GB        2GB      32GB
L1             64K        64K      32K
L2             8MB        8MB      1.5MB
L3                                 128MB
peak Mflops    2*MHz      4*MHz    4*MHz
peak mem BW    5.2GBs     1.6GBs   200+ GBs ?

alpha: 2 buses @ 2.6 GBs each

For the Alpha, nodes are interconnected with a Quadrics switch organized as a fat tree. The SP3 nodes are interconnected with cross-bar switches in an Omega-like network.
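The "peak Mflops" row of the specs table is given only as a multiple of the clock rate. As a quick check, the sketch below works out the products per processor. The dictionary layout and function name are illustrative only; the clock rates and multipliers are copied from the table.

```python
# Clock rates (MHz) and flops per cycle, copied from the specs table.
# The dictionary layout and names are illustrative, not from the original page.
specs = {
    "Alpha": {"mhz": 667, "flops_per_cycle": 2},
    "SP3": {"mhz": 375, "flops_per_cycle": 4},
    "SP4": {"mhz": 1300, "flops_per_cycle": 4},
}


def peak_mflops(machine):
    """Peak Mflops per processor = flops per cycle * clock rate in MHz."""
    s = specs[machine]
    return s["flops_per_cycle"] * s["mhz"]


for name in specs:
    print(name, peak_mflops(name), "Mflops peak per processor")
```

These per-processor peaks are the reference point for any "percentage of peak" figure quoted for compute kernels later on.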

BENCHMARKS

We have used widely available benchmarks in combination with our own custom benchmarks to characterize the performance of the SP4. Some of the older benchmarks may need to be modified for these newer, faster machines -- increasing repetitions to avoid 0 elapsed times, increasing problem sizes to test out-of-cache performance. Unless otherwise noted, the following compiler switches were used on the Alpha and SP:

Alpha: -O4 -fast -arch ev6
SP: -O4 -qarch=auto -qtune=auto -qcache=auto -bmaxdata:0x70000000

Benchmarks were in C, FORTRAN, and FORTRAN90/OpenMP. We also compared performance with the vendor runtime libraries, cxml (Alpha) and essl (SP). We used the following benchmarks in our tests:

For both the Alpha and the SP, gettimeofday() provides microsecond wall-clock time (though one has to be sure the MICROTIME option is set in the Alpha OS kernel). Both have high-resolution cycle counters as well, but the Alpha cycle counter is only 32 bits wide, so it rolls over in less than 7 seconds. For distributed benchmarks (MPI), both systems provide a hardware-synchronized MPI_Wtime() with microsecond resolution. On the Alpha, MPI_Wtime is frequency synchronized, but initial offsets are only approximate. (On the Alpha, it appears MPI_Init tries to provide an initial zero offset to the Elan counters on each node when an MPI job starts. On the SP3, we discovered several nodes that were not synchronized; a patch was eventually provided.)
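The rollover time quoted for the Alpha cycle counter follows from the counter width and the clock rate. A minimal sketch of that arithmetic, assuming the counter increments once per cycle at the 667 MHz clock listed in the specs table (the function name is illustrative):

```python
def rollover_seconds(counter_bits, clock_hz):
    """Seconds until a free-running counter of the given width wraps around,
    assuming it increments once per clock cycle."""
    return 2 ** counter_bits / clock_hz


# 32-bit counter on the 667 MHz Alpha
alpha_rollover = rollover_seconds(32, 667e6)
print(f"Alpha cycle counter wraps after about {alpha_rollover:.2f} s")
```

Any interval timed with this counter therefore has to be shorter than the wrap time, or the wraps have to be counted separately.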

We recently received an essl tuned for the power4. The following graph plots the power3 essl vs the new power4 essl (3.3) for the Euroben mod2b benchmark. The newer essl provides significant improvement. Also plotted is the performance of mod2b without essl, comparing arch=pwr3 versus arch=pwr4 for xlf (both run on the power4). The large difference between the "compiled from source" version and the ESSL version is typical -- even the best compilers are quite sensitive to the details of the source code organization for this class of algorithms.

Memory performance

Both the SP3 and the Alpha have 64 KB L1 caches and 8 MB L2 caches. The SP4 has a 32 KB L1 (FIFO), a 1.4 MB L2 (shared between 2 processors), and a 128 MB L3. The following figure shows the data rates for a simple FORTRAN loop to load (y = y+x(i)), store (y(i)=1), and copy (y(i)=x(i)), for different vector sizes. Data is also included for four threads. (Beware of the linear interpolation between data points, and note we need to extend the test beyond 128 MB to get out of the SP4 L3 cache. It has been suggested that the "dcbz" SP4 instruction, which allocates the target cache line in the L2 without loading it from memory first, could further improve SP4 performance. Also see McCalpin's stream2 benchmark.)

At the tail end of the graph above, the program starts fetching data from main memory. For load, a single Alpha thread is reading data at 1.7 GBs, the SP3 at 787 MBs. For four threads, the load per-cpu rate drops to 811 MBs for the Alpha and 322 MBs for the SP. The aggregate rate for 4 CPUs from the test is then 3.2 GBs for the Alpha compared to 1.3 GBs for the SP.
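The aggregate figures above are simply the four-thread per-CPU rates multiplied by the number of CPUs. The sketch below reproduces them and also shows how much of the single-thread rate each CPU retains under contention. The variable and function names are illustrative; the rates are copied from the paragraph above.

```python
# Per-CPU load rates (MB/s) with four threads, copied from the text above.
per_cpu_mbs = {"Alpha": 811, "SP": 322}
# Single-thread load rates (MB/s): 1.7 GB/s for the Alpha, 787 MB/s for the SP3.
single_mbs = {"Alpha": 1700, "SP": 787}


def aggregate_mbs(rate_per_cpu, n_cpus=4):
    """Aggregate rate when each of n_cpus sustains rate_per_cpu."""
    return rate_per_cpu * n_cpus


for name, rate in per_cpu_mbs.items():
    total = aggregate_mbs(rate)
    retained = rate / single_mbs[name]
    print(f"{name}: {total} MB/s aggregate ({total / 1000:.1f} GB/s), "
          f"per-CPU rate is {retained:.0%} of the single-thread rate")
```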

The stream benchmark is a program that measures main memory throughput for several simple operations. The aggregate data rate for multiple threads is reported in the following table. Recall that the "peak" data rate for the Alpha is 5.2 GBs and for the SP3 is 1.6 GBs. Data for the 16-way SP3 (375 MHz, Nighthawk II) is included too. Data for the Alpha ES45 (1 GHz) is obtained from the stream database. Data for p690/sp4 is with affinity enabled (6/1/02).

             copy    scale   add     triad
alpha1       1339    1265    1273    1383
alpha2       1768    1711    1839    1886
alpha3       2279    2280    2257    2308
alpha4       2375    2323    2370    2427
es45-1       1946    1941    1978    1978
es45-2       2615    2592    2825    2850
es45-4       3487    3487    3527    3584
SP3 1        523     561     581     583
SP3 2        686     797     813     909
SP3 3        833     805     897     914
SP3 4        824     799     889     914
SP3/16-1     486     494     601     601
SP3/16-2     953     969     1161    1153
SP3/16-3     1422    1408    1775    1757
SP3/16-4     1703    1724    1955    1982
SP3/16-8     4850    4601    5060    5211
SP3/16-16    5475    5325    5976    5924
SP4-1        1774    1860    2098    2119
SP4-2        3513    3684    4166    4225
SP4-4        7170    7463    8238    8075
SP4-8        13101   13300   14986   14825
SP4-16       21598   21156   24106   23609
SP4-32       27271   26072   29539   28750
stream (f90/omp) multiple threads (aggregate MB/sec)

The following graphs the triad bandwidth from the previous table.
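As a rough cross-check of the stream table against the quoted "peak" data rates (5.2 GBs for the Alpha, 1.6 GBs for the SP3), the sketch below computes what fraction of that peak the four-thread triad rates reach. The names are illustrative, and the comparison assumes the quoted peaks and the stream rates are both in units of 10^6 bytes per second.

```python
# Quoted peak memory bandwidth per node (MB/s) and the measured
# four-thread triad rates (MB/s) from the stream table (rows alpha4 and SP3 4).
peak_mbs = {"alpha": 5200, "SP3": 1600}
triad_4_threads_mbs = {"alpha": 2427, "SP3": 914}


def fraction_of_peak(measured, peak):
    """Measured rate as a fraction of the quoted peak rate."""
    return measured / peak


fractions = {
    machine: fraction_of_peak(triad_4_threads_mbs[machine], peak_mbs[machine])
    for machine in peak_mbs
}
for machine, frac in fractions.items():
    print(f"{machine}: {frac:.0%} of quoted peak memory bandwidth")
```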


The p690/sp4 results are from June, 2002 with affinity supported by AIX. McCalpin reports the following improvement of aggregate STREAM with affinity on a 32-processor p690 (4/15/02).

MBs      no       aff.
Copy     22421    28611    22%
Scale    21411    28994    26%
Add      24830    32222    23%
Triad    25501    32249    21%
It is expected that the power4 will show even higher memory bandwidth when using 16 MB (large) pages.
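The percentage column in the affinity table can be read two ways, so it is worth checking which convention reproduces the printed values. In the sketch below (function names are illustrative), expressing the gain as a fraction of the with-affinity rate matches the table, while expressing it relative to the no-affinity baseline gives larger numbers.

```python
# Aggregate STREAM rates (MB/s) on the 32-processor p690,
# without and with affinity, copied from the table above.
stream = {
    "Copy": (22421, 28611),
    "Scale": (21411, 28994),
    "Add": (24830, 32222),
    "Triad": (25501, 32249),
}


def gain_vs_affinity(no_aff, aff):
    """Gain as a percentage of the with-affinity rate."""
    return 100.0 * (aff - no_aff) / aff


def gain_vs_baseline(no_aff, aff):
    """Gain as a percentage of the no-affinity rate."""
    return 100.0 * (aff - no_aff) / no_aff


for op, (no_aff, aff) in stream.items():
    print(f"{op:6s} {gain_vs_affinity(no_aff, aff):5.1f}% of affinity rate, "
          f"{gain_vs_baseline(no_aff, aff):5.1f}% over baseline")
```

Read against the no-affinity baseline, the improvement from affinity is therefore somewhat larger than the printed percentages suggest.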

The following figure shows the Mflops for one processor for various problem sizes for the EuroBen mod2f, a 1-D FFT. Data access is irregular, but cache boundaries are still apparent.

The hint benchmark measures computation and memory efficiency as the problem size increases. (This is hint Version 1, 1994.) The following graph shows the performance of a single processor for the Alpha (66.9 MQUIPS), SP3 (27.3 MQUIPS), and SP4 (74.9 MQUIPS). The L1 and L2 cache boundaries are visible, as well as the SP4's L3 (128 MB).

The lmbench benchmark measures various UNIX and system characteristics. Here are some preliminary numbers from lmbench version 2 for runs on a service and compute node of the Alpha and SP3/4. (Results from the previous lmbench version can be found here.) Open/close times are much slower for the Alpha, though file create/delete are faster on the Alpha. The cache/memory latencies reported by lmbench are

          alpha   sp3
L1        4       5
L2        27      32
memory    210     300
latency in nanoseconds

LOW LEVEL BENCHMARKS

The following table compares the performance of the Alpha and SP for basic CPU operations. These numbers are from the first 14 kernels of EuroBen's mod1ac. The 14th kernel is a rough estimate of peak FORTRAN performance since it has a high re-use of operands.

                alpha   sp3    sp4
broadcast       516     368    1946
copy            324     295    991
addition        285     186    942
subtraction     288     166    968
multiply        287     166    935
division        55      64     90
dotproduct      609     655    2059
X=X+aY          526     497    1622
Z=X+aY          477     331    1938
y=x1x2+x3x4     433     371    2215
1st ord rec.    110     107    215
2nd ord rec.    136     61     268
2nd diff        633     743    1780
9th deg. poly   701     709    2729
basic operations (Mflops) euroben mod1ac

The following table compares the performance of various intrinsics (EuroBen mod1f). For the SP, it also shows the effect of -O4 optimization versus -O3.

        alpha   sp3 -O4   sp3 -O3   sp4 -O4
x**y    8.3     1.8       1.6       7.1
sin     13      34.8      8.9       64.1
cos     12.8    21.4      7.1       39.6
sqrt    45.7    52.1      34.1      93.9
exp     15.8    30.7      5.7       64.3
log     15.1    30.8      5.2       59.8
tan     9.9     18.9      5.5       35.7
asin    13.3    10.4      10.2      26.6
sinh    10.7    2.3       2.3       19.5
intrinsics (Mcalls/s) euroben mod1f (N=10000)

The following table compares the performance (Mflops) of a simple FORTRAN matrix (REAL*8 400x400) multiply with the performance of DGEMM from the vendor math library (-lcxml for the Alpha, -lessl for the SP). Note that the SP4 -lessl (3.3) is tuned for the Power4. The Mflops for 1000x1000 Linpack are as reported from netlib, except that the sp4 number is from IBM.

          alpha   sp3    sp4
ftn       72      45     220
lib       1182    1321   3174
linpack   1031    1236   2894

In the following graph, the performance of the ATLAS DGEMM (xdl3blastst -F ) is compared with the vendor libraries. The plot includes data from the new Compaq ES45 (1 GHz). The p690 achieves only 65% of peak because of insufficient rename registers. The Alphas and sp3 get a much higher percentage of peak.
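The "percentage of peak" comparison can also be worked out from the vendor-library DGEMM row ("lib") in the table above, using the per-processor peaks implied by the specs table (2*MHz for the Alpha, 4*MHz for the SP3 and SP4). This is only a sketch with illustrative names. Note that the 65% quoted above refers to the graph, so the table-based value for the sp4 need not match it exactly.

```python
# Clock rates and flops per cycle from the specs table; vendor-library
# DGEMM rates (Mflops) from the "lib" row of the table above.
clock_mhz = {"alpha": 667, "sp3": 375, "sp4": 1300}
flops_per_cycle = {"alpha": 2, "sp3": 4, "sp4": 4}
dgemm_lib_mflops = {"alpha": 1182, "sp3": 1321, "sp4": 3174}


def percent_of_peak(machine):
    """Vendor-library DGEMM rate as a percentage of per-processor peak."""
    peak = clock_mhz[machine] * flops_per_cycle[machine]
    return 100.0 * dgemm_lib_mflops[machine] / peak


for machine in dgemm_lib_mflops:
    print(f"{machine}: {percent_of_peak(machine):.0f}% of peak")
```

With these inputs the Alpha and sp3 land in the high 80s while the sp4 is near 61%, which is consistent with the qualitative statement above.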

The following tables compare optimized FORTRAN performance (no essl/cxml) for Euroben mod2a, a matrix-vector product computed with dot products (ddot) and with axpy operations.

--------------------------------------------------
                 alpha       sp3         sp4
Problem size |  MxV-ddot |  MxV-ddot |  MxV-ddot |
   m  |   n  | (Mflop/s) | (Mflop/s) | (Mflop/s) |
--------------------------------------------------
  100 |  100 |   411.7   |   423.9   |   783.4   |
  200 |  200 |   442.3   |   416.8   |   808.4   |
  500 |  500 |    66.1   |    18.7   |   148.4   |
 1000 | 1000 |    31.8   |    17.1   |    91.8   |
 2000 | 2000 |    27.5   |    16.1   |    69.9   |
--------------------------------------------------

--------------------------------------------------
                 alpha       sp3         sp4
Problem size |  MxV-axpy |  MxV-axpy |  MxV-axpy |
   m  |   n  | (Mflop/s) | (Mflop/s) | (Mflop/s) |
--------------------------------------------------
  100 |  100 |   101.9   |   401.9   |   1053.   |
  200 |  200 |   227.4   |   421.1   |   1092.   |
  500 |  500 |   205.4   |   411.9   |   857.5   |
 1000 | 1000 |   205.6   |   274.5   |   746.8   |
 2000 | 2000 |    66.9   |   207.9   |   730.2   |
--------------------------------------------------

The following table compares the single processor performance (Mflops) of the Alpha and IBMs for the Euroben mod2g, a 2-D Haar wavelet transform test.

|------------------------------------------------------
|     Order     |   alpha   |    SP3    |    SP4    |
|   n1  |   n2  | (Mflop/s) | (Mflop/s) | (Mflop/s) |
|------------------------------------------------------
|   16  |   16  |  142.56   |  79.629   |  126.42   |
|   32  |   16  |  166.61   |  96.690   |  251.93   |
|   32  |   32  |  208.06   |  115.43   |  301.15   |
|   64  |   32  |  146.16   |  108.74   |  297.26   |
|   64  |   64  |  111.46   |  111.46   |  278.45   |
|  128  |   64  |  114.93   |  101.49   |  251.90   |
|  128  |  128  |  104.46   |  97.785   |  244.45   |
|  256  |  128  |  86.869   |  64.246   |  179.43   |
|  256  |  256  |  71.033   |  44.159   |  103.52   |
|  512  |  256  |  65.295   |  41.964   |  78.435   |
|------------------------------------------------------

The following plots the performance (Mflops) of Euroben mod2b, a dense linear system test, for both optimized FORTRAN and using the BLAS from the vendor library (cxml/essl).

The following plots the performance (Mflops) of Euroben mod2d, a dense eigenvalue test, for both optimized FORTRAN and using the BLAS from the vendor library. For the Alpha, -O4 optimization failed, so this data uses -O3.

The following plots the performance (iterations/second) of Euroben mod2e, a sparse eigenvalue test.

We ran a number of tests of the NCAR CCM Column Radiation Model (CRM), using different compiler options, libraries, and problem sizes (300+ test cases). We used the executables from the SP3 on the SP4. The SP4 was 1.1 to 2.6 times faster than the SP3. The FORTRAN code is dominated by exponentials and square roots. Also see Worley's PSTSW climate code test results.

EuroBen's mod3a tests matrix computation with file I/O (out of core). The following tables compare the Alpha with the IBM. The run was made using /tmp, and no attempt was made to optimize I/O performance.

Mod3a: Out-of-core Matrix-vector multiplication

Alpha
--------------------------------------------------------------------------
  Row   | Column | Exec. time  | Mflop rate | Read rate | Write rate |
  (n)   |  (m)   |   (sec)     | (Mflop/s)  |  (MB/s)   |   (MB/s)   |
--------------------------------------------------------------------------
  25000 |  20000 | 0.40751E-01 |   24.539   |  226.92   |   62.082   |
  50000 |  20000 | 0.80691E-01 |   24.786   |  225.43   |   68.371   |
 100000 | 100000 | 0.43051     |   46.455   |  250.62   |   73.322   |
 250000 | 100000 | 1.4878      |   33.607   |  253.08   |   78.265   |
--------------------------------------------------------------------------

SP3
--------------------------------------------------------------------------
  25000 |  20000 | .34146      |   2.9286   |  253.80   |   .72046   |
  50000 |  20000 | .74303      |   2.6917   |  255.00   |   .62662   |
 100000 | 100000 | 1.4190      |   14.094   |  248.45   |   .88511   |
 250000 | 100000 | 3.5659      |   14.021   |  248.31   |   1.1102   |
--------------------------------------------------------------------------

p690
--------------------------------------------------------------------------
  25000 |  20000 | .16075      |   6.2207   |  39.468   |   109.57   |
  50000 |  20000 | .25080      |   7.9744   |  497.94   |   1.8376   |
 100000 | 100000 | .72657      |   27.526   |  160.30   |   3.9978   |
 250000 | 100000 | 1.0282      |   48.626   |  400.84   |   13.699   |
--------------------------------------------------------------------------

This should not be considered a rigorous test of the I/O subsystem.

SHARED-MEMORY BENCHMARKS

Both the Alpha and IBMs consist of a cluster of shared-memory nodes, each node with four processors sharing a common memory (16 for sp4). We tested the performance of a shared-memory node with various C programs with explicit thread calls and with FORTRAN OpenMP codes.

The following table shows the performance of thread/join in C as the master thread creates two, three, and four threads. The test repeatedly creates and joins threads.

threads   alpha   sp3    sp4
2         47.7    96     44
3         165     152    68
4         251     222    97
thread create/join time in microseconds (C)

Often, it is more efficient to create the threads once, and then provide them work as needed. I suspect this is what FORTRAN OpenMP is doing for "parallel do". The following table is the performance of parallel do. Revised 9/8/03

threads   alpha   sp3    sp4
2         2.1     12.7   6.3
3         3.4     15.3   8.4
4         5.2     19.5   9.4
OpenMP parallel DO (us)

Notice that the performance is much better than the explicit thread calls. We've also done some testing with the OpenMP microbenchmarks. The following compares OpenMP performance between the sp4 and the SGI Altix.
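The contrast drawn above, between creating and joining threads for every piece of work and creating them once and then handing them work, can be illustrated with a small sketch. This is a Python analogy using the standard library, not the C or FORTRAN benchmark code used for the measurements; the function names and the trivial `work` task are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def work(results, i):
    """Trivial stand-in for one thread's share of the work."""
    results[i] = i * i


def create_join_each_time(n_threads, repetitions):
    """Create and join a fresh set of threads on every repetition."""
    results = [0] * n_threads
    for _ in range(repetitions):
        threads = [
            threading.Thread(target=work, args=(results, i))
            for i in range(n_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return results


def reuse_pool(n_threads, repetitions):
    """Keep one pool of at most n_threads workers alive and reuse it."""
    results = [0] * n_threads
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for _ in range(repetitions):
            # list() waits for every task before the next repetition starts.
            list(pool.map(lambda i: work(results, i), range(n_threads)))
    return results


print(create_join_each_time(4, 100))
print(reuse_pool(4, 100))
```

Both variants compute the same result; the second avoids paying the thread creation cost on every repetition, which is the effect the two tables above point to.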

The following table shows the time required to lock-unlock using pthread_mutex_lock with various numbers of threads. For the IBMs we use setenv SPINLOOPTIME 5000.

threads   alpha   sp3    sp4
1         0.26    0.6    0.3
2         1.5     1.4    1.3
3         17.8    2.1    1.6
4         29.6    2.9    3.8
time for lock/unlock (us)

The following table compares the performance of a simple C barrier program using a single lock and spinning on a shared variable along with pthread_yield. A version based on condition variables was an order of magnitude slower.

threads   alpha   sp3    sp4
1         0.25    0.6    0.3
2         1.36    4.4    1.9
3         9.9     20.5   3.1
4         65      34.6   3.7
C barrier times (us)

The following table illustrates linear speedup for an embarrassingly parallel integration. A C code with explicit thread management is compared with FORTRAN OpenMP. Both just used -O optimization.

          fortran                C
threads   alpha   sp3    sp4     alpha   sp3    sp4
1         252     102    251     166     52     216
2         502     204    501     331     104    432
3         748     306    752     496     157    648
4         990     408    1002    657     206    864
8                        1999                   1725
16                       3565                   3429
rectangle rule (Mflops) -O optimization

The following table illustrates an explicit thread implementation of Cholesky factorization of a 1000x1000 double precision matrix in C (-O optimization).

threads   alpha   sp3    sp4
1         150     125    350
2         269     238    631
3         369     353    1007
4         435     390    1306
cholp 1k matrix factor (mflops) -O optimization

The following table compares FORTRAN OpenMP for the Alpha and SP doing a simple, double-precision Jacobi iteration. Note that the SP3 slows for 4 threads.

problem size   500x500                1000x1000
threads        alpha   sp3    sp4     alpha   sp3    sp4
1              175     114    247     27      17     62
2              342     284    466     42      27     117
3              503     421    650     50      41     160
4              655     324    850     61      41     198
iterations per second
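The scaling behaviour in the Cholesky and Jacobi tables is easier to compare as speedup and parallel efficiency relative to one thread. The sketch below computes both from the tabulated rates and flags the SP3 slowdown noted for the 500x500 Jacobi case. The helper names are illustrative; the numbers are copied from the tables above.

```python
# Cholesky factorization rates (Mflops) for 1 to 4 threads.
cholesky_mflops = {
    "alpha": [150, 269, 369, 435],
    "sp3": [125, 238, 353, 390],
    "sp4": [350, 631, 1007, 1306],
}
# Jacobi iterations per second on the SP3, 500x500 problem, 1 to 4 threads.
jacobi_sp3_500 = [114, 284, 421, 324]


def speedups(rates):
    """Rate with k threads divided by the single-thread rate."""
    return [r / rates[0] for r in rates]


def efficiencies(rates):
    """Speedup divided by the thread count (1.0 means linear scaling)."""
    return [s / (i + 1) for i, s in enumerate(speedups(rates))]


def first_slowdown(rates):
    """Thread count at which the rate first drops, or None if it never does."""
    for i in range(1, len(rates)):
        if rates[i] < rates[i - 1]:
            return i + 1
    return None


for machine, rates in cholesky_mflops.items():
    print(machine, [f"{e:.2f}" for e in efficiencies(rates)])
print("SP3 Jacobi 500x500 first slows at", first_slowdown(jacobi_sp3_500), "threads")
```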

MESSAGE-PASSING BENCHMARKS

Internode communication can be accomplished with IP, PVM, or MPI. We report MPI performance over the Alpha Quadrics network and the IBM SP. The CPUs on each node (4 CPUs) share a single network interface. However, each CPU is a unique MPI end point, so one can measure both inter-node and intra-node communication. The following table summarizes the measured communication characteristics of the Alpha, SP3, and the SP4. The SP4 is currently based on the Colony switch via PCI.

                         alpha   sp3    sp4
latency (1 way, us)      5.4     16.3   17
bandwidth (echo, MBs)    199     139    174 (345 MBs for dual plane)
 (exchange, MBs)         167     180    215 (367 MBs for dual plane)
MPI within a node        622     512    2186
latency (min, 1 way, us) and bandwidth (MBs)

                      latency         bandwidth
                      (min 1 way us)  (MBs)
alpha node            5.5             198
alpha cpu             5.8             623
alpha IP-sw           123             77
alpha IP-gigE/1500    76              44
alpha IP-100E         70              11
sp3 node              16.3            139
sp3 cpu               8.1             512
sp4 node              17              174 (PCI/Colony)
sp4 cpu               3               2186
sp3 IP-sw             82              46
sp3 IP-gigE/1500      91              47
sp3 IP-gigE/9000      136             84
sp3 IP-100E           93              12
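To see how the latency and bandwidth figures above interact for different message sizes, one can use the common first-order model time = latency + size / bandwidth. This model is an assumption brought in for illustration, not something measured or stated here, and it takes 1 MB as 10^6 bytes. The function names are likewise illustrative.

```python
# One-way latency (us) and echo bandwidth (MB/s) from the first table above.
links = {
    "alpha": {"latency_us": 5.4, "bandwidth_mbs": 199},
    "sp3": {"latency_us": 16.3, "bandwidth_mbs": 139},
    "sp4": {"latency_us": 17.0, "bandwidth_mbs": 174},
}


def transfer_time_us(nbytes, latency_us, bandwidth_mbs):
    """Assumed linear model: time = latency + size / bandwidth.
    With 1 MB = 10**6 bytes, 1 MB/s equals 1 byte per microsecond."""
    return latency_us + nbytes / bandwidth_mbs


def effective_bandwidth_mbs(nbytes, latency_us, bandwidth_mbs):
    """Bandwidth actually delivered for a message of nbytes under the model."""
    return nbytes / transfer_time_us(nbytes, latency_us, bandwidth_mbs)


def half_bandwidth_size(latency_us, bandwidth_mbs):
    """Message size (bytes) at which the model delivers half the peak bandwidth."""
    return latency_us * bandwidth_mbs


for name, link in links.items():
    n_half = half_bandwidth_size(**link)
    print(f"{name}: half of peak bandwidth at about {n_half:.0f} bytes")
```

Under this model, a higher latency pushes the half-bandwidth message size up, so the SP3 and SP4 need larger messages than the Alpha to approach their asymptotic inter-node rates.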

The following graph shows bandwidth for communication between two processors on the same node using MPI from both EuroBen's mod1h and ParkBench comms1. The SP3 performs better for smaller messages.

The sp4 is presently equipped with a Colony switch for inter-node communication but is limited by the PCI interface at this time (May, 2002). The following graph shows bandwidth for communication between two nodes.

The p690 also supports dual-rail Colony connections that roughly double the bandwidth for large messages, as illustrated in the following graph.

The following table shows the performance of aggregate communication operations (barrier, broadcast, sum-reduction) using one processor per node (N) and all processors on each node (n). Recall that the sp4 has 16 processors per node (the others have 4 per node). Times are in microseconds.

mpibarrier (average us)
cpus   alpha-N   alpha-n   sp3-N   sp3-n   sp4-n
2      7         11        22      10      4
4      7         16        45      20      13
8      8         18        69      157     18
16     9         21        93      230     27
32     11        28        118     329
64               37        145     419

mpibcast (8 bytes)
cpus   alpha-N   alpha-n   sp3-N   sp3-n   sp4-n
2      9.6       12.5      5.4     6.7     3.2
4      10.4      20.3      9.4     9.4     6.2
8      11.4      28.5      13.4    17.5    8.4
16     12.5      32.9      17.0    20.9    9.8
32     13.8      41.4      19.3    24.1
64               48.7      23.6    30.8

mpireduce (SUM, doubleword)
cpus   alpha-N   alpha-n   sp3-N   sp3-n   sp4-n
2      9         11        8       9       6
4      190       207       29      133     9
8      623       350       271     484     13
16     1117      604       683     1132    18
32     3176      1991      1613    2193
64               5921      2841    3449

PARALLEL KERNEL BENCHMARKS

Both ParkBench and EuroBen (euroben-dm) had MPI-based parallel kernels. However, the euroben-dm communication model was to have the processes do all of their sends before issuing receives. On the SP, this model resulted in deadlock for the larger problem sizes. The EAGER_LIMIT can be adjusted to make some progress on the SP3, but the deadlocks could not be completely eliminated. MPI buffering on the Alpha was adequate. The maximum MPI buffering on an SP3 node was 64 MBytes; on the Alpha, 191 MBytes.

The following table shows MPI parallel performance of the LU benchmark (64x64x64) for the Alpha and SP. The first column pair uses one processor per node; the second group uses all processors on each node. These tests used standard FORTRAN (no vendor libraries).

       Nodes                 CPUs
       alpha     sp3         alpha     sp3       sp4
2      786.08    617.98      762.92    588.16    1377.38
4      1708.1    1387.03     1604.05   1188.02   2660.97
8      3384.03   2561.97     3265.83   2473.80   5310.63
16     6190.89   5593.18     5556.02   4771.66   9531.77
aggregate Mflops

Results for the FT benchmarks (CLASS=A) follow.

       Nodes             CPUs
       alpha   sp3       alpha   sp3     sp4
4      633     465       580     307     1314
8      1198    925       849     553     2351
16     2221    1890      1019    1056    3603
aggregate Mflops

Results for the NAS SP benchmark follow.

       Nodes             CPUs
       alpha   sp3       alpha   sp3     sp4
4      877     632       734     416     1219
9      2310    1623      1837    1225    2568
16     4344    2920      3143    2252    3939
aggregate Mflops

The following plots the aggregate Mflop performance for ParkBench QR factorization (MPI) of a 1000x1000 double precision matrix. One can compare the performance of optimized FORTRAN versus the vendor libraries (cxml/essl), and the difference in performance when using all processors on a node. Recall that our SP4 has 16 CPUs sharing memory, so we have included data (sp3-16 and sp3-16-essl) from the NERSC 16-way SP3 (375 MHz).

The following graph shows the aggregate Mflops for a multi-grid (MG) kernel from ParkBench/NAS Parallel Benchmark. This is for a 256x256x256 doubleword grid with MPI and Wallcraft's co-array version and also OpenMP on the IBM. Revised 9/3/03.

The following graph shows the aggregate Mflops for a conjugate gradient (CG) kernel (CLASS=A) from NAS Parallel Benchmarks 2.3 using MPI and OpenMP. Revised 9/22/03

We also ran the OpenMP version of the NAS Parallel Benchmarks (PBN-O-3.0b4). The following table compares the performance of three of those benchmarks on the power4 to the NERSC Power3 (seaborg, 16-way shared memory, 375 MHz).

        lu.A             ft.A             sp.A
CPUs    sp3     sp4      sp3     sp4      sp3     sp4
2       675     1466     356     1274     427     1300
4       1356    2974     695     2259     868     2379
8       2231    6370     1339    4166     1724    4264
16      2386    12148    2343    6860     2667    7476
aggregate Mflops
compiled with -O3 -qarch=auto -qtune=auto -qcache=auto -qsmp=omp -qfixed

One should be cautious when comparing these sp4 results to the MPI NAS results presented earlier.

Research Sponsors

Mathematical, Information, and Computational Sciences Division, within the Office of Advanced Scientific Computing Research of the Office of Science, Department of Energy. The application-specific evaluations are also supported by the sponsors of the individual applications research areas.


thd@ornl.gov
Back to Tom Dunigan's page or the ORNL Evaluation of Early Systems page.