Summary of S80 Testing Activities at ORNL

This information was extracted from a summary report to PROBE management on testing performed on behalf of LLNL. The tests were intended to obtain preliminary performance data on gigabit ethernet and the S80, to assist LLNL in deciding the future direction of its HPSS environment.


I. CPU Benchmarks

The initial tests which I ran were intended to answer two questions:

  1. What is the relative CPU performance of the S80 with respect to other AIX platforms, particularly compared to the silver nodes on which LLNL is currently running HPSS?
  2. Is the operating system able to make effective use of all of the CPUs without incurring excessive system overhead?

To get some performance data, I ran the BYTEMark CPU benchmark - I've enclosed a writeup on this benchmark as an attachment. I have attached the initial results of this benchmark (I had sent these back in December, as well).

S80 Benchmarks (excel)

This benchmark was originally written by BYTE Magazine. It performs a suite of 10 tests, and compares the results to a baseline system (a Dell Pentium 90 with 16 MB of RAM and 256K L2 cache). The original benchmark suite was ported to Linux, and later was modified to better work with 64-bit machines, as explained in the README that is included in the sources. The suite consists of integer, floating point, and memory-intensive tests, using real-world algorithms.

Although the first set of runs was mainly intended to verify that the OS could make use of all the CPUs in a heavily CPU-bound mix (which was verified by running the "monitor" and "netpmon" utilities), the S80's performance vs. the other IBM systems raised some questions. In particular, in the floating point tests the S80 was faster than the nighthawk 1 node that I ran the tests on at the San Diego Supercomputer Center.

It was speculated that compiler options might account for at least some of the differences in the results. In order to measure the effects of compiler optimizations, a series of runs was performed, using every compiler variant available on the different AIX platforms (cc, cc_128, cc_r, ...) at 4 different optimization levels (-O, -O2, -O3 and -O4). Each variation of the above was run 3 times and the average of the runs was taken for each of the 10 tests in the suite. The results of this analysis are included as an Excel spreadsheet attachment. [Note that the benchmarks try hard to verify that the results are repeatable, using built-in statistical analysis functions, and varying the number of runs from 5 up to 30 if need be. If the results are still not at a 95% confidence level, the program reports this in the final averaged results].
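
The repeat-until-stable idea in the note above can be sketched as follows. This is a minimal illustration and not BYTEmark's actual code: the function names, the 5% tolerance on the confidence interval half-width, and the hard-coded Student's t table are assumptions made for this example. Only the range of 5 to 30 runs and the 95% confidence level come from the text.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, keyed by degrees of freedom (4..29).
T95 = {
    4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262,
    10: 2.228, 11: 2.201, 12: 2.179, 13: 2.160, 14: 2.145, 15: 2.131,
    16: 2.120, 17: 2.110, 18: 2.101, 19: 2.093, 20: 2.086, 21: 2.080,
    22: 2.074, 23: 2.069, 24: 2.064, 25: 2.060, 26: 2.056, 27: 2.052,
    28: 2.048, 29: 2.045,
}


def confidence_half_width(samples):
    """Half-width of the 95% confidence interval around the sample mean."""
    n = len(samples)
    return T95[n - 1] * statistics.stdev(samples) / math.sqrt(n)


def run_until_stable(run_once, min_runs=5, max_runs=30, tolerance=0.05):
    """Repeat a benchmark until the interval is tight or max_runs is reached.

    run_once is a hypothetical callable returning one score.
    Returns (mean score, number of runs, whether the result was stable).
    """
    samples = [run_once() for _ in range(min_runs)]
    while True:
        mean = statistics.mean(samples)
        stable = confidence_half_width(samples) <= tolerance * mean
        if stable or len(samples) >= max_runs:
            return mean, len(samples), stable
        samples.append(run_once())
```

A caller would pass a function that runs one benchmark iteration and returns its score; an unstable result after 30 runs is flagged through the third return value rather than discarded.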

compiler options (excel)

Note that there was no compiler installed on the F50, and therefore this spreadsheet does not contain any F50 results; the first test used a binary that was built on the S80 and copied over to the F50.

The optimization levels did make a difference - in some cases for the better, and in other cases the more highly optimized code resulted in decreased performance. A summary chart in this spreadsheet contains the "best" value for each processor for each test; the spreadsheet also contains a complete set of charts showing the performance for each compiler variant and optimization level for each benchmark. The two surprises in this set of tests (to me, at least) were how well the silver nodes performed, and the performance of the winterhawk nodes in the memory-intensive tests.

One additional test was run on the IBM winterhawk node, in which the particular node architecture was targeted by the compiler options. The results from this run were slightly worse than those obtained from the non-targeted optimized code, so I did not pursue this avenue any further.

These results should be viewed only as an indicator of performance - it was not my intent to do an exhaustive benchmark comparison of the processors involved, but rather to determine whether there was a gross difference between, say, optimization levels -O and -O4, or whether the floating point performance of the nighthawk would be better than that of the S80 at higher optimization levels.

II. Memory Benchmarks

In order to compare the memory performance of the systems, I ran the STREAM benchmark (a standard benchmark), as well as a simple memory-to-memory copy program called "memcpy", which does block copies of a fixed number of bytes between two dynamically allocated buffers using the "memcpy" library function. The tests were run with the default buffer size (1300K + 53 bytes), and then with a larger buffer size (20000K + 53 bytes) (1K=1024) to guarantee that the buffers would not fit into L1 or L2 cache. Each test was run 3 times on each of the systems, and the results averaged. A copy of the spreadsheet comparing these results is attached. The winterhawk node was the clear winner in this area when caching was avoided.
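
To make the copy test concrete, here is a rough Python analogue. The original "memcpy" program is not shown in this report, so everything below is an assumption for illustration: the function name, the repeat count, and the use of bytearray slice assignment in place of the C library call. Only the two buffer sizes and the 1K=1024 convention come from the text.

```python
import time

K = 1024  # the report defines 1K as 1024 bytes

DEFAULT_SIZE = 1300 * K + 53   # 1,331,253 bytes
LARGE_SIZE = 20000 * K + 53    # 20,480,053 bytes


def copy_rate(size, repeats=20):
    """Return the block-copy rate in MB/s (1 MB = 1048576 bytes)."""
    src = bytearray(size)
    dst = bytearray(size)
    start = time.perf_counter()
    for _ in range(repeats):
        # Slice assignment copies the entire source buffer into dst.
        dst[:] = src
    elapsed = time.perf_counter() - start
    return size * repeats / elapsed / (K * K)
```

Calling copy_rate(DEFAULT_SIZE) and copy_rate(LARGE_SIZE) and comparing the two rates gives a feel for how much of the smaller test is served from cache on a given machine.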

STREAM benchmark (excel)
memcpy benchmark (excel)
comparison (excel)

III. Gigabit Ethernet Benchmarks

Tests were run with the IBM Gigabit Ethernet-SX PCI adapters on the S80, H70 and F50. Initially, since the adapters were all connected to a switch which did not support jumbo packets, tests were run using 1500 byte packets between the S80 and F50. I did not create a spreadsheet with these results, but I've summarized below the results for the highest rates that I saw:

S80/F50 TTCP Transfer Results

Direction   Socket Buffer   Socket Buffer   Xfer     Xfer         SysCPU%   SysCPU%
            Size (sender)   Size (recvr)    Length   Rate         sender    receiver
F50->S80    524288          1048576         32768    61.16 MB/s   ~50%      ~8-10%
S80->F50    524288          1048576         32768    24.01 MB/s   ~2-5%     ~20%

No matter what I tried, I was unable to improve the rate from the S80 to the F50. I believe the limitation is on the F50 side, perhaps due to a driver bug. The F50 is currently running the release version of the PCI Ethernet driver, and both the S80 and H70 have the latest patches installed. This might be worth investigating further; however, initially I was just trying to get some baseline numbers for the gigE adapters, and was more interested in trying to get jumbo frames working, so I didn't pursue this at the time.

In order to test jumbo frames, we hooked up a dedicated ethernet adapter on both the S80 and H70 in a direct-connect private network. As you recall, we had some problems with AIX driver patch levels in trying to get the gigabit ethernet cards to support jumbo packets, due in part to the different AIX levels on the two systems, and Dan and Stan upgraded the H70 to AIX 4.3.3 in order to run the same level of AIX on both systems. This in turn caused a problem due to an AIX bug that broke integrated login.

A spreadsheet containing the results of the tests between the S80 and H70 is included. The tests were run using TTCP and "monitor" at the same time, under the control of a perl program which was used to start and stop "monitor" and collect the relevant output for each run in its own file. Each permutation of socket buffer and transfer length was run 3 times, and the transfer rates and cpu usage statistics were extracted from the logs and averaged. [All of the raw data is available if you are interested].

Each set of tests was run using all power-of-2 values (16K, 32K, 64K, ...) between 16K and 1024K for the socket buffer size (ttcp "-b" option) and between 8K and 1024K for the read/write transfer lengths (ttcp "-l" option). [It seems to be the case on AIX 4.2 and earlier that the amount of data written to the network with a single write can make a significant difference in transfer rates, so each socket buffer size was tested with 8K, 16K, 32K, ... 1024K read/write lengths].
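
The test matrix described above can be enumerated with a short sketch. The helper and constant names are hypothetical, and this is not the perl driver that was actually used; it only reproduces the grid of "-b" and "-l" values given in the text.

```python
K = 1024  # 1K = 1024 bytes


def powers_of_two(low_k, high_k):
    """Byte sizes for every power of 2 from low_k to high_k kilobytes, inclusive."""
    sizes = []
    size = low_k
    while size <= high_k:
        sizes.append(size * K)
        size *= 2
    return sizes


SOCKET_BUFFERS = powers_of_two(16, 1024)  # values for the ttcp "-b" option
XFER_LENGTHS = powers_of_two(8, 1024)     # values for the ttcp "-l" option
RUNS_PER_COMBINATION = 3


def test_matrix():
    """Every (socket buffer size, transfer length) pair, in bytes."""
    return [(b, l) for b in SOCKET_BUFFERS for l in XFER_LENGTHS]
```

This gives 7 socket buffer sizes and 8 transfer lengths, so 56 combinations, or 168 runs per direction at 3 runs each.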

In addition, statistics were collected and graphed separately for the transmitter and receiver in each direction (S80->H70, H70->S80), primarily to compare the system CPU percentages when sending vs. receiving. Note that the CPU percentages are based on the total number of processors (6 on the S80 and 4 on the H70).

There are 4 spreadsheets included, which contain the results in each direction from each perspective (sender, receiver). Each contains a chart with the transfer rates and system CPU percentage at the various socket buffer sizes and transfer lengths. Note that user CPU usage was negligible in all cases, and I didn't bother to graph it, although the numbers are included.

H70 from S80 (excel)
H70 to S80 (excel)
S80 from H70 (excel)
S80 to H70 (excel)

After running all of the permutations of socket buffer size and read/write lengths, I ran a sustained test where I transferred 10 TB of data in one direction (S80->H70), using the best combination of socket buffer size and transfer length as determined above. The test was run as 10 consecutive 1-TB transfers, with the thought in mind that if the driver or card crashed after running for a while, perhaps there would be at least some transfer rate information logged, and also to avoid any potential 32-bit counter overflows in the ttcp code.

As I reported earlier, this test completed in 32.7 hours, and achieved a sustained transfer rate of 91282 KB/s (1K=1024) or 93,472,768 bytes/second (89.14 MB/s, if 1MB is defined as 1048576). User CPU percentage was essentially nil (.2-.3% on both systems), and the system CPU percentage as reported by "monitor" was about 9-10% on the 6-processor S80 (~.5 -.6 of a single processor) and about 21% on the 4-processor H70 (or about .9 of a single processor).
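
The unit conversions in the paragraph above can be checked with a few lines of arithmetic. One assumption is made here that the report does not state: 1 TB is taken as 2^40 bytes, which is the reading that matches the 32.7 hour figure. The variable and function names are illustrative only.

```python
KB = 1024
MB = 1024 * KB
TB = 1024 * 1024 * MB  # assumption: 1 TB = 2**40 bytes

rate_kb_per_s = 91282
rate_bytes_per_s = rate_kb_per_s * KB      # 93,472,768 bytes/second
rate_mb_per_s = rate_bytes_per_s / MB      # about 89.14 MB/s

total_bytes = 10 * TB                      # ten consecutive 1-TB transfers
hours = total_bytes / rate_bytes_per_s / 3600  # about 32.7 hours


def single_cpu_equivalent(system_cpu_percent, processors):
    """Convert a whole-machine CPU percentage to a fraction of one processor."""
    return system_cpu_percent / 100 * processors
```

For example, single_cpu_equivalent(9, 6) and single_cpu_equivalent(10, 6) bracket the S80 figure at roughly 0.54 to 0.6 of one processor, and single_cpu_equivalent(21, 4) gives roughly 0.84 for the H70.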


submitted by Mike Gleicher mkg@san.rr.com


URL http://www.csm.ornl.gov/PROBE/S80.html
Updated: Wednesday, 27-Feb-2002 15:15:15 EST