Texas Memory Testing at ORNL's PROBE Facility

The Texas Memory RAM-SAN 520, a solid state disk device, was tested at the Oak Ridge National Laboratory's PROBE facility. PROBE is a testbed set up by the Department of Energy for testing storage devices and related software, with a focus on the High Performance Storage System (HPSS). In addition to the tests run at the ORNL facility, a separate set of tests was run at the National Energy Research Scientific Computing Center (NERSC) PROBE facility; the results of those tests are described in a separate report.


Texas Memory RAM-SAN 520

Executive Summary

The RAM-SAN 520 is an excellent device, with high bandwidth and very low latency, making it ideally suited for applications with high transactional requirements, as well as for data streaming applications. HPSS latencies caused small, random I/O accesses to perform as well on spinning disk as on the RAM-SAN, even though the RAM-SAN is capable of more I/Os per second than either of the other devices that were tested. The RAM-SAN's relatively high cost makes it less attractive than other Fibre Channel-based disk devices for applications which require both high streaming rates and very large persistent storage. It is, however, particularly well suited for database usage, such as the SFS metadata used by HPSS. We also recommend its use for other HPSS-related operations, such as metadata consistency checking and/or conversion to other databases in the future.

The RAM-SAN 520 was easy to set up and configure, and we were very impressed by its reliability and the quality of the disk emulation microcode. The device was able to handle all of the load that we were able to put on it, and was able to interface with AIX, IRIX and Solaris operating systems without any problems. We also received excellent support from Texas Memory Systems throughout the testing.

Texas Memory RAM-SAN 520 Characteristics

*Texas Memory has doubled the memory capacity since these tests were run

Testing Goals


Testing at the ORNL and NERSC PROBE facilities was intended to provide information in several areas:

A series of tests, described in the following sections, was run on the RAM-SAN.



Test Configuration

Test bed configuration diagram

The test configuration consisted of the following:

Setup/Compatibility Results

Installation was simple, as the device uses a standard single phase 110 volt AC connector, and fits in a standard 19" rack, with no special climate requirements.

The Texas Memory RAM-SAN is able to communicate via either an RS232 direct-attached TTY, or a terminal program on an Ethernet-attached remote computer system. At ORNL, we used both a VT100-compatible direct-attached terminal and a terminal window on a Solaris system. The IP address was configured via the Maintenance Console; however, the RAM-SAN also supports configuration of the IP address via RARP and BOOTP. We did not test either of these capabilities.

The RAM-SAN Maintenance Console, or MCP, has a simple, easy-to-use, and well-documented diagnostic command set, which we used to verify proper operation of the device. Almost all of the commands worked as documented, except for the diagmem command, which was unavailable.

The commands to partition the memory, and to assign partitions to particular ports, were clearly explained with good examples, and worked as documented - configuring the device for the tests took only a short time. In addition, as the tests were being run, reconfiguring the device was also straightforward and easy to do.

At the time that the RAM-SAN was installed in the PROBE testbed, it had never been tested with the IBM AIX operating system. When the device was installed, it initially did not work: AIX was unable to configure the device. With the help of Texas Memory personnel, we enabled SCSI command tracing for the port (via the RAM-SAN console), and TMS personnel quickly determined that the problem was a SCSI command that had not yet been implemented in the microcode. Within a short time, they provided a patch and a Solaris-based utility to download the patch into the RAM-SAN microcode, and this fixed the problem. By using a console command, we were able to save the patched microcode in flash memory, so that it was not necessary to reload the patch for the duration of the testing, either at ORNL or at the NERSC PROBE facility.

Once this initial problem was resolved, the RAM-SAN performed flawlessly for the remainder of the testing, at both PROBE facilities. The stability of both the hardware and the disk emulation software was very impressive. One feature of the RAM-SAN which proved very useful during the testing was its ability to provide a realtime display of the data transfer activity for each port.


Raw I/O Test Results

HPSS uses raw logical volumes for its disk devices. In order to ascertain the maximum data rates that we could obtain on raw devices independently of HPSS, we ran a series of tests on the RAM-SAN 520, a Sun T3 (it was called a T300 at the time of these tests), and an IBM Serial Storage Architecture (SSA) RAID device.

An important tuning parameter for disk devices on AIX is the queue depth, which determines how many commands are queued for the device. If this parameter is set too low, the latency of issuing new requests to the device can significantly impair the transfer rates; if the parameter is set too high, the overhead of selecting a request to process can also reduce the transfer rates. Different devices have different maximum queue depths, which can be determined via the AIX lsattr -R -l device_name -a queue_depth command, or (in most, but not all, cases) via the interactive SMIT administration tool. The device characteristics can be changed either via SMIT or by using the chdev command; depending on the characteristic that is being changed, this may require the device to be varied offline prior to issuing the command.
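As an illustration of the two commands mentioned above, the following Python sketch builds (but does not run) the lsattr query quoted in the text and a chdev call that sets queue_depth. The device name hdisk4 is a hypothetical placeholder, and the helper names are ours; the sketch assumes the commands would be run by an administrator on an AIX host.

```python
import shlex


def lsattr_queue_depth_range(device):
    """Build the lsattr query from the text that lists allowed queue depths."""
    return ["lsattr", "-R", "-l", device, "-a", "queue_depth"]


def chdev_queue_depth(device, depth):
    """Build a chdev call that sets queue_depth; the command is not executed."""
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    return ["chdev", "-l", device, "-a", f"queue_depth={depth}"]


if __name__ == "__main__":
    device = "hdisk4"  # hypothetical device name, not taken from the report
    print(shlex.join(lsattr_queue_depth_range(device)))
    for depth in (1, 4, 16):  # the depths shown in the sample charts
        print(shlex.join(chdev_queue_depth(device, depth)))
```

Building the argument lists separately from executing them keeps the sketch safe to run on any machine; on AIX the lists could be passed to subprocess.run once the device has been taken offline if required.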

Our tests used DD to write 1 GB of data onto a raw logical volume on each of the devices, using /dev/zero as the source device, and then to read 1 GB of data back from the device, writing to /dev/null. The tests varied the queue depth of the device from 1 to 16, and varied the DD buffer size from 1 KB (1K=1024) to 32 MB, by doubling the buffer size for each test. Increasing the queue depth beyond 16 did not improve the I/O rates, so we stopped at that point.
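The buffer-size sweep described above can be sketched in Python as follows. This is a minimal illustration, not the script used for the tests: the raw logical volume path /dev/rtestlv is a hypothetical name, and the sketch only prints the dd command lines rather than running them.

```python
GIB = 1024 ** 3


def buffer_sizes(smallest=1024, largest=32 * 1024 ** 2):
    """Return buffer sizes from 1 KB to 32 MB, doubling at each step."""
    sizes = []
    size = smallest
    while size <= largest:
        sizes.append(size)
        size *= 2
    return sizes


def dd_command(source, target, block_size, total_bytes=GIB):
    """Build a dd command line that moves total_bytes in block_size chunks."""
    count = total_bytes // block_size
    return f"dd if={source} of={target} bs={block_size} count={count}"


if __name__ == "__main__":
    raw_lv = "/dev/rtestlv"  # hypothetical raw logical volume name
    for size in buffer_sizes():
        print(dd_command("/dev/zero", raw_lv, size))  # write pass
        print(dd_command(raw_lv, "/dev/null", size))  # read pass
```

Doubling from 1 KB to 32 MB gives 16 buffer sizes per queue depth setting.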

Some sample charts are shown below:

dd summary. queue_depth=1

dd tests, queue depth=4

dd tests, queue depth=16

As can be seen, the RAM-SAN 520's low latency allowed it to reach its maximum transfer rate with a queue depth of 1 and relatively small I/O buffer sizes, while the spinning disks, with their higher latency, required both a higher queue depth and larger I/O buffer sizes in order to achieve their maximum rates. At queue depth 4 and beyond, the Sun T3 was able to achieve slightly higher write rates than the RAM-SAN when using buffer sizes of 1MB or more. At queue depth 8, the T3 was able to achieve a rate of 80MB/s at buffer sizes > 16MB, and at queue depth 16, the T3 was able to achieve 81.9 MB/s using 16MB buffers, and to achieve a peak rate of 83.1 MB/s, slightly higher than the RAM-SAN peak read rate of 80.3 MB/s.

These tests were all run on the IBM S80, which had an Emulex LP7000 FC adapter, and used the IBM version of the driver. Interestingly, slightly higher rates were achieved on a less powerful machine, the IBM F50, using an Emulex LP8000 FC adapter, and using the drivers supplied by Emulex:

dd summary, IBM F50


(Note: we tried moving the LP8000 adapter to the S80 to see if we could reproduce the higher rates on the S80, but ran into problems with the AIX driver and were unable to perform the tests.)

The highest read rates for all systems tested were achieved on the SGI IP27:

dd summary, SGI IP27


Mirroring Test Results

Since the RAM-SAN 520 does not have any built-in persistent backup, we wanted to explore the possibility of using the AIX logical volume mirroring feature to pair the RAM-SAN 520 with the Sun T3, so that the primary physical partition (PP) for each logical volume would be placed on the RAM-SAN 520, and the secondary partition would be placed on the Sun T3.

AIX Logical Volume Storage Overview

The AIX operating system supports a hierarchy of structures for managing disk storage. Every fixed disk (called a physical volume, or pdisk for short) is assigned to a volume group, which is composed of 1 to 128 physical volumes. Within a volume group, space is allocated by dividing the volume group into equal-sized chunks of space called physical partitions, usually abbreviated as PPs. The PP size varies depending on the total amount of space in the volume group; typically, for volumes smaller than 300MB, a PP size of 2 megabytes is used. The PP size increases by power-of-2 multiples; for a volume group composed of very large disk drives, a PP size of 256MB is a typical value.

Within a volume group, one or more logical volumes can be defined. Logical volumes are composed of one or more logical partitions, usually abbreviated as LPs. Each LP is composed of one, two, or three PPs, depending on the mirroring strategy to be used for the logical volume, discussed in more detail below. Logical volumes can be extended, but not decreased, by using the extendlv command, or via the System Management Interface Tool (SMIT).

Addressing within a logical volume is sequential, starting at 0; however, space in a logical volume is not necessarily physically contiguous. The logical volume manager (LVM) is responsible for converting logical addresses into physical addresses when applications issue reads and writes to the logical volumes.
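The address conversion described above can be pictured with a small Python sketch. It is a simplified model, not the LVM's actual data structures: the table layout, the disk names, and the function name locate are all assumptions made for illustration, and mirroring is ignored (one PP per LP).

```python
from typing import List, Tuple

MIB = 1024 * 1024


def locate(logical_offset: int, pp_size: int,
           lp_map: List[Tuple[str, int]]) -> Tuple[str, int, int]:
    """Map a byte offset in a logical volume to (disk, PP number, offset in PP).

    lp_map[i] names the physical partition that backs logical partition i.
    """
    if logical_offset < 0:
        raise ValueError("offset must be non-negative")
    lp_index, offset_in_pp = divmod(logical_offset, pp_size)
    if lp_index >= len(lp_map):
        raise ValueError("offset is beyond the end of the logical volume")
    disk, pp_number = lp_map[lp_index]
    return disk, pp_number, offset_in_pp


if __name__ == "__main__":
    # Hypothetical layout: three LPs, not physically contiguous.
    layout = [("hdisk1", 40), ("hdisk1", 41), ("hdisk2", 7)]
    print(locate(5 * MIB, 2 * MIB, layout))
```

In this layout, byte offset 5 MB with a 2 MB PP size falls 1 MB into the third logical partition, which happens to live on a different disk than the first two.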

Filesystems are created upon logical volumes. Since logical volumes can be extended, so can filesystems. The most common type of filesystem in AIX is the journaled filesystem (JFS). Journaled filesystems contain a record of transactions that allows the filesystem to be synchronized in the event of a system crash, power failure, etc. Filesystems consist of fragment-sized pages, typically 4K, which are allocated to files as they are written. These pages may or may not be contiguous with other pages that belong to the file. The filesystem layer of AIX also provides for using system memory for caching file pages.

Since HPSS uses raw logical volumes for its storage, no filesystem I/O test results are presented here.

Logical Volume Mirroring Overview

In order to increase reliability, AIX LVM provides the ability to create logical volumes that are mirrored. Mirrored logical volumes can contain either one or two extra copies of each partition, called the secondary and tertiary copies, in addition to the main, or primary copy. There are several policies which are used to control the layout of the partitions on physical disks within the volume group:

sequential: in this mode, writes are performed to the mirrors in the order primary, secondary, tertiary. The write to one physical partition completes before the write operation is started on the next one. For reads, the primary copy is read; if unsuccessful, the next copy is read, and, if successful, an attempt is made to rewrite the failed block using hardware relocation.

parallel: in this mode, writes to all physical partitions in a logical partition are started at the same time. The operation completes when the last physical write (to the partition that takes the longest to complete) finishes. For reads, I/O is balanced amongst the mirrors by issuing the read to the first non-busy mirror in the order primary, secondary, tertiary. If all mirrors are busy, the read is queued to the device with the smallest number of outstanding I/Os.

parallel write/sequential read: in this mode, writes are initiated concurrently, and reads are issued to the primary mirror, as in the sequential mode.

parallel write/round-robin read: in this mode, writes are initiated concurrently, and reads are issued alternately to each mirror. This results in equal utilization of the mirrors for reads, even if there is never more than one I/O in-progress at a time.
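The four scheduling policies above can be summarized in a small Python model. This is an illustrative sketch only, not AIX code: the policy identifiers and function names are ours, the tie-break toward the earlier mirror when all are equally busy is an assumption, and the failure fallback for sequential reads is left out.

```python
from typing import Sequence

POLICIES = (
    "sequential",
    "parallel",
    "parallel_write_sequential_read",
    "parallel_write_round_robin_read",
)


def write_time(policy: str, mirror_times: Sequence[float]) -> float:
    """Time for one mirrored write, given each copy's individual write time."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if policy == "sequential":
        return sum(mirror_times)  # each copy waits for the previous one
    return max(mirror_times)      # copies start together; slowest one gates


def pick_read_mirror(policy: str, outstanding: Sequence[int], turn: int = 0) -> int:
    """Choose a mirror index (0 = primary) for a read.

    outstanding[i] is the number of I/Os already queued on mirror i;
    turn is a running read counter used only by the round-robin policy.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if policy in ("sequential", "parallel_write_sequential_read"):
        return 0
    if policy == "parallel_write_round_robin_read":
        return turn % len(outstanding)
    for index, pending in enumerate(outstanding):
        if pending == 0:  # first non-busy mirror in primary-first order
            return index
    # All busy: queue to the mirror with the fewest outstanding I/Os.
    return min(range(len(outstanding)), key=lambda i: outstanding[i])


if __name__ == "__main__":
    print(write_time("sequential", [1.0, 4.0]), write_time("parallel", [1.0, 4.0]))
    print([pick_read_mirror("parallel_write_round_robin_read", [0, 0], t) for t in range(4)])
```

The model makes one property easy to see: with a fast primary and a slow secondary, a parallel write is still gated by the slower copy, while only the read side can take advantage of the faster device.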

AIX Striped Logical Volumes

Since AIX does not support mirroring in conjunction with striped logical volumes, and since it would not be useful for a site to use the RAM-SAN 520 without some sort of mirrored backup, we did not perform any striped logical volume testing of the RAM-SAN 520.

RAM-SAN 520 Mirrored Logical Volume Test Results

We tested the AIX mirroring capability by using DD to write to and read from a logical volume with 2 copies (primary and secondary) of each physical partition.  The tests were first run using a logical volume created with range=minimum, so that all PPs for each mirror copy resided on a single physical drive, as shown in the charts below:

Mirror results for RANGE=MINIMUM


Write performance was uniformly poor, regardless of the scheduling policy used, achieving less than 50% of the rates for the non-mirrored tests described earlier in this report.  The most likely cause of this degradation is the setting of the Mirror Write Consistency (MWC) policy, which requires extra bookkeeping overhead, as described above. The MWC policy option is enabled by default (and it would be rare for a site to disable it in a real production environment); we left it enabled for all of the mirrored LV testing that we performed.  Although it would be interesting to determine the performance degradation that occurs when this policy is in effect, we did not do so, due to time constraints.

Parallel reads, on the other hand,  showed a substantial improvement over the non-mirrored tests.  The "round-robin" scheduling method achieved over 120MB/s, compared to ~80MB/s for reads using the sequential scheduling policy:

mirror tests 2 PVs, reads

Mirror tests - DD reads - RANGE=MINIMUM

We also performed a set of tests with range=maximum, causing the PPs to be spread across the drives as shown in the figures below. Again, write performance was poor compared to the non-mirrored case, while read performance was better. Note that read performance did not improve as much for this case as for the range=minimum case; most likely this was due to adapter contention involved in accessing data for PPs 2 and 5, since both copies of these PPs resided on the T3, and both T3 physical drives shared a single FC interface to the AIX hosts.

Mirror Tests - range=maximum - write test

Mirror Tests - range=maximum - read test

Here is a summary comparison of the mirrored vs non-mirrored results:

Mirrored vs non-mirrored results

HPSS Sequential I/O Test Results

In order to measure the performance of the RAM-SAN 520 as an HPSS data device, we ran a series of tests, using the S80's single Fibre Channel connection to the RAM-SAN. For these tests, both the HPSS mover and the HSI client application resided on the S80. HSI (Hierarchical Storage Interface) is a popular client interface used to transfer files to/from HPSS systems. It features double-buffered I/O (using a buffer pool mechanism when doing HPSS striped I/O), automatic support of HPSS parallel transfer mechanisms, and the ability to dynamically vary the I/O buffer size used when transferring files. In general, HSI provides better transfer rates than the HPSS Parallel FTP client, and also provides a number of other powerful features, such as recursion for transferring entire file trees, and for changing permissions or ownership of files.

There are a number of factors that influence transfer rates when using HPSS. For small files, the transfer rate can be dominated simply by the time that it takes to initiate I/O from a client application and propagate the I/O request through the Bitfile Server, Storage Server, and eventually to the mover. For writes, additional time is taken up by space allocation. Other factors include the Virtual Volume block size (VV Block Size) that is defined for the Storage Class to which the device is assigned, the mover's buffer size, the client's buffer size, the transport mechanism used, and the network options used when transferring data between machines via TCP/IP.

For HPSS releases prior to HPSS 4.1.4, disk devices are divided into a maximum of 16384 allocation units, called "Virtual Volume Blocks". This results in a tradeoff when configuring very large devices: either a small number of devices with very large VV Block Sizes can be configured, or, subject to operating system limitations, the device can be carved into a (potentially large) number of small devices (logical volumes on AIX). This tradeoff is important both when considering the size of files that will be stored, and when determining optimal buffer sizes for transfers. If large VV block sizes are used, but small files are stored, then large amounts of disk space can be wasted.

For file transfers, the tradeoff comes into play primarily when HPSS striped I/O is being used. In this case, each mover in the stripe will transfer consecutively numbered VV blocks, and will transfer at most one VV block per exchange with the client. If the VV block size is too small, then the overhead of the message exchanges can adversely affect performance. If the VV block size is too large, then both client and mover require very large buffers in order to transfer data in parallel.
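To make the 16384-block limit concrete, here is a rough Python sketch of the arithmetic. It rests on two assumptions that are ours rather than the report's: the 64 GB device size is a hypothetical example, and the waste estimate assumes that a file occupies whole VV blocks, which simplifies how HPSS actually allocates disk storage segments.

```python
MAX_VV_BLOCKS = 16384  # limit quoted for HPSS releases prior to 4.1.4
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def min_vv_block_size(device_bytes: int) -> int:
    """Smallest VV block size that covers the device within the block limit."""
    return -(-device_bytes // MAX_VV_BLOCKS)  # ceiling division


def wasted_bytes(file_bytes: int, vv_block_size: int) -> int:
    """Unused space in a file's last VV block (assumes whole-block allocation)."""
    remainder = file_bytes % vv_block_size
    return 0 if remainder == 0 else vv_block_size - remainder


if __name__ == "__main__":
    block = min_vv_block_size(64 * GIB)  # hypothetical 64 GB logical device
    print(block // MIB, "MB minimum VV block size")
    print(wasted_bytes(512 * KIB, block) // KIB, "KB unused for a 512 KB file")
```

Under these assumptions, a single 64 GB device needs VV blocks of at least 4 MB, and a 512 KB file stored in such a block leaves 3.5 MB of it unused, which is the small-file penalty the text describes.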

We ran an extensive series of tests, varying the mover buffer size from 1 to 16 megabytes, in increments of 1 megabyte, and varying the HSI client buffer size from 512K to 32MB. For these tests, the Texas Memory box was configured with a 1-way striped storage class, using a 4MB VV block size. The HSI client buffer size was initially set to 512K, then increased to 1MB, and increased in units of 1 MB until it reached 32MB. All tests used shared memory buffers within the S80 for the transfers. For each test, a set of files ranging in size from 512K to 1 GB was written to and read from HPSS, using /dev/zero as the source device when writing to HPSS, and /dev/null as the sink device when reading from HPSS. Each test was run 3 times, and the results averaged. Some sample results are shown below:

HSI Put (1MB HSI Buffer) results

HSI Get (1 MB HSI Buffer, 1-8 MB mover buffers)

When the mover buffer size is less than the 4MB VV block size used for the storage class, the mover will issue 4MB/n data exchange messages with the client for each VV block, where n is the mover buffer size. For the cases above, where the HSI buffer size was fixed at 1MB, peak rates were obtained when files > 128MB were transferred, both for writes and reads. Interestingly, mover buffer sizes of 3x (x=1MB) appear to degrade write transfer rates on the 2MB and 4MB files. It is unknown whether this is an HPSS, client, OS, or FC driver issue.
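The exchange count per VV block can be worked through with a short Python sketch. The report gives only the 4MB/n figure; how a mover handles a buffer size that does not divide the VV block evenly is not stated, so the sketch assumes the final exchange simply carries the remainder. The function name is hypothetical.

```python
MIB = 1024 * 1024


def exchange_sizes(vv_block_size: int, mover_buffer_size: int) -> list:
    """Split one VV block into per-exchange transfer sizes.

    Assumption: each exchange moves at most one mover buffer, and the last
    exchange moves whatever is left of the VV block.
    """
    if vv_block_size <= 0 or mover_buffer_size <= 0:
        raise ValueError("sizes must be positive")
    sizes = []
    remaining = vv_block_size
    while remaining > 0:
        chunk = min(mover_buffer_size, remaining)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


if __name__ == "__main__":
    for n in range(1, 9):
        chunks = exchange_sizes(4 * MIB, n * MIB)
        print(f"{n} MB mover buffer: {len(chunks)} exchange(s),",
              [c // MIB for c in chunks], "MB each")
```

Under this assumption, a 3 MB mover buffer splits a 4 MB VV block unevenly into a 3 MB and a 1 MB exchange, while 1 MB, 2 MB, and 4 MB buffers divide it evenly. Whether that unevenness is related to the degradation noted above is not established by the tests.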

When the HSI buffer size was increased to 4MB, write performance did not seem to be affected, except for file sizes >= 32MB; for these sizes, odd-size mover buffers again caused degradation.

HSI Put (4MB HSI Buffer) 1-8 MB Mover Buffer

HSI Get (4 MB HSI Buffer, 1-8 MB Mover Buffer)

Conventional wisdom dictates that better transfer rates should be achieved with larger client buffer sizes; however, as can be seen below, this was not the case:

HSI Get (4 MB HSI Buffer, 1-8 MB Mover Buffer)

HSI 32MB read test

To look at it another way, here is a chart showing transfer rates using a fixed (4MB) mover buffer size and varying the client buffer size. Best rates were achieved when the client buffer size matched the 4MB mover buffer size, which also matched the VV block size.

HSI Put 4mb mvrbuf, varying HSI buf

As a comparison between DD rates and HSI rates, here are two charts showing the DD rates achieved while writing and reading a 1 GB file, using various block sizes, and the equivalent HSI rates for puts and gets of a 1 GB HPSS file. For these charts, fixed HPSS mover buffer sizes were used (4 MB and 16 MB), and the HSI client buffer size was varied from 512KB to 32MB.

HPSS and HSI were able to achieve nearly the same rates as the raw DD transfers for both reading and writing, although writes were not quite as close to the actual device rate as reads. This is most likely due to the latency involved in HPSS's current space allocation algorithms, which require coordination between the Bitfile Server, Storage Server, and SFS for each disk storage segment.

HSI vs DD summary

HSI vs DD summary


HPSS Random I/O Test Results

Since the RAM-SAN 520 has essentially zero latency, we ran a series of tests to compare random HPSS I/O on the RAM-SAN and on both the IBM SSA and Sun T3. For these tests, a simple HPSS client API program was written, which wrote a 512MB test file, then performed a series of 1000 random seeks and random-sized reads.
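The access pattern of the test program can be approximated with the Python sketch below. It is not the HPSS client API program used in the tests: it works on an ordinary local file, and the 1 MB upper bound on read size and the optional seed are assumptions, since the report does not state how read sizes were chosen.

```python
import random


def random_read_plan(file_size, operations=1000, max_read=1024 * 1024, seed=None):
    """Return a list of (offset, length) pairs that stay inside the file."""
    rng = random.Random(seed)
    plan = []
    for _ in range(operations):
        offset = rng.randrange(file_size)
        # Never read past the end of the file.
        length = rng.randint(1, min(max_read, file_size - offset))
        plan.append((offset, length))
    return plan


def run_plan(path, plan):
    """Perform the seeks and reads against a local file; return bytes read."""
    total = 0
    with open(path, "rb") as handle:
        for offset, length in plan:
            handle.seek(offset)
            total += len(handle.read(length))
    return total


if __name__ == "__main__":
    plan = random_read_plan(512 * 1024 * 1024, seed=1)
    print(len(plan), "operations planned")
```

Generating the plan up front, with a fixed seed, lets the same sequence of seeks and reads be replayed against each device so that the timings are comparable.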

We ran this test using 1, 2, 5 and 10 processes. The test results are shown below:

Random I/O 1-process test results

random I/O summary

Somewhat surprisingly, for a single copy of the test program, the random I/O test took approximately the same time to complete, regardless of which device was used to contain the HPSS data files. When 5 concurrent processes were run, the SSA disk took approximately 22% longer to complete the tests than the other two devices. When 10 concurrent processes were run, the Sun T3 took approximately 10% longer than the RAM-SAN, while the SSA took 3 times as long as the RAM-SAN.

Our conclusion is that the latency for issuing I/O within HPSS was the main factor in the tests involving a small number of processes. The RAM-SAN's inherent transactional superiority did not come into play in these tests until a large number of concurrent processes were active.

Since HPSS also supports random I/O on direct-to-tape Classes of Service, we also ran the test using an STK 9840 drive for the HPSS data file, primarily to verify that it works with this (relatively new at the time of the test) device type:

random I/O summary


Conclusions

The Texas Memory RAM-SAN 520 proved to be easy to install, simple to configure, and easy to reconfigure as needed to change partition sizes, shift logical unit assignments, etc. After TMS personnel quickly resolved a problem with using the device on an AIX platform, the software and hardware performed flawlessly throughout our testing.

Our testing showed that using AIX logical volume mirroring, as a means of compensating for the lack of persistent backup within the device, caused severe write performance degradation, most likely due to the extra bookkeeping required when the Mirror Write Consistency policy is enabled. On the other hand, read performance improved substantially when parallel scheduling was enabled.

HPSS tests did not reveal any particular strengths or weaknesses that could be directly attributed to the use of the RAM-SAN as a data device. HPSS and the HSI client program were able to achieve between 94% and 96% of the raw device transfer rates for writing, and between 94.8% and 98.9% of the raw device rates for reading.

Some interesting interactions between VV block size, mover buffer size, and client buffer size were revealed when the sequential transfer test suite was run. In particular, the sharp performance drop for 32MB files, for particular buffer combinations, would be a useful topic for further investigation.

Finally, while the RAM-SAN 520 is probably not cost-effective as an HPSS data device, it appears to be ideal for environments where data is written relatively infrequently compared to the number of times it is read, or where relatively small chunks of data are written at a time (as opposed to the HPSS large-file streaming environment). In addition, its nearly zero seek latency and high transaction rate make it ideal for applications such as web servers, online reservation systems, realtime data acquisition systems, or applications requiring high speed random access to data.