
A series of tests was run on the RAM-SAN; the tests are described in the following
sections.
The commands to partition the memory, and to assign partitions to particular ports, were clearly explained with good examples, and worked as documented - configuring the device for the tests took only a short time. In addition, as the tests were being run, reconfiguring the device was also straightforward and easy to do.
At the time that the RAM-SAN was installed in the PROBE testbed, it had never been tested with the IBM AIX operating system. When the device was installed, it initially did not work: AIX was unable to configure the device. With the help of Texas Memory (TMS) personnel, we enabled SCSI command tracing for the port (via the RAM-SAN console), and TMS personnel quickly determined that the problem was a SCSI command that had not yet been implemented in the microcode. Within a short time, they provided a patch and a Solaris-based utility to download the patch into the RAM-SAN microcode, and this fixed the problem. By using a console command, we were able to save the patched microcode in flash memory, so that it was not necessary to reload the patch for the duration of the testing, either at ORNL or at the NERSC PROBE facility.
Once this initial problem was resolved, the RAM-SAN performed flawlessly for the remainder of the testing, at both PROBE facilities. The stability of both the hardware and disk emulation software was very impressive. One feature of the RAM-SAN which came in very handy during the testing was its ability to provide a realtime display of the data transfer activity for each port.
Our tests used DD to write 1 GB of data onto a raw logical volume on each of the devices, using /dev/zero as the local device, and then to read 1 GB of data back from the device, writing to /dev/null. The tests varied the queue depth of the device from 1 to 16, and varied the DD buffer size from 1 KB (1K=1024) to 32 MB, by doubling the buffer size for each test. It was observed that increasing the queue depth beyond 16 did not improve the I/O rates, so we stopped at that point.
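The buffer-size sweep described above can be sketched as follows. This is an illustration only, not the script used for the tests: the device path /dev/rtestlv is a hypothetical raw logical volume name, and the queue depth is assumed to be set on the device separately before each pass.

```python
GIB = 1024 ** 3  # 1 GB as used in this report (1K = 1024)

def buffer_sizes(start=1024, stop=32 * 1024 * 1024):
    """Yield DD buffer sizes from 1 KB to 32 MB, doubling each time."""
    size = start
    while size <= stop:
        yield size
        size *= 2

def dd_commands(raw_device, total_bytes=GIB):
    """Build (buffer size, write command, read command) for each buffer size.

    raw_device is a hypothetical path; substitute the real raw logical volume.
    """
    commands = []
    for bs in buffer_sizes():
        count = total_bytes // bs  # number of blocks needed to move 1 GB
        write_cmd = f"dd if=/dev/zero of={raw_device} bs={bs} count={count}"
        read_cmd = f"dd if={raw_device} of=/dev/null bs={bs} count={count}"
        commands.append((bs, write_cmd, read_cmd))
    return commands

if __name__ == "__main__":
    for bs, write_cmd, read_cmd in dd_commands("/dev/rtestlv"):
        print(write_cmd)
        print(read_cmd)
```

Doubling from 1 KB to 32 MB gives 16 buffer sizes per queue depth setting.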
Some sample charts are shown below:
As can be seen, the RAM-SAN 520's low latency allowed it to reach its maximum transfer rate with a queue depth of 1 and relatively small I/O buffer sizes, while the spinning disks, with their higher latency, required both a higher queue depth and larger I/O buffer sizes in order to achieve their maximum rates. At queue depth 4 and beyond, the Sun T3 was able to achieve slightly higher write rates than the RAM-SAN when using buffer sizes of 1 MB or more. At queue depth 8, the T3 achieved a rate of 80 MB/s at buffer sizes larger than 16 MB. At queue depth 16, the T3 achieved 81.9 MB/s using 16 MB buffers, and reached a peak rate of 83.1 MB/s, slightly higher than the RAM-SAN peak read rate of 80.3 MB/s.
These tests were all run on the IBM S80, which had an Emulex LP7000 FC adapter,
and used the IBM version of the driver. Interestingly, slightly higher rates
were achieved on a less powerful machine, the IBM F50, using an Emulex LP8000
FC adapter, and using the drivers supplied by Emulex:
The highest read rates for all systems tested were achieved on the SGI IP27:
Since the RAM-SAN 520 does not have any built-in persistent backup, we wanted to explore the possibility of using the AIX logical volume mirroring feature to pair the RAM-SAN 520 with the Sun T3, so that the primary physical partition (PP) for each logical volume would be placed on the RAM-SAN 520, and the secondary partition would be placed on the Sun T3.
The AIX operating system supports a hierarchy of structures for managing disk
storage. Every fixed disk (called a physical volume, or pdisk for short) is assigned to a volume group, which is composed
of 1 to 128 physical volumes. Within a volume group, space is allocated
by dividing the volume group into equal-sized chunks of space called physical
partitions, usually abbreviated as PPs. The PP size
varies depending on the total amount of space in the volume group; typically,
for volumes smaller than 300MB, a PP size of 2 megabytes is used. The PP size
increases by power-of-2 multiples; for a volume group composed of very large disk
drives, a PP size of 256MB is a typical value.
Within a volume group, one or more logical volumes can be defined. Logical volumes are composed of one or more logical partitions, usually abbreviated as LPs. Each LP is composed of one, two, or three PPs, depending on the mirroring strategy to be used for the logical volume, discussed in more detail below. Logical volumes can be extended, but not decreased, by using the extendlv command, or via the System Management Interface Tool (SMIT).
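As a rough illustration of the LP/PP relationship, the following sketch computes how many logical and physical partitions a logical volume consumes for a given PP size and number of mirror copies. The function name and the example sizes are hypothetical and are not taken from the test configuration.

```python
import math

def lv_partitions(lv_size_mb, pp_size_mb, copies=1):
    """Return (logical partitions, physical partitions) needed for an LV.

    Each LP maps to one PP per copy, so a mirrored LV with 2 copies
    consumes twice as many PPs as an unmirrored one.
    """
    if copies not in (1, 2, 3):
        raise ValueError("an LP is composed of one, two, or three PPs")
    lps = math.ceil(lv_size_mb / pp_size_mb)  # space is allocated in whole PPs
    return lps, lps * copies
```

For example, a 1024 MB logical volume with a 256 MB PP size and two copies needs 4 LPs and 8 PPs.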
Addressing within a logical volume is sequential, starting at 0; however, space in a logical volume is not necessarily physically contiguous. The logical volume manager (LVM) is responsible for converting logical addresses into physical addresses when applications issue reads and writes to the logical volumes.
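The address conversion can be pictured with a small sketch. This is a simplified model, not the actual LVM implementation: it assumes a per-LV table (here called lp_map, a hypothetical name) that records which disk and which PP number holds the primary copy of each LP, and it ignores any reserved areas on the physical volume.

```python
def translate(logical_addr, pp_size, lp_map):
    """Map a logical byte address to (disk, byte offset on that disk).

    lp_map[i] is a (disk name, PP number) pair for the primary copy of LP i.
    """
    lp_index, offset_in_pp = divmod(logical_addr, pp_size)
    disk, pp_number = lp_map[lp_index]
    return disk, pp_number * pp_size + offset_in_pp

# Hypothetical layout: two LPs that are adjacent logically but not physically.
PP_SIZE = 4 * 1024 * 1024
example_map = [("hdisk1", 10), ("hdisk2", 3)]
```

With this layout, logical address PP_SIZE + 100 falls in the second LP and therefore resolves to hdisk2, even though the preceding byte resides on hdisk1.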
Filesystems are created upon logical volumes. Since logical volumes can be extended, so can filesystems. The most common type of filesystem in AIX is the journaled filesystem (JFS). Journaled filesystems contain a record of transactions that allows the filesystem to be synchronized in the event of a system crash, power failure, etc. Filesystems consist of fragment-sized pages, typically 4K, which are allocated to files as they are written. These pages may or may not be contiguous with other pages that belong to the file. The filesystem layer of AIX also provides for using system memory for caching file pages.
Since HPSS uses raw logical volumes for its storage, no filesystem I/O test
results are presented here.
In order to increase reliability, AIX LVM provides the ability to create logical
volumes that are mirrored. Mirrored logical volumes can contain either one or two
extra copies of each partition, called the secondary and
tertiary copies, in addition to the main, or primary copy.
There are several policies which are used to control the layout of the partitions
on physical disks within the volume group:
AIX provides two configuration options to control inter-disk allocation:
- The range option is used to control the number of disks used for
each copy of the logical volume. The settings for this option are minimum and maximum.
- The strict option is used to determine whether creating or extending a logical volume will succeed or fail if two or more copies of a physical partition must occupy the same physical volume. If strict interdisk policy is in effect, then the operation will fail if each physical partition cannot be allocated on a separate physical volume.
In conjunction with the above options, AIX provides fine-grained control to allow specific physical drives to be used for each logical volume, and even provides the ability to specify a map file to control assignment of specific PPs on individual drives for each LP copy.
The range and strict options are used when determining how
the physical partitions are to be allocated amongst the physical volumes
within the volume group. For example, for a non-mirrored logical volume,
if range=minimum, and free space is available on a single physical drive,
then all the PPs for the LV can be assigned to that physical drive.
If space is not available on a single drive, then 2 or more physical drives
may be used to contain the LV. If range=maximum is specified,
then each PP is assigned to the next drive with free space in the volume group,
in round-robin fashion. For mirrored logical volumes, allocation is more
complicated - first, the primary copy of the LV is created as described above,
then the secondary copy is created, with the additional constraint (assuming
strict inter-disk allocation is in effect), that each PP of the secondary
copy must be created on a different physical disk than the corresponding PP of
the primary copy. In practice, this usually means that the round-robin
allocation process for the secondary mirror starts on the pdisk following the
pdisk used for the primary copy. If a tertiary mirror is also created,
then each PP of the tertiary must be allocated on a pdisk that contains neither
the primary nor secondary copy of the PP.
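The allocation behavior described above can be approximated with the following sketch. It is a simplified model under stated assumptions, not the AIX allocator: disks are considered in a fixed order, free space is counted in whole PPs, strict allocation is always in effect, and the names allocate_copy and allocate_mirrored are hypothetical.

```python
def allocate_copy(num_pps, free, policy, forbidden, start=0):
    """Place one copy of an LV; returns the disk chosen for each PP.

    free maps disk name -> free PP count and is updated in place.
    forbidden[i] is the set of disks already holding a copy of PP i.
    """
    disks = list(free)
    placement = []
    cursor = start
    for i in range(num_pps):
        # minimum: always prefer the earliest disk; maximum: round-robin.
        order = disks if policy == "minimum" else disks[cursor:] + disks[:cursor]
        choice = next(
            (d for d in order if free[d] > 0 and d not in forbidden[i]), None
        )
        if choice is None:
            raise RuntimeError("strict allocation cannot be satisfied")
        free[choice] -= 1
        placement.append(choice)
        if policy == "maximum":
            cursor = (disks.index(choice) + 1) % len(disks)
    return placement

def allocate_mirrored(num_lps, free, policy, copies=2):
    """Allocate primary, secondary (and tertiary) copies in turn."""
    free = dict(free)  # work on a copy of the caller's free-space table
    forbidden = [set() for _ in range(num_lps)]
    result = []
    for c in range(copies):
        placement = allocate_copy(num_lps, free, policy, forbidden, start=c % len(free))
        for i, disk in enumerate(placement):
            forbidden[i].add(disk)
        result.append(placement)
    return result
```

With three disks and range=maximum, the secondary copy in this model starts one disk after the primary, so no PP shares a disk with its mirror; with range=minimum, each copy is packed onto a single disk when space allows.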
sequential: in this mode, writes are performed to the mirrors in the order primary, secondary, tertiary. The write to one physical partition completes before the write operation is started on the next one. For reads, the primary copy is read; if unsuccessful, the next copy is read, and, if that read is successful, an attempt is made to rewrite the failed block using hardware relocation.
parallel: in this mode, writes to all physical partitions in a logical partition are started at the same time. The operation completes when the last physical write (to the partition that takes the longest to complete) finishes. For reads, I/O is balanced amongst the mirrors by issuing the read to the first non-busy mirror in the order primary, secondary, tertiary. If all mirrors are busy, the read is queued to the device with the smallest number of outstanding I/Os.
parallel write/sequential read: in this mode, writes are initiated concurrently, and reads are issued to the primary mirror, as in the sequential mode.
parallel write/round-robin read: in this mode, writes are initiated concurrently, and reads are issued alternately to each mirror. This results in equal utilization of the mirrors for reads, even if there is never more than one I/O in-progress at a time.
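The difference between these scheduling modes can be illustrated with a short model. This is a sketch of the behavior as described above, not AIX code; the policy strings and the class name ReadScheduler are hypothetical, and error recovery for failed reads is omitted.

```python
def write_time(policy, latencies):
    """Time for one mirrored write, given per-mirror write latencies."""
    if policy == "sequential":
        return sum(latencies)   # each write waits for the previous one
    return max(latencies)       # concurrent writes finish with the slowest

class ReadScheduler:
    """Choose which mirror (0 = primary, 1 = secondary, ...) serves a read."""

    def __init__(self, policy, num_mirrors):
        self.policy = policy
        self.num_mirrors = num_mirrors
        self._next = 0  # used only by the round-robin policy

    def pick(self, outstanding):
        """outstanding[i] is the number of I/Os queued on mirror i."""
        if self.policy in ("sequential", "parallel_write_sequential_read"):
            return 0
        if self.policy == "parallel":
            for mirror, queued in enumerate(outstanding):
                if queued == 0:
                    return mirror  # first non-busy mirror
            # all busy: queue to the mirror with the fewest outstanding I/Os
            return min(range(len(outstanding)), key=lambda m: outstanding[m])
        if self.policy == "parallel_write_round_robin_read":
            mirror = self._next
            self._next = (self._next + 1) % self.num_mirrors
            return mirror
        raise ValueError(f"unknown policy: {self.policy}")
```

In this model a sequential write to two mirrors costs the sum of both latencies, which is one reason a slower secondary device can limit write throughput, while round-robin reads alternate between mirrors even when only one I/O is in progress at a time.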
We tested the AIX mirroring capability by using DD to write to and read from
a logical volume with 2 copies (primary and secondary) of each physical partition.
The tests were first run using a logical volume created with range=minimum,
so that all PPs for each mirror copy resided on a single physical drive, as shown in the charts below:
Parallel reads, on the other hand, showed a substantial improvement over the
non-mirrored tests. The "round-robin" scheduling method achieved over 120MB/s,
compared to ~80MB/s for reads using the sequential scheduling policy:
We also performed a set of tests with range=maximum, causing the
PPs to be spread across the drives as shown in the figures below. Again, write
performance was poor compared to the non-mirrored case, while read performance was better.
Note that read performance did not improve as much for this case as for the range=minimum case;
most likely this was due to adapter contention involved in accessing data for PPs 2 and 5, since
both copies of these PPs resided on the T3, and both T3 physical drives shared a single
FC interface to the AIX hosts.
A number of factors influence transfer rates when using HPSS. For small files, the transfer rate can be dominated simply by the time that it takes to initiate I/O from a client application and propagate the I/O request through the Bitfile Server, Storage Server, and eventually to the mover. For writes, additional overhead is incurred by the time it takes for space allocation. Other factors include the Virtual Volume block size (VV Block Size) that is defined for the Storage Class to which the device is assigned, the mover's buffer size, the client's buffer size, the transport mechanism used, and the network options used when transferring data between machines via TCP/IP.
Within HPSS, for releases prior to HPSS 4.1.4, disk devices are divided into a maximum of 16384 allocation units, called "Virtual Volume Blocks". This results in a tradeoff when configuring very large devices: either a small number of devices with very large VV Block Sizes can be configured, or, subject to operating system limitations, the device can be carved into a (potentially large) number of small devices (logical volumes on AIX). This tradeoff is important both when considering the size of files that will be stored, and when determining optimal buffer sizes for transfers. If large VV block sizes are used, but small files are stored, then large amounts of disk space can be wasted.
For file transfers, the tradeoff comes into play primarily when HPSS striped I/O is being used. In this case, each mover in the stripe will transfer consecutively numbered VV blocks, and will transfer at most one VV block per exchange with the client. If the VV block size is too small, then the overhead of the message exchanges can adversely affect performance. If the VV block size is too large, then both client and mover require very large buffers in order to transfer data in parallel.
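The arithmetic behind this tradeoff can be sketched as follows. The 16384-block limit is the figure quoted above; the assumption that space is consumed in whole VV blocks is a simplification made for illustration, and the function names and example sizes are hypothetical.

```python
MAX_VV_BLOCKS = 16384  # limit quoted above for HPSS releases prior to 4.1.4

def min_vv_block_size(device_bytes):
    """Smallest VV block size that covers the device within the block limit."""
    return -(-device_bytes // MAX_VV_BLOCKS)  # ceiling division

def wasted_bytes(file_bytes, vv_block_bytes):
    """Unused space if a file occupies whole VV blocks (simplifying assumption)."""
    blocks = -(-file_bytes // vv_block_bytes)
    return blocks * vv_block_bytes - file_bytes

MIB = 1024 ** 2
GIB = 1024 ** 3
```

Under these assumptions, a single 64 GB device would need a VV block size of at least 4 MB, and a 512 KB file stored with 4 MB VV blocks would leave 3.5 MB of its block unused.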
We ran an extensive series of tests, varying the mover buffer size from 1 to 16 megabytes, in increments of 1 megabyte, and varying the HSI client buffer size from 512K to 32MB. For these tests, the Texas Memory box was configured with a 1-way striped storage class, using a 4MB VV block size. The HSI client buffer size was initially set to 512K, then increased to 1MB, and increased in units of 1 MB until it reached 32MB. All tests used shared memory buffers within the S80 for the transfers. For each test, a set of files ranging in size from 512K to 1 GB was written to and read from HPSS, using /dev/zero as the source device when writing to HPSS, and /dev/null as the sink device when reading from HPSS. Each test was run 3 times, and the results averaged. Some sample results are shown below:
When the HSI buffer size was increased to 4MB, write performance did not seem to be affected, except for file sizes >= 32MB; for these sizes, odd-size mover buffers again caused degradation.
Conventional wisdom dictates that better transfer rates should be achieved with larger client buffer sizes; however, as can be seen below, this was not the case:
To look at it another way, here is a chart showing transfer rates using a fixed (4MB) mover buffer size and varying the client buffer size. Best rates were achieved when the client buffer size matched the 4MB mover buffer size, which also matched the VV block size.

As a comparison between DD rates and HSI rates, here are two charts showing the DD rates achieved while writing and reading a 1 GB file, using various block sizes, and the equivalent HSI rates for puts and gets of a 1 GB HPSS file. For these charts, fixed HPSS mover buffer sizes were used (4 MB and 16 MB), and the HSI client buffer size was varied from 512KB to 32MB.
HPSS and HSI were able to achieve nearly the same rates as the raw DD transfers for both reading and writing, although writes were not quite as close to the actual device rate as reads. This is most likely due to the latency involved in HPSS's current space allocation algorithms, which require coordination between the Bitfile Server, Storage Server, and SFS for each disk storage segment.
Since the RAM-SAN 520 has essentially zero latency, we ran a series of tests to compare random HPSS I/O on the RAM-SAN and on both the IBM SSA and Sun T3. For these tests, a simple HPSS client API program was written, which wrote a 512MB test file, then performed a series of 1000 random seeks and random-sized reads.
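The access pattern of such a test can be sketched with ordinary file I/O. This is a stand-in only: the actual test program used the HPSS client API, which is not shown here, and the function name, the maximum read size, and the fixed random seed are assumptions made for illustration.

```python
import os
import random

def random_read_test(path, num_ops=1000, max_read=1024 * 1024, seed=0):
    """Perform num_ops random seeks, each followed by a random-sized read.

    Returns the total number of bytes actually read. Reads that run past
    the end of the file are simply short, as with ordinary file I/O.
    """
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    size = os.path.getsize(path)
    total = 0
    with open(path, "rb") as f:
        for _ in range(num_ops):
            offset = rng.randrange(size)       # random seek target
            length = rng.randint(1, max_read)  # random read size
            f.seek(offset)
            total += len(f.read(length))
    return total
```

Running several copies of such a program concurrently against the same file approximates the multi-process case described in the results.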
We ran this test using 1, 2, 5 and 10 processes. The test results are shown below:
Somewhat surprisingly, for a single copy of the test program, the random I/O test took approximately the same time to complete, regardless of which device was used to contain the HPSS data files. When 5 concurrent processes were run, the SSA disk took approximately 22% longer to complete the tests than the other two devices. When 10 concurrent processes were run, the Sun T3 took approximately 10% longer than the RAM-SAN, while the SSA took 3 times as long as the RAM-SAN.
Our conclusion is that the latency for issuing I/O within HPSS was the main factor in the tests involving a small number of processes. The RAM-SAN's inherent transactional superiority did not come into play in these tests until a large number of concurrent processes were active.
Since HPSS also supports random I/O on direct-to-tape Classes of Service, we also ran the test using an STK 9840 drive for the HPSS data file, primarily to verify that it works with this (relatively new at the time of the test) device type:
Our testing showed that using AIX logical volume mirroring, as a means of compensating for the lack of persistent backup within the device, caused severe write performance degradation, most likely due to the extra bookkeeping required when the Mirror Write Consistency policy is enabled. On the other hand, read performance improved substantially when parallel scheduling was enabled.
HPSS tests did not reveal any particular strengths or weaknesses that could be directly attributed to the use of the RAM-SAN as a data device. HPSS and the HSI client program were able to achieve between 94% and 96% of the raw device transfer rates for writing, and between 94.8% and 98.9% of the raw device rates for reading.
Some interesting interactions between VV block size, mover buffer size, and client buffer size were revealed when the sequential transfer test suite was run. In particular, the sharp performance drop for 32MB files, for particular buffer combinations, would be a useful topic for further investigation.
Finally, while the RAM-SAN 520 is probably not a cost-effective device as an HPSS data device, it appears to be ideal for environments where data is written relatively infrequently compared to the number of times it is read, or where relatively small chunks of data are written at a time (as opposed to the HPSS large-file streaming environment). In addition, its nearly zero seek latency and high transaction rate make it ideal for applications such as web servers, online reservation systems, realtime data acquisition systems, or applications requiring high speed random access to data.