Improve ORNL-NERSC effective network bandwidth, Experimental
When we began, production transfer rates between the sites averaged roughly 250 kilobytes/second. FTP transfer rates were improved to an average of 4 megabytes/second (limited chiefly by network contention), with a peak of over 9 megabytes/second (limited by FDDI links at both ends). The lessons learned can be applied to a production environment, and they spawned the spinoff project.
Testing was done with ftp transfers of large files over ESnet between Probe hosts at ORNL and NERSC. The ORNL host was an IBM RS/6000 model S80 (stingray); the NERSC host was an IBM RS/6000 model F50 (swift). At the time, stingray had a Gigabit Ethernet interface, but swift had only a 10/100Mbps Ethernet interface. In addition, although an OC3 link (155Mbps) existed between the sites, both sites were routing all network traffic through FDDI backbones (100Mbps). The theoretical maximum throughput was therefore never greater than 100Mbps, or about 10MB/sec.
Preliminary tests with ftp showed that, no matter how fast the network, file transfer speeds never exceeded about 250KB/sec. This was found to be due to the TCP socket buffer sizes used by ftp. In theory, for TCP transfers over long distances, which introduce a significant delay, socket buffer sizes should be:
buffer size = (nominal bandwidth) * (round trip delay)
Measurements with "ping" showed that the round trip delay between ORNL and NERSC was approximately 60ms. Assuming a nominal bandwidth of 10MB/sec:
buffer size = (10MB/sec) * (0.060 sec) = 600KB
The socket buffer must be larger on networks with long delays in order to "keep the pipeline full" while the sender waits for ACKs from the receiving end. This is explained in RFC 1323.
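The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of the original tests; the function names are made up for this example. The second function assumes that a sender can have at most one buffer's worth of data in flight per round trip, and the 16KB window is the AIX default buffer size discussed in this report.

```python
def bandwidth_delay_product(bandwidth_bytes_per_sec: float, rtt_sec: float) -> float:
    """Buffer size (bytes) needed to keep a link of this speed and delay full."""
    return bandwidth_bytes_per_sec * rtt_sec


def window_limited_rate(window_bytes: float, rtt_sec: float) -> float:
    """Upper bound on throughput (bytes/sec) when at most one window
    of data can be in flight per round trip."""
    return window_bytes / rtt_sec


if __name__ == "__main__":
    rtt = 0.060  # round trip delay measured with ping, in seconds
    needed = bandwidth_delay_product(10e6, rtt)    # 10MB/sec nominal bandwidth
    ceiling = window_limited_rate(16 * 1024, rtt)  # 16KB default buffer
    print(f"buffer needed: {needed / 1e3:.0f} KB")
    print(f"ceiling with a 16KB window: {ceiling / 1e3:.0f} KB/sec")
```

Under these assumptions a 16KB window over a 60ms round trip cannot carry more than about 270KB/sec, which is close to the 250KB/sec ceiling observed in the preliminary tests.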
On AIX systems, the default TCP socket buffer sizes (both send and receive) are 16KB. This is far below the optimum calculated above. There are three methods of adjusting these buffer sizes:
(1) They can be changed with the "no" command. For example:
no -o tcp_sendspace=262144
no -o tcp_recvspace=262144
no -o rfc1323=1
This would set the buffer sizes to 256KB. The "rfc1323" option is a boolean which must be set to 1 to allow buffer sizes larger than 64KB.
This method will change the default buffer sizes for all network operations on the host. It is not really desirable to set very large buffer sizes as system-wide defaults.
(2) They can be changed for a selected network interface. For example:
ifconfig en1 tcp_sendspace 262144
ifconfig en1 tcp_recvspace 262144
ifconfig en1 rfc1323 1
This would set larger buffer sizes just for interface en1. This method could be appropriate if, for example, one network interface on a host was reserved strictly for high-speed file transfers. It would be undesirable to do this for a more general-purpose network interface.
(3) They can be changed by individual programs wishing to use TCP buffers larger than the system defaults. Such programs would have to include logic to call the "setsockopt" system call for setting the proper socket options to enable larger buffers. This solution seems most appropriate in the long term, but is also the most difficult to implement.
In all cases, both the sending and receiving hosts and/or applications must cooperate in their choices of TCP buffer sizes. Increasing the buffer sizes on one side only would do no good.
The trouble with solution (3) is that standard file transfer utilities such as ftp do not include any such logic to adjust buffer sizes. As an experiment, ftp client and ftp daemon source code was obtained, and modified to add buffer size selection logic.
The code was modified to call setsockopt with the SO_SNDBUF or SO_RCVBUF options, as appropriate, to set send or receive socket buffer sizes. On AIX systems, another call to setsockopt must be made with the TCP_RFC1323 option to enable buffer sizes larger than 64KB. The TCP_RFC1323 option is specific to AIX; other operating systems may have other options to allow large buffers.
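The modified ftp source itself is not reproduced here. As a rough illustration of the same sequence of calls, the following Python sketch requests larger buffers on a new TCP socket. The function name is hypothetical. Python's socket module may not expose a TCP_RFC1323 constant at all, so that step is attempted only when the constant is present; on other systems it is skipped. The operating system may also adjust or cap the sizes requested.

```python
import socket


def make_tuned_socket(buffer_bytes: int) -> socket.socket:
    """Create a TCP socket and request send/receive buffers of buffer_bytes.

    The options are set before the socket is connected, since large
    windows generally have to be negotiated when the connection opens.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # AIX-specific step from the text: enable RFC 1323 so that buffers
    # larger than 64KB can be used. Assumption: the constant is only
    # present on platforms that define it, so look it up defensively.
    rfc1323 = getattr(socket, "TCP_RFC1323", None)
    if rfc1323 is not None:
        sock.setsockopt(socket.IPPROTO_TCP, rfc1323, 1)

    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


if __name__ == "__main__":
    s = make_tuned_socket(256 * 1024)
    # Report what the kernel actually granted, which may differ from the request.
    print("send buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    print("recv buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    s.close()
```

Reading the values back with getsockopt is a simple way to check whether the system honored the request.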
Once the modified ftp and ftpd were rebuilt, the modified ftpd was installed on stingray, and the modified ftp client was used on swift to transfer large files between the two hosts. Buffer sizes from 128KB up to 2MB were tried. In general, once buffer sizes got above 512KB, file transfer rates rose to around 5MB/sec, with peaks hitting 9MB/sec (close to the theoretical limit).
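For comparison with these measurements, the ceiling that each buffer size would allow can be estimated. The sketch below is an idealized model, not a description of the tests: it assumes one buffer of data in flight per 60ms round trip and a 10MB/sec link limit, and it ignores contention and packet loss. The function name is made up for this example.

```python
def predicted_ceiling(buffer_bytes: float, rtt_sec: float,
                      link_bytes_per_sec: float) -> float:
    """Best-case throughput (bytes/sec): limited either by the buffer
    (one buffer per round trip) or by the link, whichever is lower."""
    return min(buffer_bytes / rtt_sec, link_bytes_per_sec)


if __name__ == "__main__":
    rtt = 0.060   # seconds, from the ping measurement
    link = 10e6   # bytes/sec, nominal FDDI-limited bandwidth
    for kb in (128, 256, 512, 1024, 2048):
        rate = predicted_ceiling(kb * 1024, rtt, link)
        print(f"{kb:5d}KB buffer -> at most {rate / 1e6:.1f} MB/sec")
```

In this model the buffer stops being the limiting factor somewhere between 512KB and 1MB, which is consistent with the 600KB estimate calculated earlier.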
With transfer rates this high, several problems became evident:
When a transfer approaches a certain speed, network contention or other random factors may cause some packets to be dropped. When this happens, TCP essentially halts the transfer, "drains the pipe" of all packets currently in transit, and then resumes the remainder of the transfer at a very slow rate. The transfer speed increases linearly as TCP slowly ramps back up, trying to avoid more dropped packets. As a result, the average transfer rate for one large file drops sharply. As yet, no good solution for this problem has been found.
It was discovered that these high transfer rates tended to use a large percentage of the bandwidth of the FDDI network backbones at both ORNL and NERSC. This meant not only that other network users could disrupt the Probe file transfer measurements, but also that network managers at both sites began complaining that the tests were swamping their networks.
Due to both of the above factors, it was very hard to find a truly optimum TCP buffer size for file transfers. Above a certain point, disruptions in the file transfers caused the overall transfer rates to fluctuate enough that, while the rates were still interesting "ballpark" numbers, they could not really be compared accurately.
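The cost of the recovery described in the first problem above can be estimated with a simplified model. The sketch below is not a measurement from these tests. It assumes textbook TCP behavior after a timeout: the window restarts at one segment, doubles each round trip until it reaches half its former size, and then grows by one segment per round trip. The 1460-byte segment size is also an assumption, and the function name is made up for this example.

```python
def rtts_to_recover(full_window_segments: int) -> int:
    """Count the round trips needed for the congestion window to climb
    from one segment back to full_window_segments after a timeout."""
    ssthresh = max(full_window_segments // 2, 2)
    cwnd = 1
    rtts = 0
    while cwnd < full_window_segments:
        if cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: double each round trip
        else:
            cwnd += 1                       # linear growth: one segment per round trip
        rtts += 1
    return rtts


if __name__ == "__main__":
    rtt = 0.060               # seconds, from the ping measurement
    segment = 1460            # bytes per segment (assumed)
    window = 600_000 // segment  # segments needed to fill the 600KB pipe
    n = rtts_to_recover(window)
    print(f"{n} round trips, about {n * rtt:.1f} seconds to regain full speed")
```

Under these assumptions a single loss costs a couple of hundred round trips before the full rate is regained, which helps explain why the averages for large files fluctuated so much.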
These ftp tests were not intended to be formal or exhaustive. They were intended to reveal the potential performance of a standard file transfer utility which could take advantage of larger TCP socket buffers, and they did that. Despite the erratic results, it was shown that ftp transfer rates could be increased from the usual 250KB/sec to over 5MB/sec.