Why is the speed at which data is transferred over networks seemingly not keeping up with the touted improvements in the hardware through which this data travels? We have optical networks limited only by the speed of light, 10GbE switches, and multiple, hyperthreaded GHz processors inside our computers. Yet it is a verifiable fact that today's applications are having trouble delivering the needed throughput even as network bandwidth increases and hardware--both inside and outside our own computers--increases in capacity and power. Here at Oak Ridge National Laboratory we are particularly interested in improving bulk transfers over high-bandwidth, high-latency networks because of our involvement in storage and in the transfer of data for cutting-edge scientific applications. It is important to remember, however, that aggressive tools such as the ones described in this review, including parallel TCP, are intended for high-speed, dedicated links or links over which quality of service is available--not for general use over the broader Internet!
2.1. Tuning TCP
ORNL has looked at many ways to address this problem. Extensive tests have been conducted in an attempt to pinpoint problems and suggest possible solutions for the "fixable" problems found. We are partners in the NET100 / WEB100 project, which is an effort to look "inside" the kernel by expanding the kernel instrumentation set and making these kernel variables available for reading/tuning through the Linux /proc file interface. We have written and modified a number of tools which make use of this newly accessible information--the WAD, iperf100, ttcp100, WEBD, TRACED, and a web100-enabled bandwidth tester. We have also simulated TCP using the TCP-over-UDP test harness, atou, so that we might see the effects of changes to TCP proposed by Sally Floyd and others. The claim is that TCP, which has been the workhorse of the past, is a part of the problem in today's high-bandwidth, high-latency networks, primarily because some of the very algorithms which have served it well in the past are unable to keep the pipe full when the bandwidth is high, round-trip times are long, and there is often re-ordering/loss in the path.
Slow Start is the way TCP initiates data flow across a connection. Congestion Avoidance is the way TCP deals with lost packets. Congestion Avoidance and Slow Start require that two variables be kept for each connection--a congestion window (cwnd) and a slow start threshold size (ssthresh).
Slow Start operates by observing that the rate at which new packets should be injected into the network is the rate at which the acknowledgments are returned by the other end. TCP increases cwnd by one segment for each ACK received, which doubles cwnd every round trip, until a timeout occurs or duplicate ACKs are received indicating a lost packet. Because this growth is clocked by returning ACKs, much more time elapses before the capacity of the path is reached when round-trip times are long.
Upon receiving 3 duplicate ACKs, cwnd is halved and that value is stored in ssthresh. If there is a timeout, cwnd is reset to 1 or 2 segments. If cwnd is less than or equal to ssthresh, TCP is in Slow Start; else Congestion Avoidance takes over and cwnd is incremented by 1/cwnd for each ACK. This is an additive increase, compared to Slow Start's exponential increase. Again, a flow is penalized when round-trip times are long and the large capacity of the path may never be reached.
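The penalty that long round-trip times impose on both phases can be illustrated with a small simulation. The following Python sketch is not taken from any TCP implementation: the function name rtts_to_reach, the 1Gb/s path and the 1500-byte segment size are assumptions chosen for illustration, and the 67ms RTT is borrowed from the ping results reported later in this review. It counts loss-free round trips until cwnd reaches a target, doubling cwnd per round trip in Slow Start and adding one segment per round trip in Congestion Avoidance.

```python
import math


def rtts_to_reach(target_segments, ssthresh, initial_cwnd=1):
    """Count loss-free round trips until cwnd reaches target_segments."""
    cwnd = float(initial_cwnd)
    rtts = 0
    while cwnd < target_segments:
        if cwnd <= ssthresh:
            # Slow Start: +1 segment per ACK, so cwnd doubles each round trip.
            cwnd *= 2
        else:
            # Congestion Avoidance: +1/cwnd per ACK, about +1 segment per round trip.
            cwnd += 1
        rtts += 1
    return rtts


# Assumed path (illustrative only): 1 Gb/s, 67 ms RTT, 1500-byte segments.
RTT_S = 0.067
pipe_segments = math.ceil(1e9 * RTT_S / 8 / 1500)

slow_start_only = rtts_to_reach(pipe_segments, ssthresh=float("inf"))
with_threshold = rtts_to_reach(pipe_segments, ssthresh=pipe_segments // 2)
print(pipe_segments, slow_start_only, with_threshold, with_threshold * RTT_S)
```

Under these assumptions the pipe holds about 5,584 segments. Slow Start alone reaches that in 13 round trips, but with ssthresh at half the pipe the additive phase needs roughly 1,500 round trips, or about 100 seconds at a 67ms RTT.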
Some researchers have concluded that the solution requires a new, UDP-like protocol. NETBLT, a transport level protocol, has been proposed to permit the transfer of very large amounts of data between two clients. NETBLT is designed to have features which minimize the effects of packet loss, delays over satellite links, and network congestion.
NETBLT is a bulk data transfer protocol described in RFC 998 and proposed in the 1985-87 timeframe as a transport level protocol, having been assigned the official protocol number of 30. NETBLT is included in the background section because at least two of the newer UDP protocols have acknowledged their debt to the design of NETBLT and the design of a third is clearly based on the NETBLT ideas. However, the resistance to new kernel-level protocols plus the lengthy approval process seems to have influenced the authors of the new UDP protocols to implement their designs at the application level.
NETBLT was designed specifically for high-bandwidth, high-latency networks including satellite channels. NETBLT differs from TCP in that it uses a rate-based flow control scheme rather than TCP's window-based flow control. The rate control parameters are negotiated during the connection initialization and periodically throughout the connection. The sender uses timers rather than ACKS to maintain the negotiated rate. Since the overhead of timing mechanisms on a per packet basis can lower performance, NETBLT's rate control consists of a burst size and a burst rate with burst_size/burst_rate equal to the average transmission time per packet. Both size and rate should be based on a combination of the capacities of the end points as well as that of the intermediate gateways and networks. NETBLT separates error control and flow control so that losses and retransmissions do not affect the flow rate. NETBLT uses a system of timers to ensure reliability in delivery of control messages and both sender and receiver send/receive control messages.
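The idea of pacing by bursts rather than by individual packets can be sketched as follows. This is a minimal Python illustration, not NETBLT code: the function name burst_schedule and the parameter burst_interval_s (seconds between the start of successive bursts) are assumptions made here and do not reproduce the field definitions or units of RFC 998. The point is only that the sender needs one timer per burst instead of one per packet.

```python
def burst_schedule(total_packets, burst_size, burst_interval_s):
    """Return (start_time_s, packets_in_burst) pairs for a burst-paced transfer."""
    schedule = []
    sent = 0
    burst_index = 0
    while sent < total_packets:
        # The final burst may be shorter than burst_size.
        count = min(burst_size, total_packets - sent)
        # One timer expiry per burst, not per packet.
        schedule.append((burst_index * burst_interval_s, count))
        sent += count
        burst_index += 1
    return schedule


def average_time_per_packet(burst_size, burst_interval_s):
    """Average spacing between packets implied by the burst parameters."""
    return burst_interval_s / burst_size


print(burst_schedule(10, 4, 0.5))
print(average_time_per_packet(4, 0.5))
```

With these hypothetical values, ten packets go out in three bursts and the sender wakes up three times rather than ten.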
RFC 998 gives the following explanation of the protocol: "the sending client loads a buffer of data and calls down to the NETBLT layer to transfer it. The NETBLT layer breaks the buffer up into packets and sends these packets across the network in datagrams. The receiving NETBLT layer loads these packets into a matching buffer provided by the receiving client. When the last packet in the buffer has arrived, the receiving NETBLT checks to see that all packets in that buffer have been correctly received. If some packets are missing, the receiving NETBLT requests that they be resent. When the buffer has been completely transmitted, the receiving client is notified by its NETBLT layer. The receiving client disposes of the buffer and provides a new buffer to receive more data. The receiving NETBLT notifies the sender that the new buffer is ready and the sender prepares and sends the next buffer in the same manner."
As described, the NETBLT protocol is "lock-step". However, a multiple buffering capability together with various timeout/retransmit algorithms give rise to the claim that NETBLT gets good performance over long-delay channels without impairing performance over high-speed LANs. NETBLT, however, is not widely implemented.
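The lock-step, buffer-at-a-time exchange quoted from RFC 998 can be sketched in a few lines. The Python below is a toy model under stated assumptions, not an implementation of NETBLT: the names send_buffer and transfer are hypothetical, the network is replaced by a random drop with probability loss_rate, and timers, rate control and multiple buffering are all left out. It shows only the loop in which the receiver checks a buffer, asks for the missing packets, and accepts the next buffer once the current one is complete.

```python
import random


def send_buffer(packets, loss_rate, rng):
    """Deliver one buffer; resend only the packets the receiver reports missing."""
    received = {}
    missing = list(range(len(packets)))
    rounds = 0
    while missing:
        rounds += 1
        for seq in missing:
            if rng.random() >= loss_rate:  # packet survives the simulated network
                received[seq] = packets[seq]
        # Receiver checks the buffer and requests whatever is still absent.
        missing = [seq for seq in range(len(packets)) if seq not in received]
    return [received[seq] for seq in range(len(packets))], rounds


def transfer(data, buffer_size, packet_size, loss_rate, seed=0):
    """Move data one buffer at a time, in lock-step."""
    rng = random.Random(seed)
    output = bytearray()
    for start in range(0, len(data), buffer_size):
        buf = data[start:start + buffer_size]
        packets = [buf[i:i + packet_size] for i in range(0, len(buf), packet_size)]
        delivered, _ = send_buffer(packets, loss_rate, rng)
        output += b"".join(delivered)  # receiving client disposes of the buffer
    return bytes(output)


payload = bytes(range(256)) * 40
assert transfer(payload, buffer_size=4096, packet_size=512, loss_rate=0.2) == payload
```

Because each buffer must be complete before the next one starts, every retransmission round in this model costs at least one round trip, which is why the real protocol adds multiple buffering.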
3. Recent UDP application level protocols
The following protocols have not yet been described in an RFC, as far as could be determined, so the overview of each is taken from papers or documentation produced by the authors of the protocols. In some cases, where descriptions did not seem clear or documentation was sketchy, we did look at the code for further clarification. These protocols are still in transition. This means, among other things, that the details which follow are a snapshot in time and not a final picture.
SABUL, RBUDP, FOBS and TSUNAMI all address the problem of limited throughput over high-speed, high-bandwidth networks with proposals which are similar in concept but slightly different in implementation. RBUDP (QUANTA) will not be evaluated at this time. QUANTA has some very interesting ideas, such as forward error correction, but has not implemented this and does not yet do file transfers as it is still in the very early stages of development.
3.1. Use of TCP/UDP channel(s)
All three proposals use one or more TCP connections for sending/receiving various control information and one UDP connection for sending/receiving the actual data. Once the connections are established, the control packets are sent from receiver to sender for SABUL and TSUNAMI, but control packets are sent both ways in FOBS. For all three, the data is sent only one way--from sender to receiver.
3.2. Rate-Control Algorithm
Rate control is seen as a way to control the burstiness so often observed in TCP flows. This burstiness may cause losses as router queues suddenly fill up and packets are dropped or network interface cards cannot keep up. Also, rate control allows a flow to more quickly fill a pipe without going through the initial ramping-up process characteristic of TCP. Rate control or inter-packet delay, which adjusts to packet loss and/or network congestion as reported by the receiver, has been added in some form to all the UDP protocols to counter the charges of unfair use of capacity and potential to create network problems.
TSUNAMI gives the user the ability to initialize many parameters, including UDP buffer size, tolerated error rate, sending rate and slowdown/speedup factors, with the 'set' command. If the user does not set the sending rate, however, it starts out at 1000Mbs with a default tolerated loss rate of 7.9%. Since a block of file data (default 32768) is read and handed to UDP/IP, the rate control is actually implemented per block rather than per packet. The receiver uses the combination of the number of packets received (a multiple of 50) and a timed interval (>350ms) since the last update to determine when to send a REQUEST_ERROR_RATE packet containing a smoothed error rate. If the error rate is greater than the maximum tolerated rate, the sending rate is decreased; if it is less, the sending rate is increased.
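The threshold rule just described can be written down compactly. The Python sketch below is not TSUNAMI's code: the function name adjust_rate and the slowdown and speedup values are hypothetical placeholders for the user-settable factors, and only the 7.9% default tolerance and the 1000Mbs starting rate come from the description above. Leaving the rate unchanged when the error rate exactly equals the tolerance is also an assumption.

```python
def adjust_rate(rate_mbs, error_rate, max_error_rate=0.079,
                slowdown=0.8, speedup=1.1):
    """Return the next sending rate given the receiver's smoothed error rate.

    slowdown and speedup are hypothetical multiplicative factors.
    """
    if error_rate > max_error_rate:
        return rate_mbs * slowdown   # too many errors: back off
    if error_rate < max_error_rate:
        return rate_mbs * speedup    # errors tolerable: push harder
    return rate_mbs                  # assumed: no change at exactly the limit


rate = 1000.0  # default starting rate in Mbs
for reported in (0.20, 0.12, 0.05, 0.01):
    rate = adjust_rate(rate, reported)
    print(reported, round(rate, 1))
```

Note that with a tolerance as high as 7.9% the sender keeps speeding up through loss rates that would make a TCP flow collapse, which is part of why these tools are meant for dedicated links.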
FOBS asks for the local and remote network interface card speed, which it uses to determine a maximum beginning rate. The default tolerated loss rate is 1%. FOBS calculates a table of rates during sender initialization, linking these rates to a network state machine. After a segment of data (about 10000 1466-byte packets) has been transferred, the sender requests an update from the receiver. The packet loss reported by the receiver is then used to calculate the current bandwidth. The current bandwidth is compared against the pre-calculated table values to determine the current state of the network and pull a corresponding rate from the table.
SABUL begins with a preset IPD (inter-packet delay) of 10 usec which it converts to CPU cycles. The receiver generates a SYN packet based on a timed interval (200ms) which signals the sender to use both the number of lost packets and the number of packets--including retransmits--sent since the last SYN time to calculate a current loss rate. This loss rate is then input to a weighted moving average formula to give a history-rate. If the history-rate is greater than a preset limit (0.001), the IPD is increased; if less than the limit, the IPD is decreased; if equal, 0.1 is added. In a former release, SABUL attempted to keep the loss rate between an upper and lower limit. The latest implementation is similar in concept to TSUNAMI's in that both keep the delay between blocks/packets between an upper and lower limit.
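A rough model of this feedback loop is given below. It is a Python sketch under several assumptions and should not be read as SABUL's actual formula: the function names, the smoothing weight and the increase/decrease factors are all hypothetical, and we assume that the 0.1 added in the "equal" case is added to the IPD. Only the 0.001 limit and the 10 usec starting IPD come from the description above.

```python
def update_history_rate(history_rate, lost, sent, weight=0.5):
    """Blend the latest period's loss rate into a weighted moving average.

    weight is a hypothetical smoothing factor applied to the old history.
    """
    loss_rate = lost / sent if sent else 0.0
    return weight * history_rate + (1.0 - weight) * loss_rate


def adjust_ipd(ipd, history_rate, limit=0.001, increase=1.25, decrease=0.9):
    """Return the next inter-packet delay (same units as ipd)."""
    if history_rate > limit:
        return ipd * increase   # too much loss: spread packets out
    if history_rate < limit:
        return ipd * decrease   # little loss: send faster
    return ipd + 0.1            # assumed reading of "if equal, 0.1 is added"


ipd = 10.0       # preset starting IPD in usec
history = 0.0
for lost, sent in ((0, 1000), (20, 1000), (5, 1000), (0, 1000)):
    history = update_history_rate(history, lost, sent)
    ipd = adjust_ipd(ipd, history)
    print(lost, sent, round(history, 5), round(ipd, 3))
```

The moving average means a single bad period keeps the IPD elevated for several SYN intervals, even after losses have stopped.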
SABUL is the only one of the three to implement the IPD (inter-packet delay) between individual packets as opposed to groups of packets. The delay is implemented by repeated calls to rdtsc() until the requisite number of clock cycles have passed. FOBS checks the sending rate after a burst of packets (25) and implements the delay with a gettimeofday calculation until it is time to send the next burst. TSUNAMI uses a timed select() to implement the delay between blocks of data.
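The busy-wait style of delay can be illustrated as follows. This Python sketch is only an analogy: it polls time.perf_counter_ns() where SABUL polls the CPU cycle counter, and the names spin_wait_ns and send_paced, as well as the send callback, are hypothetical. Spinning gives much finer timing than a sleep or timed select(), at the price of keeping a CPU fully busy for the whole transfer.

```python
import time


def spin_wait_ns(delay_ns):
    """Busy-wait by polling a high-resolution counter; return elapsed ns."""
    start = time.perf_counter_ns()
    while True:
        elapsed = time.perf_counter_ns() - start
        if elapsed >= delay_ns:
            return elapsed


def send_paced(packets, ipd_ns, send):
    """Call send() for each packet, spinning for ipd_ns after each one."""
    for packet in packets:
        send(packet)
        spin_wait_ns(ipd_ns)


sent = []
send_paced([b"p0", b"p1", b"p2"], ipd_ns=10_000, send=sent.append)
print(len(sent))
```

A per-burst or per-block delay, as in FOBS and TSUNAMI, needs far fewer timing calls but lets the packets inside each burst leave back to back.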
4. Test Results and Analysis
conditions of the tests
The tests were run morning and afternoon over a period of several months. The results varied widely and so the ones included are members of best/worst case scenarios. Tests were also run using NISTNET for a sanity check since we did not have access to simulator code for the protocols.
characteristics of the computers at each end
On the ORNL side, firebird is a Dell Precision 330 single-processor host running a 2.4.20-web100 linux kernel with a SysKonnect Gigabit Ethernet network interface card connected to a Fast Ethernet network(CCS 192 LAN). Firebird has a 1.4GHz Intel Pentium IV processor, a 24GB Maxtor hard drive and 512MB of memory. Tests indicate speeds of approximately 4400Mbs+ writing to and 12000Mbs+ reading from disk.
net100.lbl.gov is a single processor host running a 2.4.10-web100 Linux kernel with a NetGear GA630 Gigabit Ethernet interface card connected to a Fast Ethernet network. net100.lbl has an AMD Athlon(tm) 4 Processor stepping 02 running at 1.4GHz and 256MB of memory. Tests indicate speeds of approximately 73.276Mbps+ writing to and 47.9Mbps+ reading from disk.
characteristics of network between
The results of ping tests:
ping -c 10 18.104.22.168
round-trip min/avg/max/mdev = 67.209/67.490/67.657/0.324 ms
ffowler@net100(145)>ping -c 10 22.214.171.124
rtt min/avg/max/mdev = 67.287/67.748/68.219/0.333 ms
Running traceroute shows use of the ESnet OC48 backbone with basic route symmetry although variations can occur.
ORNL to LBL:
Application      Rate(Mbs)    Loss/Retrans/PktsSent/PktsRecv/Dups
BEST CASE
*QUANTA          358.28          73/     73/ 137742/ 137669/0
FOBS             317.17        2816/   6720/ 143147/ 138436/2009
SABUL            343.49          15/     15/ 140015/ 140005/5
**TSUNAMI        307.23          55/    259/  13361/  13102/204
***iperf         489.0            0/      0/ 145499/ 145499/0
iperf100 -P 3    279.7            6/      6/ 171303/ 171299/2
LBL to ORNL:
WORST CASE
*QUANTA          310.37        4415/   4415/ 142157/ 137742/0
FOBS             256.17        2170/   6103/ 142530/ 138484/2057
SABUL            151.76       47838/  44895/ 184895/ 142441/2441
**TSUNAMI        243.92        1689/   1689/  14963/  13330/228
***iperf         420.0            3/      0/ 142846/ 142843/0
iperf100 -P 3    229.8            5/      7/ 145491/ 145484/2
*QUANTA, as shipped, does not adjust the rate even if there are losses!
**TSUNAMI was giving 16384 byte blocks to IP and letting IP fragment
***iperf does not buffer nor retransmit losses and so should be faster
TSUNAMI does not provide a memory-to-memory transfer option so the file accesses--fseek, fread, fwrite--were commented out for this test.
Application      Rate(Mbs)    Lost/NewDataSent/PktsRcvd/Dups
BEST CASE
FOBS+            45.01         2668/ 146421/ 151624/5203
SABUL            52.14         5463/ 146220/ 178297/32077
TSUNAMI*         47.17        44400/ 146419/ 181840/35421
GRIDFTP          33.79        (4224.00KB/sec) 3 streams
WORST CASE
FOBS             28.52         3503/ 146421/ 147787/1366
SABUL            37.25         4200/ 146220/ 216482/70262
TSUNAMI*          8.2--had to Ctrl/C
GRIDFTP          13.12        (1638.31KB/sec) 1 stream
*TSUNAMI was sending 1466 byte blocks
+FOBS would have achieved 243.4Mbs if not counting the wait for file writes!!
QUANTA does not have a file transfer option at this time
File transfers to pcgiga at CERN give a longer RTT with the following results:
Application      Rate(Mbs)    Lost/Retrans/PktsSent/PktsRcvd/Dups
BEST CASE
iperf            469.0            0/      0/ 147563/ 147563/0
FOBS             247.78          78/  14188/ 160609/ 155792/9371
SABUL            250.88           3/      3/ 146223/ 146222/2
TSUNAMI          270.74           0/   4980/ 146419/ 146901/482
WORST CASE
FOBS             153.03          23/  14245/ 160666/ 156133/9712
*SABUL           139.52        1708/  91856/ 238076/ 191202/44982
TSUNAMI            4.30--had to Ctrl/C
*SABUL had 4 EXP events in which all packets "in flight" are assumed lost
4.3. Tests using a LAN, a private network and NISTNET
To validate results obtained over the broader Internet, where conditions are unpredictable and constantly changing, tests were performed under more controlled conditions using NISTNET. NISTNET is running on an old, slow Gateway machine, viper, with 64MB of memory and two 100Mbs Network Interface cards. One NIC is connected into a NETGEAR Fast Ethernet Switch and one into a local area network. The other two machines involved are dual processor PCs previously used in a cluster; each has 512MB of memory and a 100Mbs Network Interface card. Pinto is connected into the same local area network and pinto10 is connected into the NETGEAR Switch. Conditions are not completely controlled in the local area network, but are observed to be mostly stable with rare exceptions.
With NISTNET configuration:
pinto.ccs.ornl.gov pinto10 --delay 35.000
pinto10 pinto.ccs.ornl.gov --delay 35.000
pings from pinto to pinto10 gave the following:
[ffowler@pinto ffowler]$ ping -c 3 10.10.10.2
round-trip min/avg/max = 70.5/70.7/71.0 ms
[ffowler@pinto10 ffowler]$ ping -c 3 126.96.36.199
round-trip min/avg/max = 70.6/70.6/70.6 ms
4.3.1. NISTNET Memory-to-Memory Transfers
Application        Rate              Lost/Retrans/PktsSent/Dups
iperf(TCP) -P 3    78.8 Mbits/sec        7/      7/ 150571/0
iperf(UDP)         90.5 Mbits/sec        0/      0/ 153849/0
FOBS               80.8 Mbits/sec        1/    143/ 136570/142
QUANTA             89.9 Mbits/sec        0/      0/ 137742/0
SABUL              87.1 Mbits/sec    37480/  17843/ 157823/2832
*TSUNAMI           83.1 Mbits/sec        0/      1/  13103/1
*1 16K block is about 12 packets
The results of 200MByte(+/-) memory-to-memory tests with NISTNET configuration:
pinto.ccs.ornl.gov pinto10 --delay 35.000 --drop 0.0107
pinto10 pinto.ccs.ornl.gov --delay 35.000 --drop 0.0107
Application        Rate              Lost/Retrans/PktsSent/Dups
iperf(TCP) -P 3    44.9 Mbits/sec       20/     20/ 159662/0
iperf(UDP)         90.4 Mbits/sec       17/      0/ 153849/0
FOBS               79.3 Mbits/sec       16/   1215/ 137642/1199
QUANTA             87.5 Mbits/sec       15/     15/ 137727/0
SABUL              83.5 Mbits/sec    47744/  14172/ 154172/4230
*TSUNAMI           82.5 Mbits/sec       18/     18/  13151/49
*18 16K blocks are about 12*18 = 216 packets
 13151 16K blocks are about 12*13151 = 157812 packets
 49 16K blocks are about 12*49 = 588 packets
4.3.2. NISTNET File Transfers
For real file transfers over the NISTNET testbed from pinto to pinto10, NISTNET was configured as above. The file transferred is 214649928 bytes. The drop rate--a little over .01%--consistently results in 16 to 20 packets being dropped. This is confirmed by the results from FOBS, which reported 17 lost packets, iperf-UDP, which reported 16, and iperf-TCP, which reported 20.
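As a sanity check on those counts, the configured drop percentage can be multiplied by the number of packets sent. The Python below is simple arithmetic rather than a model of NISTNET; the function name expected_drops is hypothetical, and it assumes each packet is dropped independently with the configured probability. The packet counts are the PktsSent values reported for iperf in the memory-to-memory table above.

```python
def expected_drops(packets_sent, drop_percent):
    """Expected number of dropped packets for a given drop percentage."""
    return packets_sent * drop_percent / 100.0


DROP_PERCENT = 0.0107  # NISTNET --drop setting used in these tests

for label, packets in (("iperf(UDP)", 153849), ("iperf(TCP) -P 3", 159662)):
    print(label, round(expected_drops(packets, DROP_PERCENT), 1))
```

Both expectations come out between 16 and 18 packets, which agrees with the observed range of 16 to 20 once random variation is allowed for.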
A summary of the results from file transfers using NISTNET is shown in the table below:
Application    Rate    Lost/Retrans/PktsSent/PktsRecd/Dups
FOBS*          71.4        17/   4754/ 151175/ 150537/4116
SABUL**        87.2     11544/  19911/ 166131/ 148317/2097
TSUNAMI***     33.2    181210/ 207317/ 354961/ 176184/29765
*FOBS sends 1466 byte data packets
**SABUL sends 1468 byte data packets
***TSUNAMI was configured for a blocksize of 1466 bytes so comparisons could be made with the other two protocols but, as shown in the tables above, TSUNAMI gets better throughput when using a larger block size and letting IP fragment--this is partly because TSUNAMI is written so that "blocksize" controls not only the size of the blocks sent but also the size of the file reads and writes.
4.3.3. Analysis of LAN test results
delay = ((sending_time + 50) < ipd_current) ? (ipd_current - sending_time - 50) : 0;
A select() call is used to implement the delay.
5. Summary Comparison of Protocol Features
|Feature||SABUL||TSUNAMI||FOBS|
|TCP Control Port||Yes--Control packets are sent from receiver to sender. The sender can also generate and process a pseudo control packet upon the expiration of a timer||Yes--Control packets are sent from receiver to sender||Yes--2 control ports are used. Control packets are sent both ways|
|UDP Data Port||Yes--Data is sent from sender to receiver||Yes--Data is sent from sender to receiver||Yes--Data is sent from sender to receiver|
|Threaded Application||Main thread does file I/O. 2nd thread keeps track of timers and sends/receives packets||Server forks a process to handle receiver's request. Receiver creates a thread for disk I/O||The sender and receiver are NOT threaded. fobsd is but was not used for these tests|
|Rate Control||Yes--Inter-packet delay implemented by continuous calls to rdtsc()||Yes--Inter-block delay implemented by calls to gettimeofday() and select()||Yes--Inter-block delay implemented by continuous calls to gettimeofday()|
|Tolerated Loss||0.1%||user can set--7.9% default||1.0%|
|Authentication||No||Yes--via a shared secret||No|
|Packet Size||No||Yes--via a shared secret||No|
|Socket Buffers||No||Yes--via a shared secret||No|
|Congestion Control||No||Yes--via a shared secret||No|
|Reorder resilience||No||Yes--via a shared secret||No|
|Duplicates||No||Yes--via a shared secret||No|
|Diagnostics||No||Yes--via a shared secret||No|
TSUNAMI, SABUL and FOBS all do well as long as there is little reordering or loss. Reordering and/or loss seem to cause them all to transmit unneeded duplicates. FOBS gives priority to new data and transmits a CHUNK (default 100MB) of data before doing retransmits. TSUNAMI and SABUL give priority to requests for retransmits. The SABUL client reports any missing packets immediately as well as periodically every 20ms. The TSUNAMI client reports missing packets after the requirements for numbers of packets received (a multiple of 50) and amount of time since the last report (more than 350ms) have been satisfied.
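The TSUNAMI reporting condition combines a packet count with a minimum interval, which can be expressed as a single predicate. The Python below is a sketch of that rule as described, not code from TSUNAMI; the function name should_report is hypothetical, and treating a count of zero as "no report yet" is an assumption.

```python
def should_report(packets_received, ms_since_last_report,
                  packet_multiple=50, min_interval_ms=350):
    """True when both the packet-count and elapsed-time conditions hold."""
    on_multiple = packets_received > 0 and packets_received % packet_multiple == 0
    waited_long_enough = ms_since_last_report > min_interval_ms
    return on_multiple and waited_long_enough


print(should_report(100, 400))  # multiple of 50 and more than 350 ms
print(should_report(100, 200))  # too soon since the last report
print(should_report(120, 400))  # not on a multiple of 50
```

Because both conditions must hold, a receiver on a fast link reports at most about three times per second, in contrast to SABUL's immediate reports.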
In addition, TSUNAMI continues to retransmit the last block/packet until receiving a REQUEST_STOP from the receiver. In one short, loss-free transfer from ORNL to LBL of 8,135 blocks/packets of size 1466, the last block was observed being transmitted 11,876 times. The client actually received 7,835 of the 11,876 before it quit. REQUEST_RESTART requests seem to cause instability. When REQUEST_RESTARTs were sent by the receiver, often both sender and receiver had to be manually stopped as both somehow seemed to get confused and the rate fell below 4Mbps. This happened regularly on transfers from ORNL to LBL with files of 100MB or more. The message, "FULL! -- ring_reserve() blocking" also appeared at the client regularly during the transfer of large files. In order to complete the necessary file transfers with TSUNAMI, the retransmit table was enlarged and a block size of 16384 used in an attempt to eliminate REQUEST_RESTART requests.
FOBS also transmits many unnecessary packets. In a file transfer involving 146,421 data packets (214649928 bytes) and no losses, FOBS actually sent 150,100 packets. These packets are apparently sent while the sender is awaiting instructions from the receiver telling it what needs to be done next. In this case, after sending the first chunk, packets 70000-71229 were re-transmitted. Similar re-transmissions occurred after the second and third chunks. The receiver read all 71527 packets in the first chunk, sent a COMPLETEDPKT and got ready for the next chunk. Before reading the first packet in the next chunk, packets 70000-71229 were read and thrown away. A 'scaleFactor' is used to keep the packets in synch with the correct iteration or the receiver might assume these retransmits are part of the next chunk.
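The packet counts quoted here can be reproduced with a little arithmetic. The Python below is a check on the numbers, not FOBS code: the function name packets_per_chunk is hypothetical, and it assumes that the 100MB default chunk means 100 x 1024 x 1024 bytes and that each chunk is split into 1466-byte packets independently of the others.

```python
CHUNK_BYTES = 100 * 1024 * 1024   # assumed meaning of the 100MB default chunk
PACKET_BYTES = 1466               # FOBS data packet size
FILE_BYTES = 214649928            # size of the test file


def packets_per_chunk(file_bytes, chunk_bytes=CHUNK_BYTES,
                      packet_bytes=PACKET_BYTES):
    """Return the number of data packets needed for each chunk of the file."""
    counts = []
    remaining = file_bytes
    while remaining > 0:
        chunk = min(chunk_bytes, remaining)
        counts.append(-(-chunk // packet_bytes))  # integer ceiling division
        remaining -= chunk
    return counts


counts = packets_per_chunk(FILE_BYTES)
total = sum(counts)
extra = 150100 - total            # packets sent beyond the new data
print(counts, total, extra, round(100.0 * extra / total, 2))
```

Under these assumptions the file splits into three chunks of 71527, 71527 and 3367 packets, for 146,421 in all, so the 150,100 packets actually sent represent about 2.5% of unnecessary traffic in a loss-free transfer.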
SABUL also has problems when there is significant reordering/loss. Since the client reports every perceived loss immediately, this can mean a lot of control packets. In one case, the sender was observed having to deal with one control packet for almost every data packet it was sending. Since a tabulation of lost packets is also sent every 20ms, the same loss may be reported more than once and the sender will count it again for rate control purposes, which gives rise to an interesting phenomenon. If the tabulation of losses for the last period is greater than the number of packets sent during the same period, the sender solves the problem by assuming 100% loss for that period.
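The effect of counting the same loss twice can be seen in a toy calculation. The Python below is a sketch of the behaviour as described, not SABUL's code; the function name period_loss_rate is hypothetical, and returning zero when nothing was sent in the period is an assumption.

```python
def period_loss_rate(reported_losses, packets_sent):
    """Loss rate for one period, capped at 100% when reports exceed sends."""
    if packets_sent == 0:
        return 0.0              # assumed: nothing sent, nothing to penalize
    if reported_losses > packets_sent:
        return 1.0              # more reports than packets: assume 100% loss
    return reported_losses / packets_sent


# 60 packets really lost out of 100 sent, each reported immediately and
# then again in the periodic tabulation, so the sender sees 120 reports.
immediate_reports = 60
periodic_reports = 60
print(period_loss_rate(immediate_reports, 100))
print(period_loss_rate(immediate_reports + periodic_reports, 100))
```

In this example a true loss rate of 60% is treated as 100%, so the rate control backs off harder than the path requires.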
With a shorter RTT--this scenario was with 150 usecs--perhaps not as many unnecessary packets would be sent, but it is assumed these UDP protocols are meant for transferring large files on high-bandwidth, high-delay networks.
After much testing and studying, it is still not clear why SABUL and TSUNAMI report so many losses. One clue may be that the protocols that wait 10000 packets (FOBS) or more (iperf & QUANTA) to report losses do much better. That would seem to indicate that packets may not be received in strict order, but more study needs to be done on this problem.
TCP/Internet friendly? Possibly...possibly not. SABUL is the only one that makes any real claims of fairness. The efficiency and fairness of SABUL's rate-control algorithm are discussed in "Rate Based Congestion Control over High Bandwidth/Delay Links". TCP friendliness is defined as: "A protocol P is TCP friendly if the coexisting TCP connections share approximately the same amount of bandwidth as if P is a TCP connection." The paper presents charts of experiments done to show that SABUL is TCP friendly and therefore fair.
The paper, An Analysis of SABUL Data Transfer Protocol presents the results of NS2 simulations involving multiple SABUL connections and also multiple SABUL and TCP connections. The conclusions reached using their model are that:
The UDP protocols were basically designed to work over networks with high bandwidth, long delay and little or no loss or reordering. With the possible exception of SABUL, they were not meant to share bandwidth fairly with other streams but were designed to run in reserved slots where they could run as fast as the hardware and routers involved will allow. As results show, at present they do not have the "smarts" for dealing effectively with loss or reordering.
TCP rising to the challenge?
As mentioned, there is a lot of work going on in an effort to "spruce up" TCP to meet today's demands. One of the topics under consideration is the size of the maximum transmission unit (MTU). Matt Mathis at the Pittsburgh Supercomputing Center is providing proposals for discussion. Also, Stanislav Shalunov discusses the virtual MTU.
Using rate-based rather than window-based transmittal of packets means the UDP protocols do not have to "ramp-up" at the beginning nor "cut the window in half" when a loss is perceived. There is ongoing work, however, to experiment with modification of these aspects of TCP with Sally Floyd's proposals for TCP with large congestion windows and Tom Kelly's scalable TCP algorithm.
future planned tests
Since these tests were done, new versions of SABUL and its derivative, UDP-based Data Transport Protocol (UDT), have been developed; a new version of QUANTA/RBUDP has been made available; and implementations in user space and the Linux kernel of Datagram Congestion Control Protocol (DCCP) are available for download. We hope to look at these new implementations under a variety of conditions.