Almost TCP over UDP (atou)

As part of our Net100 and Probe efforts in improving bulk transfers over high speed, high latency networks, we have developed an instrumented and tunable version of TCP that runs over UDP. The UDP TCP-like transport serves as a test-harness for experimenting with TCP-like controls at the application level similar to TReno. It is not a full transport protocol that one could use as the basis for a file transfer program (though we may consider that in the future). The implementation provides optional event logs and packet traces and can provide feedback to the application to tune the transport protocol, much in the spirit of Web100 but without the attendant kernel modifications. Part of the motivation for developing such a test harness arose from bugs in the AIX TCP stack.

The experimental UDP protocol includes segment numbers, time stamps, selective ACKs, optional delayed ACKs, sliding window, timeout-retransmissions with rate-based restart, bigger initial window, bigger MSS, burst avoidance, congestion avoidance (but more aggressive, experimenting with initial window size and AIMD parameters). The protocol continues to transmit on the first dup ACK's (limited transmit, RFC3042). The test harness consists of a configurable client (transmitter) and an ACK-server (receiver). The transmitter supports Reno, NewReno, and SACK/FACK and more recently HS TCP, scalable TCP, and TCP Vegas. (For a quick review of TCP, visit our bulk transfer page.

The protocol header includes timestamp and segment number. The receiver's header includes the transmitter's timestamp, the segment number of the last good contiguous segment and up to three optional two-word blocks of SACK info.

The cumulative ACK is just like TCP's, with optional delayed ACK support.

A simple configuration file controls the transmitter. The following aspects of the transmitter can be configured:

The receiver can be configured with a RCVBUF size, port number, SACK, delayed-ACK (amount of delay or none), and debug level. The receiver keeps track of missing and duplicate segments and supports a TCP-like cumulative acknowledgement scheme.

The transmitter behaves like TCP with Reno, Newreno, or SACK/FACK using slow-start and congestion avoidance. Upon termination a summary of the configuration and session statistics are printed as below.

config: portcli $Revision: 1.29 $ port 7890 debug 4 Tue May 22 05:18:54 2001 config: initsegs 2 mss 1460 tick 1.000000 timeout 0.500000 config: rcvrwin 300 increment 1 multiplier 0.500000 thresh_init 1.000000 config: newreno 1 sack 0 delack 0 maxpkts 12000 burst_limit 0 dup_thresh 3 config: sndbuf 32768 rcvbuf 32768 swift => stingray.ccs.ornl.gov 38.835049 Mbs win 300 rxmts 0 3.609111 secs 17520000 good bytes goodput 4854.381103 KBs 38.835049 Mbs pkts in 11542 out 11999 enobufs 0 total bytes out 17518540 loss 0.000 % 38.831813 Mbs rxmts 0 dup3s 0 packs 0 timeouts 0 dups 0 badacks 0 maxack 24 maxburst 300 minrtt 0.065249 maxrtt 0.098987 avrgrtt 0.077747 rto 0.067194 srtt 0.066705 rttvar 0.000122 win/rtt = 45.069119 Mbs bwdelay = 377 KB 258 segs snd_nxt 12000 snd_cwnd 300 snd_una 12000 ssthresh 300 snd_max 12000 The receiver also reports effective bandwidth, number of ACKs sent, and number of segments dropped (or arriving out of order) and the number of duplicated segments.

With a debug level of 4 or higher, a trace file is produced for each transmit and ack packet, data includes time stamp, segment number, round-trip time, congestion window, and threshold.

0.408133 258 xmt 0.408196 234 0.038062 26 26 ack 0.408259 259 xmt 0.408346 235 0.038069 26 26 ack From the trace file, one can plot tcptrace-like information (RTT over time, cwnd over time, average and instantaneous bandwidth, ack-sequence number, etc). Writing the trace file can slow performance for high speed interfaces. With the drop-list, one can induce loss for testing purposes.

Depending on the debug level, events (timeouts, retransmissions, etc.) are loggged on stderr.

3duprxmit pkt 416 nxt 616 max 616 cwnd 200 thresh 200 recover 0 timerxmit pkt 417 snd_nxt 616 snd_max 617 snd_cwnd 100 thresh 100 3duprxmit pkt 2744 nxt 2844 max 2844 cwnd 100 thresh 100 recover 0 packrxmit pkt 2747 nxt 2844 max 2845 cwnd 100 thresh 50 recover 2844 una 2744 packrxmit pkt 2750 nxt 2845 max 2845 cwnd 98 thresh 50 recover 2844 una 2747 badack 6636 snd_max 6808 snd_nxt 6808 snd_una 6661 badack 6892 snd_max 7048 snd_nxt 7048 snd_una 6899 The events above illustrate a retransmit resulting from a three duplicate ACKs, and a retransmit resulting from a timeout. The packrxmit are NewReno (or SACK) retransmits during "recovery". A badack is an out of order ACK or duplicated ACK. The event log and trace-file plots can indicate if the path is getting multiple losses in a window, or single losses, or packets out of order.

The following two excerpts contrast NewReno and SACK. The data sets for the plots are created from the trace file (debug 4). The example below shows a packet loss at time 2.22, and five other losses within the same RTT. NewReno sends a few new packets when after cwnd is halved, then gets a partial ack, and retransmits packet 2225 after one RTT (95 ms), and then retransmits packet 2227 after another RTT, and so on. So full recovery (2.83) takes over six RTT's.

3duprxmit 2.22 pkt 2210 nxt 2335 max 2335 cwnd 123 thresh 179 recover 1979 packrxmit pkt 2225 nxt 2345 max 2345 cwnd 135 thresh 73 recover 2335 una 2210 packrxmit pkt 2227 nxt 2351 max 2351 cwnd 126 thresh 73 recover 2335 una 2225 packrxmit pkt 2230 nxt 2355 max 2355 cwnd 128 thresh 73 recover 2335 una 2227 packrxmit pkt 2234 nxt 2358 max 2358 cwnd 128 thresh 73 recover 2335 una 2230 packrxmit pkt 2237 nxt 2361 max 2361 cwnd 127 thresh 73 recover 2335 una 2234
The next excerpt is from a flow with SACK enabled where there are a sequence of lost packets within a RTT. An initial packet (7149) is lost and retransmitted at 5.85 seconds. SACK provides information to the sender as to what other packets are missing, and so the sender almost immediately retransmits the missing segments as well as some new segments, all within one RTT (95 ms) when the retransmitted packets are finally acknowledged (5.95). Rampdown is also enabled in this example.

3duprxmit 5.85 pkt 7149 nxt 7324 max 7324 cwnd 103 thresh 103 recover 7323 packrxmit pkt 7152 nxt 7324 max 7325 cwnd 103 thresh 103 recover 7323 una 7149 packrxmit pkt 7153 nxt 7324 max 7325 cwnd 103 thresh 103 recover 7323 una 7149 packrxmit pkt 7155 nxt 7339 max 7340 cwnd 103 thresh 103 recover 7323 una 7149

By starting multiple clients and servers (using different port numbers for each pair). you can even do some parallel stream experiments. The TCP-like UDP transport also can make good round-trip estimates from the time stamps. Most TCP stacks use a coarse (0.5 second) resolution timer for estimating round-trip time, and TCP times only one packet per round-trip time. The TCP estimates are complicated by retransmissions and delayed/cumulative ACKs. Newer TCP implementations support a time-stamp option. From analysis of the time-stamp data for our UDP protocol over the ORNL-NERSC link, we chose to initially implement a simple one-second timeout. (Timeouts are rare, and packet loss is usually detected by multiple duplicate ACKs.)

Vegas
In 2002, we (Tom Dunigan and student Ashraf Jahangeer) added TCP Vegas-like options to atou. Each RTT, Vegas calculates the difference (delta) of the expected throughput and actual throughput. Vegas compares this delta with preset thresholds (alpha, beta, gamma), and exits slow start if gamma is exceeded, and adjusts cwnd depending on delta's relation to alpha and beta during congestion avoidance Though Vegas with small values of alpha/beta/gamma doesn't compete well with standard congestion algorithms on the same link, we are interested in experimenting with larger parameter values for high delay/bandwidth paths. Most of the previous work with Vegas has been through simulation. Below is a plot from an ns simulation of TCP Reno continually experiencing losses (insufficient router buffer space) and exhibiting the classic sawtooth performance, yielding only 75% of the link's capacity. In contrast, TCP Vegas avoids losses in this case and throughput reaches the link capacity (The flows weren't competing, they each were simulated independently for 60 seconds, 1.6 Mbs link, 100 ms RTT, delayed ACKs.)

Our atou Vegas allows us to configure alpha, beta, and gamma, as well as select the type of RTT (minimum, last, maximum, or average) used in calculating the actual throughput. Also we have parameter for using Floyd's slow-start when gamma is exceeded. Other differences between our Vegas and the original Vegas paper and the ns implementation include:

As with the other Vegas implementations, no Vegas adjustments are done during loss recovery, so our Vegas will operate with our various AIMD tuning options.

We have experimented using the Vegas gamma threshold during slow-start to initiate Floyd's slow start (rather than max_ssthresh). The figure below shows two transfers using a Vegas implementation over our TCP-over-UDP from PSC to ORNL (OC12, 77ms RTT). The standard Vegas exits slow-start if the Vegas delta is larger than gamma and enters congestion avoidance phase. The standard Vegas (actually using the Floyd AIMD mods) slowly increases the window in congestion avoidance phase, further dampened by the Vegas alpha/beta (1/3 in this test) and only reaches 36 Mbs. The second transfer switches to Floyd's slow-start when gamma is exceeded. Both transfers exceed gamma at about the 0.5 seconds into the transfer. The Floyd version climbs quickly to the limiting window of 2000 segments and reaches an average throughput of 232 Mbs -- nearly a factor of 8 improvement. Interestingly, for these tests, the cwnd value when gamma (1) is exceeded is around 120 segments, close to Floyd's suggested max_ssthresh of 100.

The figure below illustrates three different TCP-over-UDP transfers from PSC to ORNL. The path was exhibiting packet drops during slow-start. The standard slow-start algorithm suffers a loss and recovers using Floyd's AIMD and reaches an average throughput of 198 Mbs after 10 seconds. By comparison, the Vegas-controlled transfers exit standard slow-start when gamma is exceeded and switch to Floyd's modified slow-start and have no packet losses. The Vegas flow with a gamma of 2 reaches 219 Mbs, the flow with a gamma of 60 reaches 229 Mbs.

The following plots show how standard Vegas (alpha=50,beta=80,gamma=400) is able to avoid queueing and reduce latency when the buffers are over provisioned at sender and receiver. The two atou runs were made from a 100 Mb Ethernet host at ORNL across OC12 to PSC (RTT 77 ms). The bandwidth delay product suggests a buffer size of 660 segments, but the transfers use buffers of 1500 segments. The following plot shows the congestion window for standard TCP versus standard Vegas (no Floyd slow-start exit), and the value of the Vegas delta variable during the transfer.

Both flows reach about 85 Mbs at the end of 10 seconds, even though the Vegas flow backs off in slow-start when delta exceeds gamma (400) and goes into linear congestion avoidance, slowly incrementing cwnd, metered by alpha (50) and beta (80). The standard TCP consumes more buffers and as a consequence the RTT is increased. The following plot illustrates how the overprovisioned buffers cause TCP to suffer increased RTT, while the Vegas RTT stays close to the minimum RTT for the path.

Here are links to more Vegas info.

Test Cases

1. ORNL-NERSC
We have used the TCP-like UDP client and server in network configrations other than the two AIX machinies at NERSC and ORNL. We conducted various experiments between swift and stingray, using different buffer sizes, different TCP options within AIX, and different interfaces. We evaluated both TCP and our TCP-like UDP transport. Figure 2 illustrates the throughput behavior of a TCP flow with no loss. (Actually, the data was collected from our TCP-like UDP transport.) TCP (and our TCP-like UDP protocol) starts with a small window and then exponentially increases it until loss occurs or the available bandwidth or receiver's window size is reached. As can be seen in the figure, in less than a second, available bandwidth is reached. The rate or time to reach peak can be calculated analytically as well, since one additional segment (MSS-sized) is transmitted for each ACK received, which takes one round-trip time. We can't do anything about the round-trip time, so if we wish to improve the startup performance we need to have a larger MSS or try a larger initial window size. (See the links page for papers on increasing the initial window size. Our UDP transport has options for an increasing the MSS and initial window size.) TCP reaches a steady-state, the transmitter is regulated by the ACK arrival rate. Note that it takes many seconds over our ORNL-NERSC path to reach capacity.

Figure 2. Average and instantaneous (0.1 sec samples) bandwidth for transfer from stingray to swift using our TCP-like UDP transport.

The startup rate can be slowed by a factor of two if the receiver utilizes delayed-ACKs (sending an ACK for every other segment received).

To improve this recovery rate, one either needs a bigger MSS or a larger additive increase in the recovery algorithm. (TCP adds one new segment per RTT.) Using our UDP transport or the simulator, we can experiment with altering this increment. (Increasing the increment turns out to be "fair", because as it is, if multiple TCP flows are sharing a link and there is congestion, they all halve their rates, but the nodes nearer (shorter RTT) will get more of the bandwidth than the distant node. As with startup, a delayed-ACK receiver also slows down the linear recovery. To futher speed recovery, one could reduce cwnd by less than half. We have done a few experiments adjusting these AIMD parameters with our tests harness. Figure 3a compares standard AIMD parameters (0.5,1) with setting cwnd to 0.9 of its value and adding 10 segments per RTT.

Figure 3a. Standard and agressive AIMD recovery from a single packet loss using our TCP-like UDP protocol between NERSC and ORNL.

See AIMD papers on links page.)

Most TCP-stacks utilize delayed ACKs, our UDP transport can disable delayed-ACKs and profit from a slightly faster startup rate. The effect of delayed-ACK's on throughput is evident in Figure 4. The figure shows average and instantaneous throughput of a TCP-like transfer with and without delayed ACK's. Using the droplist in our TCP-like UDP transport, we have dropped two packets in two transfers from ORNL to NERSC.

Figure 4. Average and instantaneous throughput with and without delayed-ACKs using our TCP-like UDP protocol between NERSC and ORNL.

Even though the transfers are competing with other Internet traffic, the non-delayed ACK transfer starts up faster and recovers from the loss faster.

Using this same simulation configuration, Figure 5 compares average and instantaneous throughput for an Ethernet-sized segment (1500B) versus a jumbo-frame segment (9000B). As noted earlier, both startup and recovery times are effected by the segment size. ATM supports a 9K segment size, and FDDI supports a 4K segment, so if both endpoints had different network interfaces and the intervening path supported the larger segment size, then throughput should be improved.

Figure 5. Average and instantaneous throughput for 1.5K mss versus 9K mss. (ns)

With our TCP-like UDP transport, we can experiment with larger MSS sizes. We used a larger MSS (2944 and 5888 bytes) with our UDP transport and got better throughput. For the same number of bytes transferred and the same effective window size we got 50 Mbs with 1472-byte datagram, 55 Mbs with 2944-byte datagram, and 58 Mbs with 5888-byte datagram. (Figure 6) The UDP datagrams greater than the 1500-byte Ethernet MTU are fragmented by IP. These larger datagrams have an effect similar to TCP's delayed-ACK, in that all fragments (MTU-sized) have to arrive before the receiver ACK's the datagram. Kent and Mogul, however, argue that IP fragmentation usually lowers network performance.

Figure 6. Average and instantaneous throughput for various datagram sizes.

The same effect can be gained by a "virtual MSS", that is choosing an initial startup of K segments (initsegs, and then adding K segments per RTT (increment) during recovery. The virtual MSS avoids the IP fragmentation. (Figure 6a) illustrates a transfer from NERSC to ORNL with two packets drops using the default MSS and then using a virtual MSS of 10 segments. The virtual MSS is also used for slow start, so there is improvement in starting the connection as well.

Figure 6a. Average and instantaneous throughput for virtual MSS of 10x.

We have recently ('03) added Kelly's scalable TCP modifications the additive increase. An early comparison of recovery from a single loss can be seen here. This is an atou transfer from ORNL to LBL (80 ms RTT) with window limit of 1500 segments and with no delayed ACKs.

Parallel TCP streams also has the effect of multiplying the MSS by the number of streams, but parallel streams also profits from TCP's fairness by letting the application acquire a greater proportion of the bandwidth from competing application flows. Parallel flows often sustain higher throughput in the presence of loss, because often only one stream is affected, and it can recover to its window-bandwidth faster than a single stream with a larger window.

The configuration values initsegs, ssincr, and thresh_init affect the rate and duration of slow start. As noted above, the initial window size (initsegs) can be used to create a virtual MSS, and there are several papers and RFC drafts on choosing the initial window size. Doubling the initial window size reduces the duration of slowstart by one RTT. By increasing ssincr from 1 to K, the duration of slowstart is reduced by factor of log(K)/log(2). In the figure below, standard slow-start (increment of 1) is compared to a 10s test using a slow-start increment of 100. The path is from ORNL to CERN (200 ms RTT) Though not illustrated below, a larger ssincr will also speed recovery in those cases where slow-start is used during the initial recovery phase (e.g., timeouts).

Slow start stops when ssthresh is reached. In most TCP implementations, the initial value of ssthresh is infinite, and is set to cwnd/2 when congestion occurs. If the transfer is not window-limited, our experience has been that packet loss often occurs during initial slow start, so one can set thesh_init to some fraction of rcvrwin to try to avoid this early loss. Setting thesh_init to zero makes the inital value of ssthresh infinite. Floyd's slow-start modifications (max_ssthresh) and Vegas (gamma) can also moderate slow-start.

2. UT-ORNL

An OC3 (150 Mbs) connects ORNL and UT. We tested throughput with between two 100 Mbs hosts and found that the ATM/OC3 has been provisioned as a VBR link. Using rate-based flooding of UDP packets, we can reach nearly 100Mbs for a short while, then the traffic policing cuts the sustained rate to about 50 Mbs. We saw similar behavior using AAL5. The VBR policy has an interesting affect on TCP, letting it ramp up during slow-start, then dropping segments to bring the sustained rate back down. Choosing a window of about 22KB avoids any packet loss and achieves a bandwidth of 41 Mbs. Larger window sizes induce loss and reduce bandwidth, smaller window sizes cause the transfer to be buffer-limited.

3. ORNL-cable

We tested throuhgput and loss characteristics from a home cable-modem system to ORNL. The route is asymmetric and tortuous, some 20+ hops all the way to the west coast and back. During one test period, the route into the home cable modem appeared to split the traffic across multiple routes. Rate-based UDP flooding could reach 6+ Mbs in to the home without loss, BUT our test software showed that the UDP packets were wildly out of order (60% of the packets). Packets could be displaced by as much as 30 positions. TCP inbound throughput over this path could only reach 700 Kbs, because TCP treats 3 out of order packets as a loss and does a retransmit. With our TCP-like UDP protocol, we increased the "dup threshold" to 9 to account for the massive re-ordering and improved the throughput to 1.7 Mbs.

A more dramatic illustration of the effect of re-ordering is illustrated in the following bandwidth plot of two flows from LBNL to ORNL (OC12, 80 ms RTT) with a window of 2000 segments, with delayed ACKs, and with SACK enabled. (The plot shows bandwidth measured every 0.1 second and the running average.) The path out of LBNL exhbits a lot of reordering. Using the standard threshold of 3, we can achieve only 18 Mbs after 10 seconds because of 9 congestion events resulting in 289 retransmits. However, the atou server reports 289 duplicate packets received and a max out-of-order of 7, so none of the packets were actually lost but were merely out of order. We ran another test with the client's dup_thresh set to 19. This flow experiences no losses and reaches 218 Mbs after 10 seconds. This particular period of heavy reordering lasted several hours and is probably the result of path through a particular router.

The following figures shows the re-ordering events (green columns) plotted with the instantaneous throughput. In each event, one packet was out of order by 8 slots. For the threshold 3 example, there were 47 reordering events (0.3% of packets were out of order). For the threshold 19 example, there were 309 reordering events (0.2% of packets were out of order). The reordering events seem to be related to congestion or data-rate. The time before the next in order packet arrived was less than a millisecond, suggesting a reordering anomaly in a particular router. See our references page for papers on adjusting TCP to packet reordering. Linux 2.4 TCP does a good job of adapating to reordering and "undoing" the associcated congestion avoidance (see our autotuning page).

4. ORNL-cable (ACK-limited)

The cable modem system we were testing limits data rates from the home to 128 Kbs. This limit in combination with the mutli-hop route actually limited our throughput into the home. Our TCP tests into the home cable modem were actually faster than what we could achieve with our TCP-over-UDP. Our TCP-over-UDP was not using delayed ACKs so there was twice as much bandwidth being consumed on the ACK path, and we were sending empty SACK blocks, further wasting ACK bandwidth. We trimmed off the empty SACK blocks in the ACK path and then enable delayed ACKs in our TCP-over-UDP software and that permitted us to meet or exceed the TCP bandwidth. So for this configuration, delayed ACK's provided a speedup. In some of our ns simulation work we noted that eliminating delayed ACK's also reduced the performance of TCP Vegas. We also have observed that no delayed ACK's during startup (observed with Linux 2.4) may cause packet burst from GigE interfaces that cause drops at some device in the path, so delayed ACKs may be desirable in this case.

We also tested the TCP-over-UDP protocol on slow dialup links, a dual ISDN link, wireless (802.11), Ethernet (10/100), FDDI, GigE, and HiPPI2/GSN links with 64K MTU (reached 1.3 Gbs). We have yet to do rigorous tests comparing the protocol's behavior to TCP, but throughput and loss rates for back-to-back tests are usually quite similar. (This presumes the network hasn't changed "much" ( :-) ) from one test to the next.)

How atou differs from TCP

Our TCP over UDP can do some things that most TCP implementations cannot do.

Our TCP over UDP also cannot do some things that TCP can do.

Summary

Work began on atou in the fall of 2000. We have tested our TCP-like protocol on a variety of networks, and on a number of operating systems, including Linux, Solaris, Sun OS, Irix, FreeBSD, Compaq's OSF, and AIX. We also did a port to Windows as well, though the source files do not contain those modifications. Where TCP could be tuned, we tried to get TCP to match or exceed the UDP client/server data rate. The TCP-like protocol has provided higher throughput in some cases than TCP.

The software provides both a time-stamped packet log and event-based information (entering/leaving recovery, timeouts, out-of-order ACKs, etc.) The debug modes give useful information on which packets were lost, window/threshold changes, bandwidth, round-trip times, and the classical sequence/ACK number graphs. The event and trace data have been very useful in understanding packet loss and throughput on various links.

The TCP-like UDP protocol is appealing because it does not require kernel mods to tweak TCP-like congestion/transport parameters. We have demonstrated striking changes in performance when modifing our transport's buffer size, MSS size, or the AIMD parameters. Disabling delayed ACKs and altering the dup threshold has also improved performance in some networks. For multiple losses within a window, the SACK/FACK options improved recovery as expected. We did not see much effect in altering the burst limit or the initial ssthresh value (limits initial slow start). Since we use slow-start after leaving a congestion event, bursts are reduced.

Future work

Links

See our tech report A TCP-over-UDP Test Harness (June, '02), though the info on this web page is probably more current.

PSC's Treno uses UDP/ICMP packets and TCP-like slow-start and congestion control to estimate TCP capacity of a path. In November, 2001, Allman describes a UDP-based tool cap with TCP controls that can be used to measure the bulk transfer capacity of a path. Visit the network performance links page for related RFC's and papers.


Last Modified Monday, 12-Jan-2004 14:30:27 EST thd@ornl.gov (touches: 395709 )
back to Tom Dunigan's page or the Net100 page or the ORNL home page