Faster Bulk Transfer Starring: *UDP*

Study and Evaluation of 3 UDP protocols

Tsunami is another protocol born out of frustration with TCP. The authors were working toward the launch of the Global Terabit Research Network. A launch demonstration was planned for a meeting in Brussels, and they wanted to do something flashy and memorable. They had demonstrated wire-rate gigabit Ethernet transfers in their lab using normal Ethernet MTUs and were confident they could easily achieve more than 500 Mb/s. One PC was shipped to Belgium and one to Seattle. Once they were set up, the testing began. Because of a 3% packet loss, the rates varied from a few tens of Mb/s to a very few hundreds of Mb/s. Less than one week before the demo, the lab decided it would have to design its own protocol. Less than 3 days after a whiteboard diagram, there was a working prototype, and a few days later the demo managed to average over 800 Mb/s for 17 hours and 40 minutes. Judging from the general comments and the amount of time involved, this was probably a UDP blast with a minimum of other features.
Tsunami is evolving, however, to include some features intended to make it more network-friendly such as:

Tsunami is an application library implemented in (well commented!) C.

1. use of TCP/UDP channel(s)
Tsunami uses a TCP channel for control packets and a UDP channel for data packets sent from server to client. On the UDP socket, setsockopt() sets SO_SNDBUF at the sender and SO_RCVBUF at the receiver. On the TCP socket, TCP_NODELAY and SO_REUSEADDR are set on both sides. tsunamid listens on a TCP port for client requests and, upon receiving one, forks a child process to deal with the connection. Command line options for tsunamid (shown with their defaults) include: verbose (yes), transcript (no), ipv6 (no), TCP port (46224), shared secret, datagram/block size in bytes (32768), UDP buffer size (20MB).
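The socket options named above can be sketched as follows. This is a minimal illustration, not Tsunami's own code: the helper names set_udp_buffer and set_control_options are hypothetical, and only the option names come from the text.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: request a UDP socket buffer of `bytes`.
 * is_sender != 0 sets SO_SNDBUF (server side), otherwise SO_RCVBUF
 * (client side). Returns 0 on success, -1 on error. */
int set_udp_buffer(int fd, int is_sender, int bytes)
{
    assert(bytes > 0);
    int opt = is_sender ? SO_SNDBUF : SO_RCVBUF;
    return setsockopt(fd, SOL_SOCKET, opt, &bytes, sizeof(bytes));
}

/* Hypothetical helper: apply the two options the text lists for the
 * TCP control channel. Returns 0 on success, -1 on error. */
int set_control_options(int fd)
{
    int yes = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes)) < 0)
        return -1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &yes, sizeof(yes));
}
```

The operating system may cap the buffer size it actually grants, so a careful caller would read the value back with getsockopt() rather than assume the full request was honored.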
tsunami, the client, connects to the server on the TCP port (46224). The client calls fdopen() to convert its TCP channel to a standard I/O stream and uses the standard fread and fwrite calls to send and receive control information. The server uses fcntl() to make its TCP channel blocking while the up-front negotiations for the file transfer are going on, and then non-blocking, so the channel can be checked after each block is sent without stalling the UDP blast; the sender pauses only when a control message from the client is waiting.
After a TCP connection is established, the client acts on commands typed at its prompt, passing the needed information to the server. The Tsunami user interface seems to be modeled on ftp, with the possible client commands being: connect, get, close, help, quit, and set. The "set" command gives an opportunity to affect 12 parameters (shown with the default values):
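The blocking/non-blocking switch described above might look like the sketch below. The function names set_blocking and poll_control are hypothetical and the code is an illustration of the fcntl() technique, not an excerpt from tsunamid.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switch a descriptor between blocking and
 * non-blocking mode. Returns 0 on success, -1 on error. */
int set_blocking(int fd, int blocking)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    if (blocking)
        flags &= ~O_NONBLOCK;
    else
        flags |= O_NONBLOCK;
    return fcntl(fd, F_SETFL, flags) < 0 ? -1 : 0;
}

/* Hypothetical helper: try to read a control message without waiting.
 * Returns the number of bytes read, 0 if nothing is pending (or the
 * peer closed the channel), and -1 on any other error. */
ssize_t poll_control(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return n;
}
```

A sender would call set_blocking(fd, 1) during negotiation, set_blocking(fd, 0) once the transfer starts, and poll_control() once per block.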
server = localhost
port = 46224
buffer = 20000000
verbose = yes
transcript = no
ip = v4
output = line
rate = 1000000000
error = 7.50%
slowdown = 25/24
speedup = 5/6
history = 25%
datagram = 32768
Upon receiving a "get" command, the client obtains a UDP socket and sends the port number to the server. Once the file is verified, all data is sent through the UDP socket from server to client. The UDP checksum is used to ensure a block is transferred correctly.

2. rate-control algorithm
The beginning rate is set to DEFAULT_TARGET_RATE (1000000000), unless changed by the user with the "set" command, and a timed select() call is used to implement the IBD (inter-block delay). The beginning IBD is calculated as:
    param->ipd_time   = (u_int32_t) ((1000000LL * 8 * param->block_size) / param->target_rate);
    xfer->ipd_current = param->ipd_time * 3;
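As a worked example of the calculation above: with the default block size of 32768 bytes and the default target rate of 1000000000 bits per second, the formula yields 262, and the transfer starts at three times that, 786. The unit is presumably microseconds, as the factor of 1000000 suggests. The helper name initial_ibd_usec below is hypothetical; only the arithmetic is taken from the snippet.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper reproducing the ipd_time formula:
 * time, assumed to be in microseconds, needed to send one block of
 * block_size bytes at target_rate bits per second (integer division). */
uint32_t initial_ibd_usec(uint32_t block_size, uint64_t target_rate)
{
    assert(target_rate > 0);
    return (uint32_t)((1000000ULL * 8 * block_size) / target_rate);
}
```

Starting at three times the computed delay means the transfer begins at roughly one third of the target rate and relies on the speedup factor to work its way toward it.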
When a REQUEST_ERROR_RATE message is received by the sender, the IBD is recalculated using the new error rate. If the new rate is greater than the accepted maximum, increase the delay--slowing things down. Otherwise, decrease the delay--speeding things up.

        if (retransmission->error_rate > param->error_rate) {
            double factor1 = (1.0 * param->slower_num / param->slower_den) - 1.0;
            double factor2 = (1.0 + retransmission->error_rate - param->error_rate)
                             / (100000.0 - param->error_rate);
            xfer->ipd_current *= 1.0 + (factor1 * factor2);
        } else
            xfer->ipd_current = (u_int32_t) (xfer->ipd_current *
                                (u_int64_t) param->faster_num / param->faster_den);

        /* make sure the IBD is still in range */
        xfer->ipd_current = max(min(xfer->ipd_current, 10000), param->ipd_time);
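The adjustment above can be tried in isolation with the self-contained sketch below. The struct rate_params and the function adjust_ibd are hypothetical stand-ins for the param and xfer fields in the snippet. The sketch assumes error rates are expressed in thousandths of a percent (so 7.50% is 7500), which is what the constant 100000.0 suggests but the text does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the tunables used by the snippet above. */
typedef struct {
    uint32_t error_rate;              /* threshold, assumed 1/1000 percent */
    uint32_t slower_num, slower_den;  /* slowdown factor, e.g. 25/24 */
    uint32_t faster_num, faster_den;  /* speedup factor, e.g. 5/6 */
    uint32_t ipd_time;                /* lower bound for the delay */
} rate_params;

/* Returns the new inter-block delay given the current delay and the
 * error rate reported by the receiver. Mirrors the snippet's logic. */
uint32_t adjust_ibd(uint32_t ipd_current, uint32_t reported_error,
                    const rate_params *p)
{
    assert(p->slower_den != 0 && p->faster_den != 0);
    double next;
    if (reported_error > p->error_rate) {
        /* slow down in proportion to how far the error exceeds the threshold */
        double factor1 = (double)p->slower_num / p->slower_den - 1.0;
        double factor2 = (1.0 + reported_error - p->error_rate)
                         / (100000.0 - p->error_rate);
        next = ipd_current * (1.0 + factor1 * factor2);
    } else {
        /* speed up by the fixed faster_num/faster_den ratio (integer math) */
        next = (double)((uint64_t)ipd_current * p->faster_num / p->faster_den);
    }
    /* clamp to [ipd_time, 10000], as in the original */
    if (next > 10000.0)
        next = 10000.0;
    if (next < p->ipd_time)
        next = p->ipd_time;
    return (uint32_t)next;
}
```

With the defaults, a starting delay of 786 drops to 655 after one error-free report, while a reported error rate of 100% raises it only to 818. Because factor2 scales the slowdown by the excess over the threshold, a report just above the threshold leaves the delay essentially unchanged.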
Block size is a key parameter, since block_size datagrams are handed off to UDP and then to IP. The default block size of 32K means IP will fragment a block into 23 or so packets and send them all out before any delay is applied. The IBD is implemented between blocks, not between packets, unless a block equals a packet in size. A smaller block size means the sending rate is adjusted more often, keeping the transfer more attuned to the network, and the IBD comes closer to being an IPD (inter-packet delay). On the other hand, larger block sizes are more efficient when it comes to file I/O.
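The "23 or so packets" figure can be checked with a short calculation. The sketch below assumes IPv4 with a 20-byte header and no options, an 8-byte UDP header, and a 1500-byte Ethernet MTU; it ignores the small header Tsunami itself adds to each block. The function name ip_fragments is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of IPv4 fragments needed to carry one UDP
 * datagram with `payload` bytes of data over a link with the given MTU.
 * Assumes a 20-byte IPv4 header (no options) and an 8-byte UDP header. */
uint32_t ip_fragments(uint32_t payload, uint32_t mtu)
{
    assert(mtu > 20 + 8);
    /* every fragment except the last carries a multiple of 8 bytes */
    uint32_t per_fragment = ((mtu - 20) / 8) * 8;
    uint32_t total = payload + 8;  /* UDP header travels in the first fragment */
    return (total + per_fragment - 1) / per_fragment;
}
```

For a 32768-byte block and a 1500-byte MTU this gives 23 fragments, while a 1472-byte block fits in a single unfragmented packet, which is the case where the IBD behaves as a true inter-packet delay.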

3. data sending algorithm
The sender's main functions are to build and send block-sized datagrams and process control packets from the receiver. Data is read directly from the file in block-sized segments into a buffer. A block number and type are attached so the only thing to keep straight is where the file read should begin for the next block.
The sending algorithm is:
  1. initialize
  2. get current time
  3. check the non-blocking TCP channel. If there is control data, read and process the control packet
    • REQUEST_RETRANSMIT -- Retransmit the given block then go to step 5
    • REQUEST_RESTART -- Restart the transfer at the given block then go to step 5
    • REQUEST_ERROR_RATE -- Use the given error rate to adjust the IBD, update and print statistics then go to step 5
    • REQUEST_STOP -- go to step 6
  4. build the next new datagram and send it
  5. delay until time to send the next block then go to step 2
  6. do ending stats and close down
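Step 5 of the sending algorithm, the delay until the next block is due, can be sketched with the timed select() call mentioned earlier. The helper names usec_between and wait_for_next_block are hypothetical, and the sketch assumes the delay is measured in microseconds from the time taken in step 2.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Microseconds elapsed from *start to *end. */
long usec_between(const struct timeval *start, const struct timeval *end)
{
    return (long)(end->tv_sec - start->tv_sec) * 1000000L
           + (long)(end->tv_usec - start->tv_usec);
}

/* Hypothetical helper: sleep for whatever remains of the inter-block
 * delay, counted from block_start. Returns at once if the delay has
 * already elapsed (for example, because sending the block was slow). */
void wait_for_next_block(const struct timeval *block_start, long ibd_usec)
{
    assert(ibd_usec >= 0);
    struct timeval now, delay;
    gettimeofday(&now, NULL);
    long remaining = ibd_usec - usec_between(block_start, &now);
    if (remaining <= 0)
        return;
    delay.tv_sec = remaining / 1000000L;
    delay.tv_usec = remaining % 1000000L;
    /* select() with no descriptor sets acts as a sub-second sleep */
    select(0, NULL, NULL, NULL, &delay);
}
```

Measuring from the start of the iteration rather than sleeping for the full delay keeps the time spent reading the file and handling control packets from stretching the effective interval.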

4. data receiving algorithm
Until the transfer is complete, the client receives the datagrams and periodically sends control packets to the sender. A thread is created to handle disk I/O.
  1. initialize: allocate a retransmit table, received-data bitfield, and a ring_buffer; create a thread to do I/O.
  2. start timing
  3. reserve a slot in the ring buffer for the next datagram
  4. receive a datagram into the reserved slot
  5. signal the I/O thread that data is ready
  6. if the block number is greater than the expected block number, put the missing block numbers into the retransmit table
  7. if this is the last block
    • if we have gotten all the blocks, go to step 10
    • send a REQUEST_RETRANSMIT packet for any missing blocks
  8. if we have received a multiple of 50 blocks and a preset interval has passed since the last time statistics were updated:
    • if the retransmit queue is over MAX_RETRANSMISSION_BUFFER entries, send a REQUEST_RESTART packet beginning at the first missing block in the queue
    • otherwise send a REQUEST_RETRANSMIT packet for each missing block
    • update rate statistics and send along the smoothed retransmission rate in a REQUEST_ERROR_RATE packet
    • display the latest statistics information
    • reset statistics timer
  9. if there is more data, go to step 3
  10. call pthread_join for the disk I/O thread
  11. send a REQUEST_STOP packet to the server
  12. stop timing, display final results and close up
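The bookkeeping in steps 1 and 6, a received-data bitfield plus a retransmit table filled in when a gap appears, can be sketched as below. All names (recv_state, note_block, prune_retransmit) are hypothetical, block numbers are assumed to start at 1, and the sketch is an illustration of the technique rather than the client's actual data structures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t  *received;        /* one bit per block number */
    uint32_t *retransmit;      /* block numbers believed missing */
    uint32_t  retransmit_len;
    uint32_t  next_block;      /* next block number expected in order */
    uint32_t  block_count;
} recv_state;

/* Allocate the tables for a transfer of block_count blocks. */
int recv_state_init(recv_state *s, uint32_t block_count)
{
    s->received = calloc(block_count / 8 + 1, 1);
    s->retransmit = calloc(block_count ? block_count : 1, sizeof(uint32_t));
    s->retransmit_len = 0;
    s->next_block = 1;
    s->block_count = block_count;
    if (!s->received || !s->retransmit) {
        free(s->received);
        free(s->retransmit);
        return -1;
    }
    return 0;
}

static int got_block(const recv_state *s, uint32_t b)
{
    return (s->received[b / 8] >> (b % 8)) & 1;
}

/* Record the arrival of block b. Returns 1 if new, 0 if a duplicate. */
int note_block(recv_state *s, uint32_t b)
{
    assert(b >= 1 && b <= s->block_count);
    if (got_block(s, b))
        return 0;
    s->received[b / 8] |= (uint8_t)(1u << (b % 8));
    if (b >= s->next_block) {
        /* step 6: every block skipped over is queued for retransmission */
        for (uint32_t m = s->next_block; m < b; m++)
            s->retransmit[s->retransmit_len++] = m;
        s->next_block = b + 1;
    }
    return 1;
}

/* Drop queued entries that have since arrived; returns how many remain. */
uint32_t prune_retransmit(recv_state *s)
{
    uint32_t kept = 0;
    for (uint32_t i = 0; i < s->retransmit_len; i++)
        if (!got_block(s, s->retransmit[i]))
            s->retransmit[kept++] = s->retransmit[i];
    s->retransmit_len = kept;
    return kept;
}
```

Each block number is queued at most once, so a table of block_count entries cannot overflow. A receiver built this way would call prune_retransmit() before sending its periodic REQUEST_RETRANSMIT packets and compare the remaining count against its restart threshold.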
Graphs illustrating a representative transfer over NistNet using tsunami may be found here.

5. unique features
Tsunami implements an authentication process between the server and client via a shared secret.
***Accounting for data via blocks
As mentioned, data is read, written, transferred and accounted for in block-sized chunks. A block of data is handed off to UDP, and thus IP may need to fragment the data at the sender and reassemble it at the receiver. When block sizes are large, retransmissions could be a BIG deal and a REQUEST_RESTART could be a REALLY BIG deal! We discovered that setting the datagram parameter on the sender's command line had no effect on the receiver, which stayed at 32K. We needed to alter the receiver code to accept a datagram/block parameter in order to experiment with different block sizes.
***The file is written out as blocks are received
In tests, this meant that the last block might be written out and so the file would list the correct number of bytes with 'ls -l'. However, there would be blocks missing if the file transfer had not completed. This happened at times when REQUEST_RESTART did not function as designed and both client and server had to be manually stopped.
***Number of user-set parameters
The tsunami client allows the user to set and tune up to 12 different parameters as noted above. This includes performance-specific parameters such as buffer size, target rate, expected error rate, slowdown and speedup factors and the percentage of history used in the rate calculation.