Thursday, November 13, 2008

Understanding the Series of Tubes ...

The Internet is massive and complicated, and this makes it difficult for researchers to understand its strengths and deficiencies and for developers to write applications on top of it. In many ways there isn't really any such thing as a "debugger" for the network, and that makes it hard to do performance debugging (researchers) and logic debugging (application developers).

I'm discussing two papers here: the first is "End-to-End Internet Packet Dynamics" [DYNAMICS] and the second is "X-Trace: A Pervasive Network Tracing Framework" [XTRACE].

The author of [DYNAMICS], Vern Paxson, was very interested in understanding network pathologies like out-of-order delivery, packet replication, and packet corruption. I do find it rather ironic that these are "pathologies" considering the architects of the Internet wanted the network to be robust to exactly these cases. In fact, I don't know how far off base I would be in suggesting that the original architects of the Internet saw these pathological cases as much more of the norm than we consider them today. Vern even comments that in the presence of these pathologies certain network assumptions are no longer applicable ... which implies to me that the community has generally accepted that the common case is no out-of-order delivery, no packet replication, and no packet corruption.

So how devastating can these pathologies be? Well, Vern claims that reordering is not common across all paths, but it can be incredibly common on some of them, especially paths with frequent route fluttering. Even so, the consensus seemed to be that reordering generally has little impact on TCP performance. The exception is in setting TCP's duplicate-ack threshold: reordering can produce enough duplicate acks to look like a loss, and in some cases it might be beneficial to wait a bit longer on the dups to better disambiguate packet loss from reordering (disambiguating losses seems to be a big theme for a lot of papers ...).
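To make that concrete, here is a toy sketch (my own, not from the paper) of the standard fast-retransmit rule: the sender retransmits after three duplicate acks, so a segment that is merely reordered behind three later segments looks exactly like a lost one. The function and variable names are made up for illustration.

```python
# Toy sketch: how TCP's duplicate-ack threshold confuses reordering with
# loss. The receiver acks the highest in-order segment; every
# out-of-order arrival produces a duplicate ack, and the sender
# fast-retransmits once it has seen DUP_ACK_THRESHOLD duplicates.

DUP_ACK_THRESHOLD = 3  # standard fast-retransmit trigger

def sender_reaction(acks):
    """Return the segments a naive sender would fast-retransmit."""
    dup_count = {}
    retransmitted = []
    for ack in acks:
        dup_count[ack] = dup_count.get(ack, 0) + 1
        # threshold + 1 because the first ack for a segment is not a duplicate
        if dup_count[ack] == DUP_ACK_THRESHOLD + 1:
            retransmitted.append(ack)  # retransmit the segment the receiver is still waiting for
    return retransmitted

# Segment 3 is merely reordered behind segments 4, 5, and 6 (not lost),
# but it still generates three duplicate acks -- enough to trigger a
# spurious fast retransmit.
acks_seen_by_sender = [2, 3, 3, 3, 3, 7]
print(sender_reaction(acks_seen_by_sender))  # -> [3]
```

Waiting a little longer before acting on the dups would let the late segment arrive and the spurious retransmission disappear, at the cost of reacting more slowly to real losses.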

And packet replication? Not such a big deal, apparently. It only seems to occur due to buggy hardware or software configurations. 

And packet corruption? About 1 in 5,000 packets arrives corrupted; given a 16-bit checksum, roughly one in 65,536 of those bad packets will be erroneously accepted, resulting in undetected data corruption! Those don't really seem like the best odds ... or perhaps I am missing something here.
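Working the numbers (my own back-of-the-envelope, using the paper's figures): combining the roughly 1-in-5,000 corruption rate with the roughly 1-in-2^16 chance that a 16-bit checksum misses a corrupted packet gives about one undetected corruption every few hundred million packets, which at Internet scale is still a lot of silently bad data.

```python
# Back-of-the-envelope arithmetic (mine, using the paper's figures):
# ~1 in 5,000 packets arrives corrupted, and an idealized 16-bit
# checksum lets ~1 in 2^16 corrupted packets slip through undetected.
corruption_rate = 1 / 5000      # corrupted packets / total packets
checksum_miss   = 1 / 2**16     # undetected / corrupted

undetected = corruption_rate * checksum_miss
print(f"~1 undetected corruption every {1 / undetected:,.0f} packets")
# -> ~1 undetected corruption every 327,680,000 packets
```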

To measure the bottleneck bandwidth of routes Vern uses packet pair, which I thought was an elegant little trick. The packet pair technique works as follows: send two packets back to back (or with a known spacing) and measure their spacing when they arrive at the receiver. The amount they were "spread out" should tell you the bottleneck bandwidth on the path!
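Here is a minimal sketch of that estimate (the function name and example numbers are mine); it ignores cross traffic, which is exactly what makes packet pair tricky in practice.

```python
# Packet-pair sketch: two packets sent back-to-back get spread apart by
# the bottleneck link, because the second packet can leave that link no
# sooner than one serialization time after the first. So
#
#     receiver_gap ~= packet_size / bottleneck_bandwidth
#
# and inverting the measured gap gives the bandwidth estimate.

def packet_pair_estimate(packet_size_bytes, arrival_gap_seconds):
    """Estimate bottleneck bandwidth in bits per second."""
    return (packet_size_bytes * 8) / arrival_gap_seconds

# Example: 1500-byte packets arriving 1.2 ms apart suggest a ~10 Mbps bottleneck.
print(packet_pair_estimate(1500, 0.0012))  # -> 10000000.0 bits/sec
```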

Vern continues his study by looking more closely at packet loss and the efficacy of retransmissions. He has a nice classification of the three types of retransmissions as "unavoidable", "coarse feedback", and "bad retransmission timeout". This is interesting because it reminds us that in our systems there are two kinds of losses: real losses versus measurement losses (when our tools tell us there is a loss but there really wasn't one). Clearly we want to minimize the number of retransmissions caused by the latter. It was even more interesting to see that there were still some buggy implementations of TCP out there that made it such that the "unavoidable" retransmissions were not the most frequent!
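For my own benefit, here is a rough paraphrase of those three categories as a decision rule. This is my reading of them, not the paper's actual analysis code, and the boolean inputs are hypothetical facts one would derive from a two-point packet trace.

```python
# Rough paraphrase (mine) of the paper's three retransmission classes,
# given what a sender-side plus receiver-side trace could tell us about
# a retransmitted segment.

def classify_retransmission(data_or_all_acks_lost, sack_would_have_avoided):
    """
    data_or_all_acks_lost: the original data never arrived, or every ack
        covering it was lost, so no sender could have done better.
    sack_would_have_avoided: richer feedback (e.g. SACK) would have shown
        the sender that the data had already been delivered.
    """
    if data_or_all_acks_lost:
        return "unavoidable"
    if sack_would_have_avoided:
        return "coarse feedback"
    return "bad retransmission timeout"  # the timer fired too eagerly

print(classify_retransmission(data_or_all_acks_lost=False,
                              sack_would_have_avoided=True))
# -> "coarse feedback"
```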

This paper had lots of goodies in it ... and I could probably go on and on and on. Instead, I'll switch gears and discuss X-Trace a bit.

The X-Trace authors were trying to solve a real problem that clearly had plagued some of them before. I know I've been trying to build Internet applications and wondered aloud what in the heck could possibly be happening to my packets. In fact, I'm working as we speak on wireless mobility, and some of my results really make me scratch my head ... if I could know exactly where my packets were going, that would be amazing!

Of course, that is a lot to ask for: it is hard enough to get lots of administrative domains to implement the current standards, so it seems dubious that we can impose more standards on top. That said, I thought the authors of X-Trace made two points very clear: (1) X-Trace can be implemented incrementally and (2) the overhead of adding X-Trace is not that substantial, from either a cognitive perspective or a performance perspective. Both (1) and (2) make X-Trace much more appealing to the community.

The X-Trace concept is relatively straightforward: each layer performs "pushNext" and "pushDown" operations to propagate the metadata to the next hop in the same layer and down into the layer below, so the metadata follows the request along its path. The metadata must travel in-band in order to be useful, but the resulting trace reports can be returned in any manner appropriate (out-of-band, delayed, abridged, not at all, etc.). Once a user has gotten as much of the trace as possible, they can perform an offline computation on the data to get the resulting "task tree".
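Here is a toy rendering of the two propagation primitives (mine, not the X-Trace reference implementation). Real X-Trace metadata is a compact binary field carried in protocol extension headers, and reports go to an out-of-band reporting infrastructure rather than a Python list; the class and field names below are just for illustration.

```python
# Toy sketch of X-Trace-style metadata propagation: a (task_id, op_id)
# pair travels with the request, and each propagation step reports a
# causal edge so the "task tree" can be reconstructed offline.

import uuid

REPORTS = []  # stand-in for the out-of-band report collector

class XTraceMetadata:
    def __init__(self, task_id=None):
        self.task_id = task_id or uuid.uuid4().hex[:8]  # identifies the whole task
        self.op_id = uuid.uuid4().hex[:8]               # identifies this operation

    def _report(self, kind, child):
        # Record the parent -> child edge; offline analysis stitches
        # whatever reports were collected back into a task tree.
        REPORTS.append((self.task_id, kind, self.op_id, child.op_id))

    def push_next(self):
        """Propagate to the next operation in the same layer (e.g. the next HTTP hop)."""
        child = XTraceMetadata(self.task_id)
        self._report("next", child)
        return child

    def push_down(self):
        """Propagate to the layer below (e.g. from HTTP down to TCP)."""
        child = XTraceMetadata(self.task_id)
        self._report("down", child)
        return child

# One request crossing a proxy, with each HTTP operation tagging its TCP send:
http_client = XTraceMetadata()
tcp_send_1 = http_client.push_down()
http_proxy = http_client.push_next()
tcp_send_2 = http_proxy.push_down()
print(REPORTS)  # the edges from which the offline task tree is built
```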

I really liked the idea of X-Trace ... from a completely practical perspective it would be incredibly helpful. In fact, I wouldn't be surprised if a company like Google, which writes or rewrites all of the software it uses internally, has tried to develop a similar architecture. The benefit they have, of course, is that they have complete control of all of their machines and can make all the necessary changes.

I suppose the counter-argument for something like X-Trace is that some people really don't want others to know how data is making its way through the Internet. It could be an economic concern or a security concern (although security through obscurity is not the right way to go). To that effect, it is really unclear to me how the community would respond (or did respond) to this sort of work.

1 comment:

Randy H. Katz said...

Awesome posting. I need to process this in more than real time. About X-trace, you are spot on that it was motivated by the need to solve a practical problem. Its elegance is that by following packet paths, you can reconstruct the causality of events across the network. This is an amazingly useful thing. There are practical issues in terms of scaling, implementation, information hiding, etc., but the visibility it gives you is incredible.