Monday, November 24, 2008

The Modern Data Center

I've titled this post "The Modern Data Center" because I'm discussing papers that are targeted at the modern data center: either (1) a private data center that has been heavily administered or (2) a public data center that is used in a best-effort style without any sort of service level agreements. It is important to realize, of course, that administration of your resources in public best-effort data centers might still be valuable and therefore (1) is not completely disconnected from (2). The papers I'm discussing are "A Policy-aware Switching Layer for Data Centers" [PLAYER], and "Improving MapReduce Performance in Heterogeneous Environments" [LATE].

The authors of [PLAYER] were attempting to provide a better solution to the current state-of-the-art in network setup and administration. The main issue was that the current techniques were incredibly inflexible, sometimes inefficient, and sometimes incorrect. Their solution was to introduce a new layer into the stack called the PLayer that allowed them to route traffic in such a way that the network could change dynamically. This provided, for example, the ability to add and remove middleboxes effortlessly. In existing deployments, middleboxes are typically connected in series even if not every piece of traffic needs to be handled by that middlebox (this motivates the inefficiency argument). In such a deployment, adding or removing a middlebox would require very sophisticated configuration or taking all the machines in that series offline while performing any changes.

I completely support the motivations driving the authors of [PLAYER], but I'm not convinced that their solution is the right one (it seems a bit too involved). For example, taking middleboxes off of the physical network path seems like the right thing to do. In general, any deployment that gives you the "logical" arrangement you want without forcing you into any specific "physical" arrangement is preferable because it gives you flexibility. The authors do at some point suggest that these indirect paths may introduce an efficiency concern, but I'm not as convinced.

My issue with [PLAYER] is nothing more than the fact that I feel like they could have solved this problem simply using netfilter. In fact, I can see only one main advantage to using their system that would not have been achieved by default with netfilter: they have a single administrative source (versus administering netfilter at each of the routers and middleboxes) that disseminates the configuration appropriately. I actually see this as a HUGE selling point, although perhaps they could have instead created a "distributed netfilter". Another way of looking at this is that perhaps this is great research because in retrospect it is ... obviously a good idea! To the credit of the authors, they provide lots of great intellectual arguments for why their setup is robust in certain network settings, which I think could also be applied to the right setup of a distributed netfilter.

The authors of [LATE] discuss an even more modern usage of data centers, commonly dubbed cloud computing. These authors specifically wanted to explain why the open-source MapReduce variant Hadoop performs poorly when the cloud is heterogeneous. Specifically, they propose and justify a new scheduler, Longest Approximate Time to End (LATE), to be used within Hadoop that outperforms the native Hadoop scheduler in most of their tests.
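
The heuristic itself is easy to sketch. Roughly (with illustrative field names and thresholds of my own, not Hadoop's actual API): estimate each task's progress rate, estimate its time left as (1 - progress) / rate, and speculatively re-execute the slow task with the longest estimated time to end:

    # A minimal sketch of the LATE idea; task fields and the 25% cutoff here are
    # my own stand-ins (the real scheduler also checks that the requesting node
    # is fast and caps the number of speculative copies).
    def estimated_time_left(task, now):
        elapsed = now - task["start_time"]
        progress_rate = task["progress_score"] / elapsed   # fraction of work per second
        return (1.0 - task["progress_score"]) / progress_rate

    def pick_speculative_task(running_tasks, now, slow_fraction=0.25):
        if not running_tasks:
            return None
        rates = sorted(t["progress_score"] / (now - t["start_time"]) for t in running_tasks)
        cutoff = rates[int(len(rates) * slow_fraction)]   # only consider the slowest tasks
        candidates = [t for t in running_tasks
                      if t["progress_score"] / (now - t["start_time"]) <= cutoff]
        # Speculate on the candidate that will finish farthest in the future.
        return max(candidates, key=lambda t: estimated_time_left(t, now)) if candidates else None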

For the most part, I like the work behind [LATE]. The authors investigated a novel piece of cloud computing, found some deficiencies, and proposed good, well-founded solutions. Perhaps my only criticism of the [LATE] work is that I don't share the authors' view that the Hadoop maintainers made implicit assumptions about the environments in which Hadoop would be used. Rather, I'm guessing something more practical occurred: Hadoop was implemented on a homogeneous cluster and fine-tuned under that setting. Of course, I'm not sure about this, but I'm dubious that the maintainers of Hadoop, had they developed on a cloud computing platform like Amazon's EC2, would still have made the scheduling implementation decisions they made. In fact, I would be very interested to see what Jeff Dean and others have to say about this work ... I'd be curious if they had experienced any sort of heterogeneous effects themselves.

Tuesday, November 18, 2008

One Size Doesn't Fit All

I have to use one of my favorite sentences from any paper to start off this post (slightly modified): "it is fundamentally impossible to provide an abstraction in a way that is useful to all applications and to implement these abstractions in a way that is efficient across disparate application needs" [Exokernel]. In many ways, this is reminiscent of the end-to-end argument!

So what am I blabbering about today? Two papers: "An Architecture for Internet Data Transfer" [DOT] and "A Delay-Tolerant Network Architecture for Challenged Internets" [DELAY].

The theme I've chosen is this notion that one size doesn't fit all, which is very much the motivation behind DOT. Technically speaking, they advocate the separation between a "control" channel and a "data" channel, with a clean interface hiding the actual mechanisms of transport. That way, when data actually needs to be exchanged the application can select some generic "policy" such as best-effort or reliable, high-bandwidth or low-latency, etc., and the system can figure out how to get the data there. In my opinion, I wish this layering had existed in the first place ... instead of being stuck with TCP and UDP. That is, TCP and UDP are not a size (abstraction) that fits all, but we don't have any other options! Of course, practically speaking it is hard to deploy other options, and even DOT would require applications to be modified and redistributed.
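
To make the control/data split concrete, here is the kind of interface I imagine, as a rough sketch only; the names (put, get, oid, hints, policy) are my own illustration, not the actual DOT API, and the sender and receiver share one object purely for brevity:

    class TransferService:
        """A toy stand-in for a DOT-style transfer service, not the real API."""

        def __init__(self):
            self.objects = {}

        def put(self, data):
            # Sender side: hand the bytes to the transfer service, get back a small
            # descriptor (an object id plus hints about where/how to fetch it). Only
            # this descriptor travels over the application's own control channel.
            oid = hash(data)
            self.objects[oid] = data
            return {"oid": oid, "hints": ["tcp://sender:9000"]}

        def get(self, descriptor, policy="reliable"):
            # Receiver side: the service, not the application, picks the transfer
            # mechanism (caching, multiple paths, ...) that satisfies the policy.
            return self.objects[descriptor["oid"]]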

Nevertheless, I'm definitely a supporter of something like DOT. In fact, one of the possible applications of DOT is the ability to exploit multiple network paths, which is exactly what I'm presently working on! They even show the types of savings you can get when using multi-paths, which can be as high as 50%!

One of the other benefits of providing a better-defined transport interface (or a less-defined one, depending on how you think of it) is that you can create transport policies that adhere well to certain network environments, like "challenged networks". This brings us to our second paper, where the authors discuss the need for a network architecture that can be robust in the face of many possible "failures" (mostly latency issues). Ironically, these "delay-tolerant" networks seem to make Vern Paxson's pathological cases the common ones.

I've always had my issues with delay-tolerant networks:
  • I'm concerned that all the "internets" we set up in developing regions will soon be overrun by "real" infrastructure (you see it all the time ... in fact, some "third world" countries have better infrastructure simply because an industry has moved in and paid the initial up-front cost knowing they can reap benefits in the future). This makes me wary of trying to create delay-tolerant GENERAL PURPOSE networking (for HTTP, etc.).
  • In fact, I see delay-tolerant networks as being great for specialized settings, under specialized environments, using specialized hardware and specialized software. The military is a specialized environment; sensor nets collecting telemetry information are also specialized.
I think there are lots of interesting research questions regarding DTNs; I'm just not convinced that it is something we would want for the general purpose Internets.

Thursday, November 13, 2008

Understanding the Series of Tubes ...

The Internet is massive and complicated, and this makes it difficult for researchers to understand its deficiencies and strengths and for developers to write applications on top of it. In many ways there isn't really any such thing as a "debugger" for the network, and that makes it hard to do performance debugging (researchers) and logic debugging (application developers).

I'm discussing two papers here, the first is "End-to-End Internet Packet Dynamics" [DYNAMICS] and the second one is "X-Trace: A Pervasive Network Tracing Framework" [XTRACE].

The author (Vern Paxson) of [DYNAMICS] was very interested in understanding network pathologies like out-of-order delivery, packet replication, and packet corruption. I do find it rather ironic that these are "pathologies" considering the architects of the Internet wanted it to be robust in exactly these cases. In fact, I don't know how far off base I would be in suggesting that the original architects of the Internet saw these pathological cases as much more of the norm than we consider them today. Vern even comments that in the presence of these pathologies certain network assumptions are no longer applicable ... this implies to me that the community has generally accepted that the common case is no out-of-order delivery, no packet replication, and no packet corruption.

So how devastating can these pathologies be? Well, Vern claims that reordering is not common on all paths, but it can be incredibly common on some paths, especially paths that have frequent route fluttering. In general, however, it seemed the consensus was that reordering generally has little impact on TCP performance. The exception, however, is when trying to determine the duplicate ack threshold. In some cases it might be beneficial to delay the dups to better disambiguate packet loss from reorderings (disambiguating losses seems to be a big theme for a lot of papers ...).

And packet replication? Not such a big deal, apparently. It only seems to occur due to buggy hardware or software configurations. 

And packet corruption? About 1 in 5,000 packets arrives corrupted; given a 16-bit checksum, roughly one in 65,536 of those corrupted packets will be erroneously accepted, resulting in undetected data corruption! Those don't really seem like the best odds ... or perhaps I am missing something here.
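
A quick back-of-the-envelope check of that math (my arithmetic, not a figure from the paper):

    corruption_rate = 1 / 5000        # about one packet in 5,000 arrives corrupted
    checksum_miss_rate = 1 / 65536    # fraction of corrupted packets a 16-bit checksum misses
    undetected = corruption_rate * checksum_miss_rate
    print(round(1 / undetected))      # ~327,680,000 packets per undetected corruption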

To measure the bottleneck bandwidth of routes Vern uses packet pair, which I thought was an elegant little trick. The packet pair technique works as follows: send two packets back-to-back with spacing x and measure what their spacing is when they get to the receiver. The amount they were "spread out" tells you the bottleneck bandwidth on the path!
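
A toy version of the calculation, ignoring cross traffic and clock issues:

    def bottleneck_bandwidth(packet_size_bytes, receiver_spacing_seconds):
        # The bottleneck link can only forward back-to-back packets of size L at a
        # rate of one every L/B seconds, so the spacing seen at the receiver
        # reveals the bottleneck bandwidth B.
        return packet_size_bytes / receiver_spacing_seconds

    # e.g. two 1500-byte packets arriving 1.2 ms apart => 1,250,000 B/s (~10 Mbps)
    print(bottleneck_bandwidth(1500, 0.0012))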

Vern continues his study by looking more closely at packet loss and the efficacy of retransmissions. He has a nice classification of the three types of retransmissions: "unavoidable", "coarse feedback", and "bad retransmission timeout". This is interesting because it reminds us that in our systems there are two kinds of losses: real losses versus measurement losses (when our tools tell us there is a loss but there really wasn't one). Clearly we want to minimize the number of retransmissions caused by the latter. It was even more interesting to see that there were still some buggy implementations of TCP out there that made it such that the "unavoidable" retransmissions were not the most frequent!

This paper had lots of goodies in it ... and I could probably go on and on and on. Instead, I'll switch gears and discuss X-Trace a bit.

The X-Trace authors were trying to solve a real problem that clearly had plagued some of them before. I know I've been trying to build Internet applications and wondered aloud what in the heck could possibly be happening to my packets? In fact, I'm working as we speak on wireless mobility and some of my results really make me scratch my head ... if I could know exactly where my packets were going, that would be amazing!

Of course, that is a lot to ask for: it is hard enough to get lots of administrative domains to implement the current standards; it seems dubious that we can impose more standards on top. That said, I thought the authors of X-Trace made two points very clear: (1) X-Trace can be implemented incrementally and (2) the overhead of adding X-Trace is not that substantial, from either a cognitive perspective or a performance perspective. Both (1) and (2) make X-Trace much more appealing to the community.

The X-Trace concept is relatively straightforward: application layers must perform "pushNext" and "pushDown" events on packets in order to propagate the metadata along the path. The metadata must travel in-band in order to be useful, but the resulting trace can be returned in any manner appropriate (out-of-band, delayed, abridged, not at all, etc). Once a user has gotten as much of the trace as possible, they can perform an offline computation of the data to get the resulting "task tree".
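
Here is how I picture the propagation, as a conceptual sketch only (the field names and report format are my own, not the actual X-Trace specification):

    import uuid

    def new_metadata(task_id, parent_op=None):
        # Metadata carried in-band with each message: a task id shared by the whole
        # request plus an operation id unique to this hop/layer.
        return {"task_id": task_id, "op_id": uuid.uuid4().hex[:8], "parent_op": parent_op}

    def push_next(md):
        # Same layer, next hop: a new operation that points back at its parent.
        return new_metadata(md["task_id"], parent_op=md["op_id"])

    def push_down(md):
        # Lower layer (e.g. HTTP -> TCP): also a new operation with an edge back to
        # the operation that caused it.
        return new_metadata(md["task_id"], parent_op=md["op_id"])

    def report(md, where):
        # Reports can be returned out-of-band (or delayed, or not at all); offline
        # they are joined on task_id and the parent edges rebuilt into the task tree.
        print(md["task_id"], md["parent_op"], "->", md["op_id"], "at", where)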

I really liked the idea of X-Trace ... from a completely practical perspective it would be incredibly helpful. In fact, I wouldn't be surprised if a company like Google, that writes or rewrites all of the software used internally, has tried to develop a similar architecture. The benefit they have, of course, is that they have complete control of all of their machines and can make all the necessary changes.

I suppose the counter-argument for something like X-Trace is that some people really don't want others to know how data is making its way through the Internet. It could be an economic concern or a security concern (although security through obscurity is not the right way to go). To that effect, it is really unclear to me how the community would respond (or did respond) to this sort of work.

Thursday, November 6, 2008

Internet Indirection

The theme for today's discussion is Internet indirection! Middleboxes like firewalls and network address translation boxes have proliferated in the Internet but are often scorned for breaking the end-to-end design argument. Specifically, a firewall filters packets by explicitly looking at packets that are not addressed to it (and, depending on the firewall, looking as far up as the data payload of the higher level protocol). A NAT, on the other hand, might be considered a worse or lesser offender because it actually rewrites the packet header as necessary (changing the source address or destination address).

Two papers address possible alternative Internet architectures that allow, or even encourage, indirect processing of packets. The two papers are, "Middleboxes No Longer Considered Harmful" [MIDDLE] and "Internet Indirection Infrastructure" [i3].

The [MIDDLE] paper can actually be thought of as an extension to [i3] that focuses exactly on supporting middleboxes like firewalls and NATs. In many ways it seems like [i3] subsumes [MIDDLE] and provides the more generic architecture. In fact, after reading [i3], I wasn't horribly impressed with [MIDDLE]!

The [i3] work, however, I found very elegant and appealing. The principle of i3 is rendezvous routing. Essentially, a server can place a public trigger on the Internet and end hosts can try and send data to the id of that trigger which then performs routing lookups to determine the ultimate destination IP. The indirection is the fact that you send to these identifiers rather than sending to an IP address and the identifiers are stored uniquely and can be updated as the server sees fit (as it moves, for example). Great idea, in my opinion.
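
A toy version of the rendezvous idea, just to make the indirection concrete (the names are my own; in real i3 a trigger lives at the DHT node responsible for its identifier):

    triggers = {}   # identifier -> current address

    def insert_trigger(identifier, address):
        # The server (re)inserts its trigger whenever its location changes.
        triggers[identifier] = address

    def send_to_id(identifier, packet):
        # Senders address packets to the identifier; the infrastructure resolves it
        # to whatever address the trigger currently points at and forwards the packet.
        print("forwarding", packet, "to", triggers[identifier])

    insert_trigger("my-service-id", "10.0.0.5")
    send_to_id("my-service-id", "hello")
    insert_trigger("my-service-id", "10.0.0.9")   # server moved; senders are unaffected
    send_to_id("my-service-id", "hello again")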

Of course, how much different is this than requiring that every single send of data require a DNS lookup? As long as I can update my DNS name server every time my address changes and have it resolve my mapping with a TTL of 0, then it seems like I can more or less accomplish the same thing. The crux is how efficiently this could be performed for something the size of the entire Internet. The basis of DNS is that caching provides performance, and in the absence of caching a lookup mechanism like a DHT might be the absolute right way to go.

To allow middleboxes, i3 has a notion of a stack of identifiers. This allows sending data to a certain identifier which itself has a stack of identifiers that the packet must traverse, very much like source routing. Of course, now, because each IP packet is addressed to the right node, that node can do whatever transformations of the packet it sees fit. The other added benefit of this model is that you don't need the middlebox to exist physically in between an end host and the rest of the Internet! The authors of [MIDDLE] discuss the advantages of this in depth, although they also mention that physical separation can sometimes be advantageous for security.

The identifiers in i3 were interesting because, while they did provide for this indirection, they also make it clear that identifiers should be picked in a way which provides geographical/network proximity. In fact, they even mention making one part of the identifier fixed for geography and the other part vary (the "true" identifier). This sounds strikingly similar to a hierarchical address ... (IPv6?).
   
In terms of performance, [MIDDLE] seemed to have lots of unfortunate overheads. I was rather disappointed with their discussion of packet overheads due to their extra header ... obviously there will be some packet overheads! Essentially, overlay solutions like these might suffer from extra round trip latencies, extra processing time (to figure out, for example, where to route to next), etc. Given these, it makes sense that we haven't seen these take over in the Internet ... at least not yet. Of course, there are lots of companies trying to use the overlay and P2P-like model to multicast stream live media.

Tuesday, November 4, 2008

Domain Name System

The Domain Name System (DNS) is probably the best-known example of a widely used distributed system that the Internet has to offer. I'm discussing two papers in this blog post related to DNS, "Development of the Domain Name System" [DEV] and "DNS Performance and the Effectiveness of Caching" [PERF].

The authors of [DEV] gave a report on the development of DNS and the lessons learned. First, they clearly make a point of trying to justify that their system can scale. They even claim that close to "50% of all root server traffic could be eliminated by improvements in various resolver implementations to use less aggressive retransmission and better caching". Interestingly enough, the authors of [PERF] claim that implementations are still "overly persistent"! In fact, they say that implementations still incur many more retransmissions than are necessary. It seems ironic (in the sense that there is all this other distributed systems research) that the entire DNS system scales based on the simple ideas of redundancy and caching!

While the authors of [DEV] claim that caching was a success ... they don't mention some of the security issues (or mention security at all) that makes caching a potential problem. Specifically, there have been techniques that have taken advantage of DNS caches (cache poisoning). I'm not sure of the status of these exploits ...

The authors of [DEV] also discussed how negative caching can be effective even though it conflicts with their intuition. The authors of [PERF] actually show that 13% of all lookups actually result in a negative response. This actually seems very realistic to me because lots of applications try and do reverse lookups on IP addresses. However, lots of name server implementations choose simply not to include that information because the IP addresses that they map are constantly changing (I believe dynamic DNS has this problem).

I'll close with perhaps my favorite part of the discussion, why performance of the DNS system was better than use of the local HOSTS.TXT file. As the authors state, "the old mechanisms were created for a much smaller database and were not adjusted as the size of database grew explosively". That is, DNS was immediately more efficient than doing a search through the really large HOSTS.TXT file! This seems like a lucky break for the DNS folks ...


Thursday, October 30, 2008

P2P!

The truth is, P2P is a really cool research topic, even if its number one application is sharing content illegally. Thus, I'm spending this post discussing the primary role of P2P: looking up data (which arguably most early P2P applications got wrong). The paper I'll be addressing most is Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications.

First, it should be clear that lookup is not inherently an easy task. Or at least it isn't if we assume that we want resiliency to failures, i.e. no central database, nor even a collection of root databases under some hierarchical scheme (like DNS). Given that, a straw man approach would be something like broadcasting the message out ... but that clearly sucks. We would prefer a solution that bothers as few nodes as possible. The authors of [CHORD] present an O(log N) solution! They do have an assumption, of course, which is that the individual nodes are willing to store information about content that they might not themselves actually have (i.e. they store a pointer, or pointers, to who actually has the content, and they are the ones responsible for resolving any lookups for that data).

Some key parts of Chord:
  • Essentially a distributed hash table.
  • Uses consistent hashing, which has the nice property that each node in the system sees approximately an equal number of keys, and when a node joins or leaves the system only O(1/N) keys need to be moved to a new node.
  • Avoids the requirement that every node knows about every other node. Each node only needs to store information about O(log N) nodes; this is especially helpful in routing and keeps lookups at O(log N).
  • Was implemented for correctness, i.e. handles potentially conflicting concurrent operations (joining or leaving, failure, etc) well.
The authors show a few simulations of Chord, as well as one real Internet prototype, in which it scales well as the number of nodes increases. I'm curious, however, how well this performs in smaller settings where latencies are actually much lower (like within a single data center). Lots of applications are being written in distributed styles where they have lookup needs that they would like to perform efficiently. I'm curious if the design decisions that went into Chord in any way limit it from performing as optimally as it could in a smaller data center like environment ... or if there are instead much better mechanisms for doing lookups there.
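
Going back to the consistent hashing point above, here is the core idea in miniature (the lookup is done centrally here for brevity; what Chord adds is the finger table so each node can find a key's successor while knowing only O(log N) other nodes):

    import hashlib
    from bisect import bisect_right

    M = 2 ** 16   # a tiny identifier space for illustration

    def chord_id(name):
        return int(hashlib.sha1(name.encode()).hexdigest(), 16) % M

    def successor(node_ids, key):
        # A key is stored at its successor: the first node clockwise from the key's
        # identifier on the ring (wrapping around at the top).
        ids = sorted(node_ids)
        i = bisect_right(ids, chord_id(key))
        return ids[i % len(ids)]

    nodes = [chord_id("node%d" % i) for i in range(8)]
    print(successor(nodes, "some-file.mp3"))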

Tuesday, October 28, 2008

Overlay Networks

I'm addressing two papers in this post: Resilient Overlay Networks [RON] and Active network vision and reality: lessons from a capsule-based system [ACTIVE]. Both papers are attempting to address the shortcomings of the existing Internet architecture.

The authors of [RON] propose creating an overlay network with three goals in mind, "(i) failure detection and recovery in less than 20 seconds; (ii) tighter integration of routing and path selection with the application; and (iii) expressive policy routing".

The authors of [ACTIVE] also propose to create an overlay network; however, they state their main goal as simply "allow untrusted users to control the handling of their own packets within the network", which, more or less, is the aim of items (i) and (ii) of [RON]. [ACTIVE] makes it much more explicit, however, and therefore seems to have been harshly criticized (why would I ever want someone else's code to run on my router!).

While the authors of [ACTIVE] do mention that loss information is "clearly useful for congestion-related services", they don't seem to make loss detection and recovery a top motivation for their work. (Because, come on, their work can effectively subsume [RON] ...)

In [RON], however, loss is a huge, if not the ultimate, motivation. The issue is this: when routes are lost in the network the existing BGP protocol is ineffective, taking on the order of minutes to stabilize a new route. For lots of applications this is unacceptable (consider a VoIP call: not only will we get "disconnected", but I can't even call back and try again ... unless I wait for a few minutes). As the authors of [RON] point out, a user's perception of a "disconnected" connection is not on the order of the several minutes that BGP imposes; someone might be willing to hang on for as long as about 20 seconds.

Of course, measuring path outages is hard, which [RON] makes clear. This reinforces, once again, that perhaps, while simple and elegant, using only a single bit of information for congestion and loss is unacceptable! I liked their discussion of failures: they categorized them in terms of either the network's perspective or the application's. The network sees link failures and path failures while an application sees only outages and performance failures.

In general, between both [RON] and [ACTIVE] I was a little concerned about adding this "hammer" on top of the existing Internet architecture. The authors of [RON] even acknowledge that it is unclear how effective a RON would actually be if it became a standard (specifically, what happens when everyone is sending "active probes" to figure out route information ... I mean, should active probes actually be the way we figure out routes?!).

In the case of [RON], why can't we try and get a more responsive BGP to get fast detection and recovery? Why can't we get a better TCP as the basis for applications (so that they can more tightly integrate)?

In the case of [ACTIVE], do programmers really want, or need, to make decisions about routing/forwarding at each and every hop? They even say themselves, "... it is clear that greater application experience is needed to judge the utility of active networks. No single application ("killer app") that proves the value of extensibility has emerged, and none is likely to ..."!

Speaking of extensibility, I disagree with the analogy the authors of [ACTIVE] draw in saying that "Active networks are no different in this regard than extensible operating systems, databases, or language implementations". I don't see it so optimistically ... when you are using disparate resources managed by lots of different people (unlike code running in an operating system on ONE computer), extensibility is clearly not always what you may want (again, I don't want some other person's code executing on my router!).

The authors of [ACTIVE] do provide some other valuable insights. Specifically, that systematic change is difficult and it seems necessary to be able to do experimentation without requiring deployment. Of course, isn't that what PlanetLab was for? In addition, the authors of [ACTIVE] discussed how their work may clash with the end-to-end argument, but they claim that it is not an issue (and lots of deployed services already do things that break the end-to-end argument ... firewalls, proxies, etc).

Ultimately, it seems that the authors of [ACTIVE] see their work as being most compelling for network layer service evolution ... which seems like a good fit.

Tuesday, October 7, 2008

Multi-Hop Wireless Routing

In continuation of the multi-hop wireless discussion from previous posts, I'm discussing some salient points from two papers, A High-Throughput Path Metric for Multi-Hop Wireless Routing [ETX] and A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols [COMPARE].

Again, the motivation for researching routing policies for multi-hop wireless networks should be clear ... the problem changes drastically when you move away from the typical access-point style of wireless communication (where wireless communication is never more than one hop).

The authors of [COMPARE] evaluated four different routing protocols based on three metrics: packet delivery ratio, routing overhead, and path optimality. Packet delivery ratio is exactly what it sounds like and is comparable to the link loss ratios in [ETX] (which will be discussed shortly). Routing overhead is exactly what it sounds like as well. Path optimality is an interesting metric, defined in the paper as follows: "in the absence of congestion or other 'noise', path optimality measures the ability of the routing protocol to efficiently use network resources by selecting the shortest path from a source to a destination".

Before I divulge the results of [COMPARE], I feel it is a fair time to motivate [ETX], whose premise was exactly this: the most efficient paths through multi-hop networks may not always be the shortest ... i.e. shortest path is a bad metric for path optimality. The authors of [ETX] provide examples of such scenarios, but the crux of the problem comes down to the fact that not all wireless links are high-throughput and lossless. Rather, they again reinforce an important point: a single hop that delivers only 50% of packets might be less useful than two hops that each deliver 100% of packets.

In [COMPARE] the authors allude to the possible problems of shortest-path optimality as "congestion or other 'noise'". It turns out, as was shown in [ETX], that there are a lot of lossy links ... a.k.a. links that suffer from a lot of 'noise' (interference, etc.). In addition, many of these links are asymmetric. That is, sending packets in one direction has a higher delivery ratio than in the other direction. This is clearly an issue for most wireless protocols because while data may only be sent in one direction, an ACK must be sent in the other direction. Ultimately the authors of [ETX] designed a new metric called ETX (Expected Transmission Count) that is based on delivery ratios, asymmetry in the delivery ratios, and interference over longer routes.
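
The metric itself is worth writing down. As I understand it from the paper, a link's ETX is the expected number of transmissions needed for a successful data/ACK exchange, computed from the measured forward and reverse delivery ratios, and a route's metric is just the sum over its links:

    def link_etx(df, dr):
        # df = forward delivery ratio (data), dr = reverse delivery ratio (ACK).
        return 1.0 / (df * dr)

    def route_etx(links):
        # A route's metric is the sum of its links' ETX values.
        return sum(link_etx(df, dr) for df, dr in links)

    print(link_etx(0.6, 0.6))                    # one lossy hop: ~2.78 expected transmissions
    print(route_etx([(1.0, 1.0), (1.0, 1.0)]))   # two clean hops: 2.0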

A few more thoughts:
  • The authors of [COMPARE] claim that wireless networks have a fundamental limitation compared to wired networks for sending broadcast packets, due to the fact that there is no medium to reserve for all "nodes in proximity". Two thoughts here: (1) it seems like you could augment the RTS/CTS scheme to allow a node to broadcast a message to all nodes that can hear the RTS/CTS; (2) the multi-hop wireless protocols of Roofnet don't use RTS/CTS either.
  • The authors of [COMPARE] focused their studies on mobility and movement (which naturally changed the network topology) while the authors of [ETX] have a more subtle assumption that the wireless nodes are stationary (they at least assume this in Roofnet).
  • DSR (dynamic source routing) seemed to be the best candidate for environments like those simulated in the [COMPARE] paper (highly mobile). 

Thursday, October 2, 2008

More on Wireless ...

Two more papers discussing wireless technologies, Modeling Wireless Links for Transport Protocols [MODELING] and Architecture and Evaluation of an Unplanned 802.11b Mesh Network [ROOFNET].

The authors of [MODELING] were intent on showing the dismal state of modeling and analyzing existing wireless deployments and on suggesting alternative, hopefully better, characteristics to use for modeling. I found this study to be a bit dry, but I did have a few take-away points:
  • There is a definite tension between changing wireless links versus transport protocols. I actually alluded to this in a previous post where I mentioned the hurdle researchers have in the wireless space, because it is not clear where you need to make compromises for either the wireless protocols or the transport protocols.
  • TCP is not robust in the face of new link layer technologies. IN FACT, it is rather ironic that TCP is sensitive to packet reorderings! The original design goals of networks were to be robust in the face of lost links, lost packets, or packets traveling down different paths ... yet this is clearly not the case with TCP. IN ADDITION, TCP performs poorly when corrupted packets are received ... TCP has been over-optimized for link layers that are perfect (like Ethernet). IN ADDITION, spurious timeouts due to link layer retransmissions are a severe problem and the transport protocols (like TCP) inflate transmit timers because they are agnostic of these link layer retransmissions (again, it seems that TCP has been over-optimized for an Ethernet-like link layer).
  • The packet loss==congestion signal is inadequate in wireless networks because packet loss might be due to errors ... do we need more explicit signals?
  • A possible solution might be to push information about the link layers into the transport protocols (and maybe even vice-versa as well) ... but doesn't this break the nice abstraction we had before?

The authors of [ROOFNET] showed how an ad-hoc mesh network could be deployed using 802.11b technology. Interesting take-away points:
  • In the authors words, "the highest-throughput routes often involve links with relatively high link-level loss rates, since a single-hop route with 40% loss can deliver more data than a two-hop route with perfect links. In addition, because the 802.11b bit-rates are at roughly power-of-two intervals, a high bit-rate with up to 50% loss is preferable to the next-lowest bit-rate." Interesting!
  • Fast short hops are the best policy (as determined by their routing algorithm Srcr).
  • The majority of nodes use many of their neighbors ... I actually didn't quite understand this conclusion. Does this mean that the majority of nodes have constantly fluctuating best routes to gateways?
  • Multi-hop has a throughput advantage over long low-quality single-hop because it can exploit lots of short high-quality links ... and the overhead of retransmissions appears to be low enough.
  • Roofnet does not use RTS/CTS ... in fact they have shown that it does not improve performance in their network!
An interesting link between the two papers was that [MODELING] claimed that out-of-order delivery or corrupted delivery is shown to be beneficial for real-time applications by decreasing delay, and this is especially true when there are multiple wireless links in the path. However, because none of the connections tested in [ROOFNET] were of the real-time nature, they did not conclude the same thing nor did they show how a new transport protocol might have excelled in their network.

Tuesday, September 30, 2008

Wireless: Media Access Protocols and TCP Performance

This post covers two different papers:
  • MACAW: A Media Access Protocol for Wireless LAN's [MACAW]
  • A Comparison of Mechanisms for Improving TCP Performance over Wireless Links [COMPARE]
The authors of [MACAW], motivated by the proliferation of wireless devices, sought to engineer a media access protocol that performed better than existing multiple access collision avoidance protocols (as opposed to carrier sense protocols).

The authors of [COMPARE] were interested in how TCP could be tuned (or modified) to perform better over the broadcast medium of wireless links (i.e. assuming some fixed media access protocol).

It is interesting to see how, once again, TCP has become such a dominant network transport protocol that we are willing to sacrifice possible performance in order to augment it (as little as possible) to operate over emerging network technologies. In fact, it seems very likely that a new network technology (link-layer or transport-layer), even one that outperforms existing technologies, will have a hard time being adopted if it imposes a seriously difficult transition path for TCP. The authors of [COMPARE] do mention one such new technology, the selective repeat protocol (SRP), but claim it was not conclusively a better alternative to TCP (although selective acknowledgments + TCP were).

It is not clear exactly what is the right level for abstractions and optimizations in the networking stack. Perhaps it is time to investigate how applications might actually perform better if they had a better understanding of the actual mediums that they were running on? Perhaps the applications themselves might better be able to take advantage of wireless networks by changing, for example, their sending patterns? Operating systems research has shown how applications that can take advantage of lower-level resources can often outperform those applications that rely on layers of abstractions (exokernel, scheduler activations, acpm, etc). Perhaps there is an analogue in the networking world?

An interesting aspect of [MACAW] was the investigation of the backoff counter. The authors conclude from their investigation that a backoff counter is needed for each station that a node is attempting to communicate with. In addition, all of the nodes communicating with that station need the same backoff counter in order to ensure fairness. (Digression: While I don't know enough about 802.11(a,b,g), I was not under the assumption that a station simultaneously communicated with more than one node (station), in which case I'm curious how sophisticated the backoff mechanism needs to be.) I'm curious, given that I'm a frequenter of coffee shops, how many of my "delayed" connections are due to backoff congestion or packet interference loss versus TCP congestion control. (You know, when the network sometimes just stops responding and it is best to actually re-submit the URL request ...)

(Hmm, I'm getting used to a new style of writing blog posts, less summary and more interesting thoughts from the paper to encourage feedback/conversation with others ...)

Monday, September 22, 2008

Scaling Internet Routers

In two papers from McKeown and others, "A Fast Switched Backplane for a Gigabit Switched Router" and "Scaling Internet Routers Using Optics", the authors explain how router technology can be scaled well beyond its current needs and close to its theoretical limits. One of the papers discusses how to scale routers to gigabit speeds while the other shows an impressive design for 100 terabits!

McKeown discusses how, in most routers then (circa 1995), datagrams are transmitted out of a first-come-first-served (FIFO) queue (which is crazy considering the numerous router/gateway algorithms that I have discussed in this blog). He does acknowledge "more advanced routers", but does not seem to imply that many of them actually perform many, if any, of the advanced algorithms such as Fair Queuing.

Both sets of authors describe the basic design of the router, including its parallel design, crossbar switch and use of virtual output queuing (VOQ). The VOQ discussion is especially interesting because it shows how traumatic head-of-line blocking can be (nearly 50% of performance).
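
The VOQ idea is simple enough to sketch: rather than one FIFO per input (where a packet waiting on a busy output blocks everything behind it), each input keeps a separate queue per output. A minimal illustration, with the scheduler left abstract:

    from collections import deque

    class InputPort:
        def __init__(self, num_outputs):
            # One queue per output port instead of a single FIFO for the whole input.
            self.voqs = [deque() for _ in range(num_outputs)]

        def enqueue(self, packet, output_port):
            self.voqs[output_port].append(packet)

        def dequeue_for(self, output_port):
            # Each time slot the crossbar scheduler (e.g. iSLIP) matches inputs to
            # outputs and drains only the granted queue, so a packet stuck behind a
            # busy output never blocks packets destined for idle outputs.
            q = self.voqs[output_port]
            return q.popleft() if q else None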

McKeown describes how VOQ + iSLIP (a crossbar scheduling algorithm) can achieve close to the theoretical maximum performance. They compare this with a crossbar-like router but one which uses FIFO queues (and variable-length datagrams ...). Ironically, when we relate this work to other work discussed on this blog it becomes apparent that none of the gateway algorithms are actually being used ...

Another rather interesting relation between this work and previous work discussed is its attention to "controlling delay". Specifically, to deal with traffic that needs to have low latency they employ two mechanisms: priority levels and speedup (the latter I think is much more fascinating!). Priority levels are what they sound like, but speedup is kind of clever: operate the crossbar switch twice as fast as the external line! That way, with sufficient speedup one could actually guarantee that every datagram is immediately transferred to its output port, where a departure time can be precisely scheduled (which eliminates the non-determinism inherent in the router otherwise, which screws up one's ability to reason about delay). They go on to discuss how in theory one would need a speedup of N (where N is the number of ports) to actually achieve the guarantee, but how in practice one can actually get away pretty well with a speedup of just two.

The gigabit router paper was published sometime in the mid '90s (it was actually a description of the work McKeown did at Cisco on the Cisco 12000 router). The later paper (2003) discusses the terabit routing work and emphasizes the need to use optics versus electronics (at least in some parts). I would be curious to know how necessary terabit routing really is today ...

More Congestion Avoidance (High-Bandwidth, High-Latency Networks) and Future Internet Design Issues (Real-Time Applications)

I've been absent for the past week (recovering from a concussion) and so for those reading in the blogosphere you will find a rather condensed version of my thoughts (and feelings) of the following papers:
  • Random Early Detection Gateways for Congestion Avoidance
  • Congestion Control for High Bandwidth-Delay Product Networks
  • Fundamental Design Issues for the Future Internet
  • Supporting Real-Time Applications in an Integrated Services Packet Network: Architecture and Mechanism
As the title suggests, these papers address two distinct topics: congestion avoidance/control and future internet design considerations (especially for latency intolerant applications).

The former, congestion avoidance and control, has been discussed in great detail in some previous posts, and these authors found yet more issues with existing techniques. Random Early Detection (RED) gateways were found to be effective (if used with cooperating end-nodes) at doing congestion control by doing something rather remarkable: they choose a connection to "throttle down" with probability equal to that connection's share of the bandwidth! (Note that this still avoids biasing against bursty traffic.) I find this to be a nice little result ...
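
A simplified sketch of the RED drop decision (the real gateway also spreads drops out using a count of packets since the last drop, and can mark instead of drop): keep an exponentially weighted moving average of the queue length and ramp the drop probability up linearly between two thresholds. Because a connection is hit in proportion to how many packets it sends, the notification probability works out to roughly its share of the bandwidth. The parameter values below are illustrative.

    import random

    def red_should_drop(queue_len, state, w=0.002, min_th=5, max_th=15, max_p=0.02):
        # EWMA of the queue length, so short bursts alone don't trigger drops.
        state["avg"] = (1 - w) * state["avg"] + w * queue_len
        if state["avg"] < min_th:
            return False                  # accept
        if state["avg"] >= max_th:
            return True                   # drop
        drop_p = max_p * (state["avg"] - min_th) / (max_th - min_th)
        return random.random() < drop_p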

Of course, if Fair Queuing wasn't good enough, and Core-Stateless Fair Queuing wasn't good enough, then there is no reason to believe Random Early Detection should be good enough. And, for the most part, it isn't. So Katabi, Handley, and Rohrs proposed XCP (eXplicit Control Protocol). Briefly, they ignore backward compatibility and consider a control-theoretic design. For one, this means they decouple fairness control from efficiency control and provide explicit congestion notifications (versus TCP, which has no notifications but instead relies on a timeout that it assumes implies a lost packet, i.e. congestion, which may be due to either fairness or efficiency issues).

The work of Katabi et al. is especially interesting because it shows how poorly TCP scales with increased bandwidth and latency. These authors used satellite links as the canonical example; however, I'm curious if we have actually approached similar limitations with the optical networks of today ... (in other words, how bad is TCP really? how bad is slow-start really?).

The subsequent topic was especially motivated by real-time applications. Such latency-intolerant applications were not actually new (people had been doing speech processing on the Internet since the 1980s ... before I was born!), but they had not historically accounted for much of the traffic in networks.

In Shenker's "Fundamental Design Issues ..." he discusses the need to extend the current architecture to include a new "service model". That way, flows that have special needs (i.e. real-time applications) can be treated differently than other latency tolerant flows (i.e. file transfer). He discusses a host of issues, but I found the discussion about how to "choose a service" most interesting. He expands on the issues with implicitly supplied services (the gateway chooses the service) and touches on the biggest problem (in my mind) with explicitly supplied services (the application chooses the service): abuse (applications might find ways to get better network utilization if they lie about their actual needs). Much to my enjoyment he proceeds to discuss "incentives", and how that might quell the issue (which turns into a discussion of pricing, etc).

I'm not sure that we really have done much of anything for real-time applications (at least not in the hardware; clearly we have made improvements in the software), and so perhaps it is still a valid issue today.

Thursday, September 11, 2008

Core-Stateless Fair Queueing

This paper attempts to resolve the complexity issues of implementing Fair Queueing inexpensively. In retrospect, it actually wasn't made that clear in "Analysis and Simulation of a Fair Queuing Algorithm" that FQ had such issues, so this paper is a nice continuation. Stoica et al. propose that one can perform some decentralization and push certain complexities out to the edges of networks, which in turn will yield better overall performance while still being fair. (This seems to be a design pattern in systems ... not sure exactly which one it would be though.)

The authors call their system core-stateless fair queueing because state is only maintained on the edges of network "islands", while the internal routers just use an augmented FIFO policy to do routing. The augmented FIFO algorithm inspects packet labels added by the edge routers, which help determine which packets should be dropped in order to achieve fairness.
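
The core-router behavior is essentially one line. A minimal sketch, assuming the edge router has already labeled each packet with its flow's estimated arrival rate and the core router keeps a running estimate of the fair share alpha:

    import random

    def csfq_keep(label_rate, fair_share_alpha):
        # Drop with probability max(0, 1 - alpha/label): flows sending at or below
        # the fair share pass untouched, faster flows get trimmed down to it,
        # all without any per-flow state in the core.
        drop_p = max(0.0, 1.0 - fair_share_alpha / label_rate)
        return random.random() >= drop_p      # True = forward, False = drop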

Of the entire paper, I enjoyed Section 4 (Why Are Fair Allocations Important?) the most. As many of my other posts have mentioned, the efficacy of many of these algorithms has in some way relied on cooperation! As the authors go on to discuss, "it seems unsound to base the Internet's congestion control paradigm on the assumption of universal cooperation". YET, they go on to show how, at least for the allocation approach, the algorithms provide a strong incentive for the users of the system to provide some form of responsiveness to congestion control! This is a cool result to me ... of course, this was for drop-intolerant applications. To provide incentives for drop-tolerant flows it may also be necessary to use some sort of punishment scheme.

Analysis and Simulation of a Fair Queueing Algorithm

The authors hypothesize that queueing algorithms play a crucial role in effective network congestion control because: (1) flow control is only helpful for ensuring the presence of free buffer space at a receiver and (2) adaptive routing is only helpful for pushing packets away from network congestion/bottlenecks. Queuing algorithms, they speculate, which are usually considered only helpful for controlling the order in which packets are sent and the usage of a gateway's buffer space, can also affect the way in which packets from different conversations interact with one another, which ultimately affects the collective behavior of flow control algorithms.

The authors discuss how the most common queuing algorithm, first-come-first-served (FCFS), essentially "relegates congestion control to the sources". They propose Fair Queuing (FQ) as an alternative which allows one to allocate bandwidth, promptness, and buffer space fairly. Not only does this allow fair usage of resources, it also provides lower delay to lower-throughput sources (like Telnet applications). They claim this as one of the prominent features of FQ.
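
The mechanism itself is compact enough to sketch: the gateway emulates bit-by-bit round robin by stamping each arriving packet with a virtual finish time and always transmitting the queued packet with the smallest one. A minimal sketch of the finish-time bookkeeping, omitting the paper's promptness (delta) term:

    def finish_time(flow_state, packet_size, round_number):
        # round_number is R(t), the "round" the bit-by-bit emulation has reached;
        # a flow's next packet finishes one packet-length after the later of its
        # previous finish time and the current round.
        start = max(flow_state.get("last_finish", 0.0), round_number)
        flow_state["last_finish"] = start + packet_size
        return flow_state["last_finish"]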

They do some interesting simulations and have some great takeaway points:
  • FQ by itself does not provide adequate congestion control; it must be combined with intelligent flow control algorithms at the sources.
  • Ill-behaved hosts get much less than their fair share because their packets get dropped even though they still get charged for that throughput.
  • Moreover, FQ protects well-behaved sources from their "uncouth brethren", which ultimately rewards good sources.
  • FQ provides incentive for sources to implement Jacobson and Karels or some other intelligent flow control, whereas FCFS makes using these "sacrificial".
I especially enjoyed their conclusion on how the game-theoretic perspective is important, in fact perhaps even necessary, to achieve their goals. (Again, this is further justification of how a well specified cooperative environment can perform incredibly well!)

This paper was an enjoyable read and I recommend it remain part of the curriculum. If I had any criticism I would have hoped they did more simulations that included more than one gateway (to see how these mechanisms work across larger networks).

I also think this is a prime source of information for anyone interested in looking at how virtual machine monitors do networking since, essentially, they are a gateway/router of sorts.

Tuesday, September 9, 2008

Congestion Avoidance and Control

The authors investigated whether "the 4.3 BSD (Berkeley UNIX) TCP was mis-behaving or if it could be tuned to work better under abysmal network conditions" after noticing a series of congestion collapses in the late 1980s. The conclusion to their hypothesis came in the form of seven new algorithms that were added to 4BSD TCP, including round-trip-time variance estimation, exponential retransmit timer backoff, slow-start, and others.

They rationalize their algorithms rather non-rigorously using the conservation of packets to explain a system operating stably (i.e. in equilibrium). Their new slow-start algorithm is meant to help the system conservatively reach equilibrium while their new round-trip-time estimation is meant to keep the system in equilibrium. For their new exponential backoff mechanism they also leverage linear system theory!
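For concreteness, here is the retransmit-timer estimator in roughly the form the paper describes, though in floating point rather than the paper's scaled integer arithmetic; the gains g and h are the usual 1/8 and 1/4:

    def update_rto(state, rtt_sample, g=0.125, h=0.25, k=4):
        # Track a smoothed RTT and a smoothed mean deviation, and derive the
        # retransmit timeout from both so the timer adapts to variance instead of
        # assuming a fixed multiple of the mean.
        err = rtt_sample - state["srtt"]
        state["srtt"] += g * err
        state["rttvar"] += h * (abs(err) - state["rttvar"])
        return state["srtt"] + k * state["rttvar"]

    state = {"srtt": 1.0, "rttvar": 0.5}
    for sample in (1.1, 0.9, 2.5, 1.0):
        print(round(update_rto(state, sample), 3))
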

In addition, the authors also discuss dynamic window sizing for congestion avoidance. For this they discuss how a timeout (i.e. a presumed lost packet) can stand in for the feedback bit that Jain and Chiu propose and effectively get the same results.

This paper feels much like the implementation of TCP for BSD ... slightly thrown together, all augmentations of an existing system meant to deal with common-case situations appropriately (at least that is kind of how I always felt about the BSD TCP implementation). The paper is an interesting and enjoyable read, however, because it captures how novel and exciting networking was at this time and how rich the area seemed for possible optimizations. And yet, since the late 1980s have there been any new TCP implementations that have been adopted? The robustness argument for Jacobson and Karels seems to be, 'it's good enough', which, for the most part, it is. Of course, there are a collection of people redesigning the Internet as we speak, but while operating systems have seen radical changes or additions over the years, the BSD networking stack feels like it has remained fairly consistent. (Moreover, almost every UNIX variant includes the BSD networking stack, including Mac OS X.) So, did this work have any important implications? As I write this blog I'm leveraging exactly this work right now!

Monday, September 8, 2008

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

The authors present an analysis of congestion avoidance (versus control) algorithms for computer networks. Specifically, they show that a simple additive increase and multiplicative decrease algorithm is necessary for convergence to an efficient and fair state (for which they provide definitions of each). In addition they present, albeit very superficially, some possible nonlinear algorithms (and how they actually generalize the linear ones) and practical considerations for implementing these algorithms (including how the sensitivity to system parameters causes nonlinear algorithms to be less useful in practice).
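
Since the whole analysis revolves around the additive-increase/multiplicative-decrease rule, here is a tiny simulation of it (my own toy, with synchronized binary feedback as in the paper's model) showing two flows converging toward a fair split:

    def aimd_step(window, congested, a=1.0, b=0.5):
        # Additive increase when the feedback bit says "no congestion",
        # multiplicative decrease when it says "congested".
        return b * window if congested else window + a

    # Two flows sharing a link of capacity 10 converge toward equal shares
    # no matter how unfair their starting points are.
    x, y, capacity = 8.0, 1.0, 10.0
    for _ in range(50):
        congested = (x + y) > capacity
        x, y = aimd_step(x, congested), aimd_step(y, congested)
    print(round(x, 2), round(y, 2))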

It seems that the authors' motivation for pursuing this work was to overcome the increased queuing and congestion caused by the ever-increasing heterogeneity of the Internet. In fact, without knowing more history/motivation than what is alluded to in the paper, it seems that only congestion control mechanisms had been used/included in the original OSI specifications, rather than any congestion avoidance mechanisms.

The authors did a fantastic job of delivering their material. However, I have two criticisms:

First, they choose to assume "information such as X_goal and the number of users sharing the resource ... to be unknown by the users", which, as they say, "restricts the set of feasible schemes". Then, while they do present certain convergence conditions assuming global information and revise them to satisfy their assumption (i.e. no global information), it seemed the analysis failed to discuss all of the shortcomings due to this loss of information. Specifically, what I would have liked to see was an analysis that made it clear, for every bit (or abstract bit) of information lost, what functionality/performance the algorithm lost as well. (They do at least mention future work on the "marginal utility of increased bits of feedback".) I'll probably repeat this in my future post on Jacobson and Karels' "Congestion Avoidance and Control", but I find the tradeoff in these "bits" of information incredibly interesting because, as the networking stack in BSD has shown, you can actually get away without receiving any explicit signals! So that makes me wonder, in the opposite extreme (lots and lots of signals), how much better could we do?!

My second criticism, one that perhaps the authors never dreamed of and yet a majority of the class will point out, was how these algorithms suffer in the face of malicious users. In fact, has there been recent work that has re-evaluated this analysis in the face of such malevolence?

A slight diversion: when considering resource management for parallel operating systems, what are, if any, the right feedback signals to send to applications? Should the new standard of application writing be similar to the way it works in networks? i.e. timeouts is to dropped packets as preemption is to revoked cpus (where I'm assuming a model like scheduler activations where an application is activated when a preemption occurs).

It seems fascinating to me that the solution for "dropped packets", i.e. to throttle down and restart, is one which suggests the necessity to cooperate with and fairly share the resources with other potential users. For most "desktop" applications today, writing code that "throttles down and restarts" (or some analogue) seems incredibly unnatural. In fact, most operating systems today DO NOT provide any feedback (bit or bits) to applications, but rather, keep applications completely agnostic of one another. Perhaps it was much more obvious, from the early days, that networking NEEDED to rely on cooperation to perform well. Now, as computers are becoming ever more parallel, there might be a lot of new ties with networking theory that pushes applications and operating systems into the cooperative realm!

This paper was a great read and provided what I thought was a good example of doing a formal analysis of a system. For that reason, I feel the paper should most definitely remain on the syllabus.

As the authors discuss, there seem to be lots of open research questions (including one I alluded to above that dealt with extra feedback bits). Without doing a literature survey or having greater depth in this area, it is hard to know how much of the future work they proposed is now work completed (or at least attempted).

This work was probably influential in many respects; perhaps the most notable influence this paper had was helping Jacobson and Karels justify (prove seems too strong here) their design decisions for the BSD TCP implementation.

Thursday, September 4, 2008

On Inferring Autonomous System Relationships in the Internet

The author discusses a heuristics based algorithm to determine AS relationships (which are often kept under some non-disclosure agreement for business purposes).

The author first provides the necessary background information regarding Internet architecture and BGP routing policies and tables. The author then gives detailed definitions that capture certain graph phenomena (downhill versus uphill edges, etc.) before eventually discussing the algorithm for classifying transit pairs and edges (provider-to-customer, etc.). The author then shows how well the algorithm performed experimentally on data collected from the Route Views server in Oregon.
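
The flavor of the classification can be sketched like this (my own simplification; the degree values are made up, and the real algorithm adds refinements for peering edges and anomalous paths):

    def infer_relationships(as_path, degree):
        # Take the AS with the largest degree in the path as the "top of the hill";
        # edges before it go uphill (customer -> provider) and edges after it go
        # downhill (provider -> customer).
        top = max(range(len(as_path)), key=lambda i: degree[as_path[i]])
        inferred = {}
        for i in range(len(as_path) - 1):
            a, b = as_path[i], as_path[i + 1]
            inferred[(a, b)] = "customer->provider" if i < top else "provider->customer"
        return inferred

    print(infer_relationships(["AS1", "AS701", "AS7018", "AS3"],
                              {"AS1": 10, "AS701": 2500, "AS7018": 2000, "AS3": 5}))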

This paper has a fairly substantial number of definitions and explanations, which made its overall impression a bit overwhelming. It seemed to be the case that the author felt he needed to provide the necessary background information for his audience. That being said, there is a lot of good content that will probably be helpful in getting people to communicate about these issues. My only real critique of the paper is that I had a hard time relating to the motivation of the work and how it will enable future work (probably because this isn't my area).

Even so, the paper provides interesting information (section A, Internet Architecture, and section B, Routing Policies and BGP Routing Tables) that I found valuable for understanding certain "basic" principles. For example, even statements like "the scalability of the internet routing protocol depends on the contiguousness of IP addresses", which are obvious in retrospect, seemed like a nice and helpful reminder about the "why" of certain Internet decisions.

Other than how this work will enable future work (which, as I admitted before, I don't completely understand), I think the important ramification of this work is reinforcing how routing policies can sometimes give you results which are not, for certain customers, preferable. For example, the paper reinforced how local pref attributes override shortest path length attributes! This means my packet may take longer to get where I want it to go because my ISP makes more money by sending it the long way. In fact, it would be really interesting, and cool, to see if we could somehow measure this (is it a really pervasive issue in the Internet? or, because of the topology, does this issue rarely come up?).

Interdomain Internet Routing

Hari Balakrishnan et al. discuss Internet routing from both a technical and an economic perspective.

He begins his discussion by explaining the simple properties of individual commercial entities, called autonomous systems (ASes), and how the economics behind their interactions work. Specifically, he discusses the distinction between peering and transit relationships between ASes and the importance of export and import policies, where the essence is of course "no ISP wants to act as transit for packets that it isn't somehow making money on". Ultimately, routes to paying transit customers are universally exported while routes received from providers are typically selectively exported (exported to paying customers but not peers). Routes are imported given the "customer > peer > provider" priority order, which makes sense monetarily.

Hari continues the discussion by investigating the technical side of BGP, specifically its three design goals: scalability, policy, and cooperation under competitive circumstances. BGP is essentially a TCP application that communicates differently depending on whether the session is within an AS (iBGP) or between ASes (eBGP). Eventually, a router determines which routes to use based on a priority of certain attributes of the route, a few internal metrics, and a tie-break rule. It seems like most routes, however, are really just decided via the LOCAL_PREF rule, which essentially reinforces our "customer > peer > provider" ordering, i.e. we should prefer customer routes over peer and provider routes.
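
To make the import side concrete, here is a toy version of how LOCAL_PREF encodes the "customer > peer > provider" preference; the numeric values are illustrative, not anything from the lecture notes:

    LOCAL_PREF = {"customer": 300, "peer": 200, "provider": 100}   # illustrative values

    def best_route(candidate_routes):
        # candidate_routes: list of (relationship_of_neighbor, as_path) tuples.
        # Higher LOCAL_PREF wins; shorter AS path only breaks ties.
        return max(candidate_routes, key=lambda r: (LOCAL_PREF[r[0]], -len(r[1])))

    print(best_route([("provider", ["A", "B"]),
                      ("customer", ["C", "D", "E"])]))
    # picks the customer-learned route even though its AS path is longer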

It is hard to critique informative work like this ... nor do I have any glaring issues. In fact, I think this should definitely be included in the syllabus; it is an easy read and for people like me who had only a superficial understanding of BGP, it was very intriguing.

Hari does discuss a few open problems. The most interesting of these seems to be the multi-homing issue. It seems like there have to be better technical solutions than the current ones, and it might be fun to discuss this problem in more depth in class.

Tuesday, September 2, 2008

The Design Philosophy of the DARPA Internet Protocols

The paper explains some of the motivations and justifications behind the design and implementation decisions of the Internet and its well-known protocols. Perhaps the most gratifying part of the paper is the discussion about the necessity of splitting IP out of TCP and eventually adding UDP (at least this bit of history clears up the oddity in the TCP versus UDP naming). Of course, this was gratifying because it reinforced the conclusions made in the paper "End-to-End Arguments in System Design".

In connecting these two papers further, I found the statement "a surprising observation about the control of variation in delay is that the most serious source of delay in networks is the mechanism to provide reliable delivery" to be most poignant. That is, it highlights the correctness versus performance trade-off that I discussed in my previous post. In this case it argues that performance may actually suffer from reliable delivery (whereas before we were claiming we could/should use it for performance).

End-to-End Arguments In System Design

The paper argued that system design should not try to be overly aggressive in providing functionality that may, in the future, be seen as too restricting, unneeded, or, even worse, redundant. The big take-away message for me seems to be a recurring theme in computer science ... high-level semantics are hard to predict and reason about, but are often the key thing to exploit when you want good performance. In this case, the canonical example is some sort of telephony service, where the high-level semantics of the application PERMIT lost packets because the clients (in this case humans) can effectively handle the lost packets by, for example, asking for something to be repeated. Clearly, knowing about this can help optimize how you actually write your network service.

Another aspect of this paper that I wanted to emphasize was the authors' statement that the "reliability measures within the data communication system is seen to be an engineering tradeoff based on performance, rather than a requirement for correctness". Typically, I think we have considered layering our applications on top of lower-level architectures (like TCP) because they provide a level of correctness that we don't have to worry about. It is rather interesting to consider that these authors believe the original intention of "reliability" was actually to benefit the performance of the application rather than its correctness ... in my experience, applications get better (best?) performance when they can fine-tune everything to their needs, including their own "reliability" semantics.