Monday, November 24, 2008

The Modern Data Center

I've titled this post "The Modern Data Center" because I'm discussing papers that are targeted at the modern data center: either (1) a private data center that has been heavily administered or (2) a public data center that is used in a best-effort style without any sort of service level agreements. It is important to realize, of course, that administration of your resources in public best-effort data centers might still be valuable and therefore (1) is not completely disconnected from (2). The papers I'm discussing are "A Policy-aware Switching Layer for Data Centers" [PLAYER], and "Improving MapReduce Performance in Heterogeneous Environments" [LATE].

The authors of [PLAYER] were attempting to provide a better solution to the current state-of-the-art in network setup and administration. The main issue was that the current techniques were incredibly inflexible, sometimes inefficient, and sometimes incorrect. Their solution was to introduce a new layer into the stack called the PLayer that allowed them to route traffic in such a way that the network could change dynamically. This provided, for example, the ability to add and remove middleboxes effortlessly. In existing deployments, middleboxes are typically connected in series even if not every piece of traffic needs to be handled by that middlebox (this motivates the inefficiency argument). In such a deployment, adding or removing a middlebox would require very sophisticated configuration or taking all the machines in that series offline while performing any changes.

I completely support the motivations driving the authors of [PLAYER], but I'm not convinced that their solution is the right one (it seems a bit too involved). For example, taking middleboxes off of the physical network path seems like the right thing to do. In general, any deployment that gives you the "logical" arrangement that you want without forcing you into any specific "physical" arrangement is preferred because it gives you flexibility. The authors do at some point suggest that these indirect paths may introduce an efficiency concern, but I'm not convinced as much.

My issue with [PLAYER] is nothing more than the fact that I feel like they could have solved this problem simply using netfilter. In fact, I can only one main advantages to using their system that would not have been achieved by default in netfilter: they have a single administrative source (versus administering netfilter at each of the routers and middleboxes) that disseminates the configuration appropriately. I actually see this as a HUGE selling point, although perhaps they could have instead created a "distributed netfilter". Another way of looking at this is that perhaps this is great research because in retrospect it is ... obviously a good idea! To the credit of the authors they provide lots of great intellectual arguments for why their setup is robust in certain network settings, which I think could also be applied to the right setup of a distributed netfilter.

The authors of [LATE] discuss even more modern usage of data centers commonly dubbed cloud computing. These authors specifically wanted to explain why the open-source MapReduce variant Hadoop performs poorly when the cloud is heterogeneous. Specifically, they propose and justify a new scheduler, Last Approximate Time to End (LATE), to be used within Hadoop that outperforms the native Hadoop scheduler for most of their tests.

For the most part, I like the work behind [LATE]. The authors investigated a novel piece of cloud computing, found some deficiencies, and proposed good, well-founded solutions. Perhaps my only criticism of the [LATE] work is that I don't share the authors views that the Hadoop maintainers made implicit assumptions about the environments in which Hadoop would be used. Rather, I'm guessing something more practical occurred: Hadoop was implemented using a homogenous cluster and fine-tuned under that setting. Of course, I'm not sure about this, but I'm dubious that if the maintainers of Hadoop had used a cloud computing architecture like Amazon's EC2 that they would still have made the scheduling implementation decisions they made. In fact, I would be very interested to see what Jeff Dean and others have to say about this work ... I'd be curious if they had experienced any sort of heterogeneous effects themselves.

Tuesday, November 18, 2008

One Size Doesn't Fit All

I have to use one of my favorite sentences from any paper to start off this post (slightly modified): "it is fundamentally impossible to provide an abstraction in a way that is useful to all applications and to implement these abstractions in a way that is efficient across disparate application needs" [Exokernel]. In many ways, this is reminiscent of the end-to-end argument!

So what am I blabbering about today? Two papers: "An architecture for Internet Data Transfer" [DOT]  and "A Delay-Tolerant network Architecture for Challenged Internets" [DELAY].

The theme I've chosen is this notion that one size doesn't fit all, which is very much the motivation behind DOT. Technically speaking, they advocate the separation between a "control" channel and a "data" channel and a clean interface hiding the actual mechanisms of transport. That way, when data actually needs to be exchanged the application can select some generic "policy" such as best-effort or reliable, high-bandwidth or low-latency, etc, and the system can figure out how to get the data there. In my opinion, I wish this layering existed in the first place ... instead of being stuck with TCP and UDP. That is, TCP and UDP is not a size (abstraction) that fits all, but we don't have any other options! Of course, practically speaking it is hard to deploy other options, and even DOT would require applications to be modified and redistributed.

Nevertheless, I'm definitely a supporter of something like DOT. In fact, one of the possible applications of DOT is the ability to exploit multiple network paths, which is exactly what I'm presently working on! They even show the types of savings you can get when using multi-paths, which can be as high as 50%!

One of the other benefits of providing a more well defined transport interface (or less, depending on how you think of it) is that you can create transport policies that adhere well to certain network environments, like "challenged networks". This brings us to our second paper, where the authors discuss a need for a network architecture that can be robust in the face of many possible "failures" (mostly latency issues). Ironically, these "delay-tolerant" networks seem to make Vern Paxson's pathological cases the common ones.

I've always had my issues with delay-tolerant networks:
  • I'm concerned about all the "internets" we set up in developing regions will soon be overrun by "real" infrastructure (you see it all time ... in fact, some "third world" countries have better infrastructures simply because an industry has moved in and paid the initial up front cost knowing they can reap benefits in the future). This makes me weary of trying to create delay-tolerant GENERAL PURPOSE networking (for HTTP, etc). 
  • In fact, I see delay-tolerant networks as being great for specialized settings, under specialized environments, using specialized hardware and specialized software. Military is a specialized environment, sensor nets getting telemetry information, also specialized.
I think there are lots of interesting research questions regarding DTN, I'm just not convinced that it is something we would want for the general purpose Internets.

Thursday, November 13, 2008

Understanding the Series of Tubes ...

The Internet is massive and complicated and this makes it difficult for researchers to understand its deficiencies and strengths and developers to write applications on top of it. In many ways there isn't really any such thing as a "debugger" for the network, and that makes it hard to do performance debugging (researchers) and logic debugging (application developers).

I'm discussing two papers here, the first is "End-to-End Internet Packet Dynamics" [DYNAMICS] and the second one is "X-Trace: A Pervasive Network Tracing Framework" [XTRACE].

The author (Vern Paxson) of [DYNAMICS] was very interested in understanding network pathologies like out-of-order delivery, packet replication, and packet corruption. I do find it rather ironic that these are "pathologies" considering the architects of the Internet wanted to be robust in exactly these cases. In fact, I don't know how far off base I would be in suggesting that the original architects of the Internet saw these pathological cases as much more of the norm than what we consider today. Vern even comments that in the presence of these pathologies certain network assumptions are no longer applicable ... this implies to me that the community has generally accepted the fact that the common case is no out-of-order delivery, no packet replication, and no packet corruption.

So how devastating can these pathologies be? Well, Vern claims that reordering is not common on all paths, but it can be incredibly common on some paths, especially paths that have frequent route fluttering. In general, however, it seemed the consensus was that reordering generally has little impact on TCP performance. The exception, however, is when trying to determine the duplicate ack threshold. In some cases it might be beneficial to delay the dups to better disambiguate packet loss from reorderings (disambiguating losses seems to be a big theme for a lot of papers ...).

And packet replication? Not such a big deal, apparently. It only seems to occur due to buggy hardware or software configurations. 

And packet corruption? About 1 in 5,000; given a 16-bit checksum this means on average one bad packet in 65,536 will be erroneously accepted resulting in undetected data corruption! Those don't really seem like the best odds ... or perhaps I am missing something here.

To measure bottleneck bandwidth of routes Vern uses packet pair, which I thought was an elegant little trick. The packet pair technique works as follows. Send two packets with spacing x and measure what their spacing is when they get to the receiver. The amount the were "spread out" should tell you the bottleneck bandwidth on a path!

Vern continues his study by looking more closely at packet loss and the efficacy of retransmissions. He has a nice classification of the three types of retransmissions as, "unavoidable", "coarse feedback", and "bad retransmission timeout". This is interesting because it reminds us that in our systems there are two kinds of losses: real losses versus measurement losses (when our tools tell us there is a loss but there really wasn't one). Clearly we want to minimize the amount of retransmissions given the latter. It was even more interesting to see that there were still some buggy implementations of TCP out there that mad it such that the "unavoidable" retransmissions were not the most frequent!

This paper had lots of goodies in it ... and I could probably go on and on and on. Instead, I'll switch gears and discuss X-Trace a bit.

The X-Trace authors were trying to solve a real problem that clearly had plagued some of them before. I know I've been trying to build Internet applications and wondered aloud what in the heck could possibly be happening to my packets? In fact, I'm working as we speak on wireless mobility and some of my results really make me scratch my head ... if I could know exactly where my packets were going, that would be amazing!

Of course, that is a lot to ask for: it is hard enough to get lots of administrative domains to implement the current standards, it seems dubious that we can impose more standards on top. Of course, I thought the authors of X-Trace made two points very clear: (1) X-Trace can be implemented incrementally and (2) the overhead of adding X-Trace is not that substantial from a cognitive perspective or a performance perspective. Both (1) and (2) make X-Trace much more appealing to the community.

The X-Trace concept is relatively straightforward: application layers must perform "pushNext" and "pushDown" events on packets in order to propagate the metadata along the path. The metadata must travel in-band in order to be useful, but the resulting trace can be returned in any manner appropriate (out-of-band, delayed, abridged, not at all, etc). Once a user has gotten as much of the trace as possible, they can perform an offline computation of the data to get the resulting "task tree".

I really liked the idea of X-Trace ... from a completely practical perspective it would be incredibly helpful. In fact, I wouldn't be surprised if a company like Google, that writes or rewrites all of the software used internally, has tried to develop a similar architecture. The benefit they have, of course, is that they have complete control of all of their machines and can make all the necessary changes.

I suppose the counter-argument for something like X-Trace is that some people really don't want others to know how data is making its way through the Internet. It could be an economic concern or a security concern (although security through obscurity is not the right way to go). To that effect, it is really unclear to me the community would respond (or did respond) to this sort of work.

Thursday, November 6, 2008

Internet Indirection

The theme for today's discussion is Internet indirection! Middleware like firewalls and network address translation boxes have proliferated in the Internet but are often scorned upon as breaking the end-to-end design argument. Specifically, a firewall filters packets by explicitly looking at packets that are not addressed to it (and depending on the firewall looking as far up as the data payload of the higher level protocol). A NAT on the other hand might be considered a worse or less offender because it actually rewrites the packet header as is necessary (changing the source address or destination address).

Two papers address possible alternative Internet architectures that allow, or even encourage, indirect processing of packets. The two papers are, "Middleboxes No Longer Considered Harmful" [MIDDLE] and "Internet Indirection Infrastructure" [i3].

The [MIDDLE] paper can actually be thought of as an extension to [i3] where they focus exactly on allowing the notion of middleboxes like firewalls and NATs. In fact, in many ways it seems like [i3] subsumes [MIDDLE] and provides the more generic architecture. In fact, after reading [i3], I wasn't horribly impressed with [MIDDLE]!

The [i3] work, however, I found very elegant and appealing. The principle of i3 is rendezvous routing. Essentially, a server can place a public trigger on the Internet and end hosts can try and send data to the id of that trigger which then performs routing lookups to determine the ultimate destination IP. The indirection is the fact that you send to these identifiers rather than sending to an IP address and the identifiers are stored uniquely and can be updated as the server sees fit (as it moves, for example). Great idea, in my opinion.

Of course, how much different is this then requiring that every single send of data require a DNS lookup? As long as I can update my DNS name server every time I change my name and have it resolve my mapping with a TTL of 0, then it seems like I can more or less accomplish the same thing. The crux is how efficiently this could be performed for something the size of the entire Internet. The basis of DNS is that caching provides performance, and in the absence of caching using a lookup mechanism like DHT might be the absolute right way to go.

To allow middleware the i3 has a notion of a stack of identifiers. This allows sending data to a certain identifier which itself has a stack of identifiers that the packet must traverse, in a way, very much like source routing. Of course, now, because each IP packet is addressed to the right node, it can do whatever transformations of the packet it sees fit. The other added benefit is this model is that you don't need the middleware to exist physical in between an end host and the rest of the Internet! The authors of [MIDDLE] discuss the advantages of this in depth, although they also mention that the physical separation can some times be advantageous for security.

The identifiers in i3 were interesting because, while they did provide for this indirection, they also make it clear that identifiers should be picked in a way which provides geographical/network proximity. In fact, they even mention making one part of the identifier fixed for geography and the other part vary (the "true" identifier). This sounds strikingly similar to a hierarchical address ... (IPv6?).
In terms of performance, [MIDDLE] seemed to have lots of unfortunate overheads. I was rather disappointed with their discussion on packet overheads do to their extra header ... obviously there will be some packet overheads! Essentially, overlay solutions like these might suffer from extra round trip latencies, extra processing time (to figure out, for example, where to route to next), etc. Given these, it make sense that we haven't seen these take over in the Interent ... at least not yet. Of course, there are lots of companies trying to use the overlay and P2P like model to multicast stream live media.  

Tuesday, November 4, 2008

Domain Name System

The Domain Name System (DNS) is probably the best know example of a well used distributed system that the Internet has to offer. I'm discussing two papers in this blog post related to DNS, "Development of the Domain Name System" [DEV] and DNS Performance and the Effectiveness of Caching" [PERF].

The authors of [DEV] gave a report on the development of DNS and learned lessons. First, they clearly make a point of trying to justify that their system can scale. They even claim that close to "50% of all root server traffic could be eliminated by improvements in various  resolver implementations to use less aggressive retransmission and better caching". Interestingly enough, the authors of [PERF] claim that implementations are still "overly persistent"! In fact, they say that implementations still incur many more retransmissions than are necessary. It seems ironic (in the sense that there is all this other distributed systems research) that the entire DNS system scales based on the simple ideas of redundancy and caching!

While the authors of [DEV] claim that caching was a success ... they don't mention some of the security issues (or mention security at all) that makes caching a potential problem. Specifically, there have been techniques that have taken advantage of DNS caches (cache poisoning). I'm not sure of the status of these exploits ...

The authors of [DEV] also discussed how negative caching can be effective even though it conflicts with their intuition. The authors of [PERF] actually show that 13% of all lookups actually result in a negative response. This actually seems very realistic to me because lots of applications try and do reverse lookups on IP addresses. However, lots of name server implementations choose simply not to include that information because the IP addresses that they map are constantly changing (I believe dynamic DNS has this problem).

I'll close with perhaps my favorite part of the discussion, why performance of the DNS system was better than use of the local HOSTS.TXT file. As the authors state, "the old mechanisms were created for a much smaller database and were not adjusted as the size of database grew explosively". That is, DNS was immediately more efficient than doing a search through the really large HOSTS.TXT file! This seems like a lucky break for the DNS folks ...

Thursday, October 30, 2008


The truth is, P2P is a really cool research topic, even if its number one application is sharing content illegally. Thus, I'm spending this post discussing the primary role of P2P: looking up data (which arguably most early P2P applications got wrong). The paper I'll be addressing most is Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications.

First, it should be clear that lookup is not inherently an easy task. Or at least if we assume that we want resiliency to failures, i.e. no central database, or even a collection of root databases under some hierarchical scheme (like DNS). Given that, a straw man approach would be something like broadcasting the message out ... but that clearly sucks. We would prefer a solution that bothers the least number of nodes possible. The authors of [CHORD] present an O(log N) solution! They do have an assumption, of course, which is that the individual nodes are willing to store information about content that they might not themselves actually have (i.e. they store a pointer(s) to who does actually have the content, but they are the ones responsible for resolving any lookups for that data).

Some key parts of Chord:
  • Essentially a distributed hash table.
  • Uses consistent hashing, which has the nice property that each node in the system sees approximately an equal number of keys, and when a node joins or leaves the system only O(1/N) keys need to be moved to a new node.
  • Avoids the requirement that every node knows about every other node. Each node only needs to store information about O(log N) nodes; this is especially helpful in routing and keeps lookups at O(log N).
  • Was implemented for correctness, i.e. handles potentially conflicting concurrent operations (joining or leaving, failure, etc) well.
The authors show a few simulations of Chord, as well as one real Internet prototype, in which it scales well as the number of nodes increases. I'm curious, however, how well this performs in smaller settings where latencies are actually much lower (like within a single data center). Lots of applications are being written in distributed styles where they have lookup needs that they would like to perform efficiently. I'm curious if the design decisions that went into Chord in any way limit it from performing as optimally as it could in a smaller data center like environment ... or if there are instead much better mechanisms for doing lookups there.

Tuesday, October 28, 2008

Overlay Networks

I'm addressing two papers in this post: Resilient Overlay Networks [RON] and Active network vision and reality: lessons from a capsule-based system [ACTIVE]. Both papers are attempting to address the shortcomings of the existing Internet architecture.

The authors of [RON] propose creating an overlay network with three goals in mind, "(i) failure detection and recovery in less than 20 seconds; (ii) tighter integration of routing and path selection with the application; and (iii) expressive policy routing".

The authors of [ACTIVE] also propose to create an overlay network, however, they state their main goal as simply "allow untrusted users to control the handling of their own packets within the network", which, more or less, is the aim of item (i) and (ii) of [RON], however, [ACTIVE] makes it much more explicit and therefore seems to have been harshly criticized (why would I ever want someone else's code to run on my router!).

While the authors of [ACTIVE] do mention that loss information is "clearly useful for congested-related services", they don't seem to make loss detection and recovery a top motivation for their work. (Because, come on, their work can effectively subsume [RON] ...)

In [RON] however, loss is a huge, if not the ultimate motivation. The issue is this: when routes are lost in the network the existing BGP protocol is ineffective, taking on the order of minutes to stabilize a new route. For lots of applications, this is unacceptable (consider a VoIP call, not only will we get "disconnected" but I can't even call back and try again ... unless I wait for a few minutes). As the authors of [RON] point out, the users perception of a "disconnected" connection is not on the order of several minutes like BGP imposes, but someone might be willing to hang out as long as about 20 seconds.

Of course, measuring path outages is hard, which [RON] makes clear. This reinforces, once again, that perhaps, while simple and elegant, using only a single bit of information for congestion and loss is unacceptable! I liked their discussion of failures: they categorized them in terms of either the network's perspective or the applications. The network sees link failures and path failures while an application sees only outages and performance failures.

In general, between both [RON] and [ACTIVE] I was a little concerned about adding this "hammer" on top of the existing Internet architecture. The authors of [RON] even acknowledge the fact that it is unclear how effective a RON will actually be if it became a standard (specifically, what happens when everyone is sending "active probes" to figure out route information ... I mean, should active probes actually be the way we figure out routes?!).

In the case of [RON], why can't we try and get a more responsive BGP to get fast detection and recovery? Why can't we get a better TCP as the basis for applications (so that they can more tightly integrate)?

In  the case of [ACTIVE], do programmers really want, or need, or make decisions about routing/forwarding at each and every hop? They even say themselves, "... it is clear that greater application experiences is needed to judge the utility of active networks. No single application ("killer app") that proves the value of extensibility has emerged, and none is likely to ..."!

Speaking of extensibility, I disagree with the analogue that the authors of [ACTIVE] draw in saying that "Active networks are no different in this regard than extensible operating systems, databases, or language implementations". I don't see it so optimistically ... when you are using disparate resources managed by lots of different people (unlike code running in an operating system on ONE computer) then extensibility is clearly not always what you may want (again, I don't want some other person's code to execute on my router!).

The authors of [ACTIVE] do provide some other valuable insights. Specifically, that systematic change is difficult and it seems necessary to be able to do experimentation without requiring deployment. Of course, isn't that what PlanetLab was for? In addition, the authors of [ACTIVE] discussed how their work may clash with the end-to-end argument, but they claim that it is not an issue (and lots of deployed services already do things that break the end-to-end argument ... firewalls, proxies, etc).

Ultimately, it seems that the authors of [ACTIVE] see their work as being most compelling for network layer service evolution ... which seems like a good fit.