Tuesday, October 28, 2008

Overlay Networks

I'm addressing two papers in this post: Resilient Overlay Networks [RON] and Active network vision and reality: lessons from a capsule-based system [ACTIVE]. Both papers attempt to address shortcomings of the existing Internet architecture.

The authors of [RON] propose creating an overlay network with three goals in mind, "(i) failure detection and recovery in less than 20 seconds; (ii) tighter integration of routing and path selection with the application; and (iii) expressive policy routing".

The authors of [ACTIVE] also propose creating an overlay network; however, they state their main goal more simply: to "allow untrusted users to control the handling of their own packets within the network". This is, more or less, the aim of items (i) and (ii) of [RON], but [ACTIVE] makes it much more explicit, and therefore seems to have been harshly criticized (why would I ever want someone else's code to run on my router!).

While the authors of [ACTIVE] do mention that loss information is "clearly useful for congestion-related services", they don't seem to make loss detection and recovery a top motivation for their work. (Because, come on, their work can effectively subsume [RON] ...)

In [RON], however, loss is a huge, if not the ultimate, motivation. The issue is this: when routes are lost in the network, the existing BGP protocol is ineffective, taking on the order of minutes to stabilize a new route. For lots of applications this is unacceptable (consider a VoIP call: not only will we get "disconnected", but I can't even call back and try again ... unless I wait a few minutes). As the authors of [RON] point out, the user's perception of a "disconnected" connection is not on the order of the several minutes BGP imposes; someone might be willing to hang on for as long as about 20 seconds.
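
To make the contrast with BGP concrete, here is a minimal sketch of the kind of active-probing outage detection [RON] describes: probe each overlay peer, and after a few consecutive losses declare the direct path dead and fail over through another RON node. This is my own illustration in Python; the names (send_probe, indirect_route) and constants are assumptions, not the paper's actual implementation or exact parameters.

```python
# Illustrative sketch of RON-style outage detection.
# Constants and callbacks are assumptions, not the paper's exact values.
import time

PROBE_INTERVAL = 12      # seconds between routine probes to a peer
PROBE_TIMEOUT = 3        # seconds to wait for a probe response
CONSECUTIVE_LOSSES = 4   # losses before declaring a path outage

def monitor_path(peer, send_probe, indirect_route):
    """Probe `peer` directly; on repeated loss, fail over in seconds,
    not the minutes BGP convergence can take."""
    losses = 0
    while True:
        if send_probe(peer, timeout=PROBE_TIMEOUT):
            losses = 0                   # direct path looks healthy
        else:
            losses += 1
            if losses >= CONSECUTIVE_LOSSES:
                # Detection time is roughly CONSECUTIVE_LOSSES *
                # PROBE_TIMEOUT -- well under the 20-second goal.
                indirect_route(peer)     # reroute via another RON node
                losses = 0
        # Probe more aggressively while the path looks suspect.
        time.sleep(PROBE_INTERVAL if losses == 0 else PROBE_TIMEOUT)
```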

Of course, measuring path outages is hard, which [RON] makes clear. This reinforces, once again, that using only a single bit of information to signal congestion and loss, while simple and elegant, is perhaps unacceptable! I liked their discussion of failures: they categorize them from either the network's perspective or the application's. The network sees link failures and path failures, while an application sees only outages and performance failures.

In general, between both [RON] and [ACTIVE], I was a little concerned about adding this "hammer" on top of the existing Internet architecture. The authors of [RON] even acknowledge that it is unclear how effective a RON would actually be if it became a standard (specifically, what happens when everyone is sending "active probes" to figure out route information ... I mean, should active probes actually be the way we figure out routes?!).
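
To see why that worry has teeth, a quick back-of-the-envelope sketch: if every pair of N nodes probes each other, probe traffic grows as N(N-1). That is tolerable at the roughly 50-node scale [RON] targets, but not if "everyone" builds one. The probe size and interval below are assumed values, for illustration only.

```python
# Back-of-the-envelope probe overhead for an N-node RON-style mesh.
# Probe size and interval are assumptions, not figures from the paper.
PROBE_BYTES = 64         # assumed size of one probe packet
PROBE_INTERVAL = 12.0    # assumed seconds between probes per path

def probe_overhead_bps(n_nodes):
    paths = n_nodes * (n_nodes - 1)   # every ordered pair probes
    return paths * PROBE_BYTES * 8 / PROBE_INTERVAL

for n in (10, 50, 1000):
    print(f"{n:5d} nodes: {probe_overhead_bps(n) / 1e3:10.1f} kbit/s total")
# ~4 kbit/s at 10 nodes and ~105 kbit/s at 50 nodes across the whole
# mesh, but ~43 Mbit/s of probes alone at 1000 nodes.
```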

In the case of [RON], why can't we try to make BGP more responsive, to get fast detection and recovery? Why can't we get a better TCP as the basis for applications (so that they can integrate more tightly)?

In the case of [ACTIVE], do programmers really want, or need, to make decisions about routing/forwarding at each and every hop? The authors even say themselves, "... it is clear that greater application experience is needed to judge the utility of active networks. No single application ("killer app") that proves the value of extensibility has emerged, and none is likely to ..."!
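
For context, the capsule model in [ACTIVE] ties forwarding code to the packet itself, so that code runs at every active node the packet traverses. A loose sketch of the idea in Python (the real system, ANTS, is written in Java; this Node/Capsule API is invented for illustration):

```python
# Loose sketch of a capsule whose forwarding code runs at each hop.
# The Node/Capsule API here is invented; ANTS's real interface differs.
class Node:
    def __init__(self, addr, routing_table, cache):
        self.addr = addr
        self.routing_table = routing_table   # dest -> next hop
        self.cache = cache                   # soft state left by capsules

class Capsule:
    def __init__(self, dest, payload):
        self.dest = dest
        self.payload = payload

    def evaluate(self, node):
        """Run by every active node the capsule passes through.
        Custom logic here is exactly the per-hop control whose
        necessity the authors themselves question."""
        if node.addr == self.dest:
            return None                      # deliver locally
        # e.g., consult soft state before falling back to the default route
        next_hop = node.cache.get(self.dest) or node.routing_table[self.dest]
        return next_hop                      # forward toward next_hop
```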

Speaking of extensibility, I disagree with the analogy that the authors of [ACTIVE] draw in saying that "Active networks are no different in this regard than extensible operating systems, databases, or language implementations". I don't see it so optimistically ... when you are using disparate resources managed by lots of different people (unlike code running in an operating system on ONE computer), extensibility is clearly not always what you want (again, I don't want some other person's code executing on my router!).

The authors of [ACTIVE] do provide some other valuable insights: specifically, that systematic change is difficult, and that it seems necessary to be able to experiment without requiring deployment. Of course, isn't that what PlanetLab was for? In addition, the authors of [ACTIVE] discuss how their work may clash with the end-to-end argument, but they claim it is not an issue (and lots of deployed services already break the end-to-end argument ... firewalls, proxies, etc.).

Ultimately, it seems that the authors of [ACTIVE] see their work as being most compelling for network layer service evolution ... which seems like a good fit.

1 comment:

Randy H. Katz said...

I think RON is most interesting for modest deployments of application-specific functionality. For example, most multicast applications, such as multipoint conferencing or source-to-many video streaming, could make use of these schemes if the community is small -- the rule of thumb is order 50 nodes. Note that these could be landmarks within much larger networks, with the ability to redistribute packets locally once received via a wide-area RON. It is true that the R is for Resilient, which is not always needed for applications like this.