Monday, November 24, 2008

The Modern Data Center

I've titled this post "The Modern Data Center" because I'm discussing papers targeted at the modern data center: either (1) a private data center that is heavily administered or (2) a public data center used in a best-effort style without any sort of service level agreement. It is important to realize, of course, that administering your resources in a public best-effort data center might still be valuable, so (1) is not completely disconnected from (2). The papers I'm discussing are "A Policy-aware Switching Layer for Data Centers" [PLAYER] and "Improving MapReduce Performance in Heterogeneous Environments" [LATE].

The authors of [PLAYER] set out to improve on the current state of the art in network setup and administration. The main issue is that current techniques are incredibly inflexible, sometimes inefficient, and sometimes incorrect. Their solution is to introduce a new layer into the stack, the PLayer, which routes traffic according to explicit policies so that the logical network can change dynamically. This provides, for example, the ability to add and remove middleboxes effortlessly. In existing deployments, middleboxes are typically connected in series even when not every piece of traffic needs to be handled by every middlebox (this motivates the inefficiency argument). In such a deployment, adding or removing a middlebox requires very sophisticated configuration, or taking all the machines in that series offline while performing the changes.
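To make the idea concrete, here is a minimal sketch of the kind of policy-driven forwarding the PLayer enables. This is my own toy rendering, not the paper's implementation: the frame fields, middlebox names, and policy table are all made up. The point is simply that each traffic class owes a visit to an ordered sequence of middleboxes, so reconfiguring the network becomes an edit to a table rather than a rewiring job.

```python
# Toy sketch of policy-driven middlebox traversal (illustrative only).
# Each policy maps a traffic class to the ordered middlebox sequence
# its frames must traverse, decoupling logical from physical paths.

POLICIES = [
    # (predicate over a frame, ordered middlebox sequence)
    (lambda f: f["dst_port"] == 80,  ["firewall", "load_balancer"]),
    (lambda f: f["dst_port"] == 443, ["firewall"]),
]

def next_hop(frame, traversed):
    """Return the next middlebox this frame still owes a visit to,
    or None if it can be forwarded to its final destination."""
    for matches, sequence in POLICIES:
        if matches(frame):
            for box in sequence:
                if box not in traversed:
                    return box
            return None  # policy satisfied: deliver the frame
    return None  # no matching policy: forward normally

# A policy-aware switch would consult next_hop() per frame; adding or
# removing a middlebox is then just an edit to POLICIES.
frame = {"dst_port": 80}
print(next_hop(frame, traversed=set()))          # -> "firewall"
print(next_hop(frame, traversed={"firewall"}))   # -> "load_balancer"
```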

I completely support the motivations driving the authors of [PLAYER], but I'm not convinced that their solution is the right one (it seems a bit too involved). That said, taking middleboxes off the physical network path seems like the right thing to do. In general, any deployment that gives you the "logical" arrangement you want without forcing you into a specific "physical" arrangement is preferable, because it gives you flexibility. The authors do suggest at one point that these indirect paths may introduce an efficiency concern, but I'm not as convinced.

My issue with [PLAYER] is nothing more than the feeling that they could have solved this problem simply using netfilter. In fact, I can see only one main advantage of their system that netfilter would not have provided by default: a single administrative source (versus administering netfilter separately at each of the routers and middleboxes) that disseminates the configuration appropriately. I actually see this as a HUGE selling point, although perhaps they could have instead built a "distributed netfilter". Another way of looking at this is that perhaps this is great research precisely because, in retrospect, it is ... obviously a good idea! To the authors' credit, they provide lots of good intellectual arguments for why their setup is robust in certain network settings, and I think those arguments could also be applied to the right setup of a distributed netfilter.
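To be clear about what I mean by a "distributed netfilter", here is a rough sketch: a single administrative source holds per-host rules and pushes them out to every router and middlebox. The host names, rules, and use of ssh for dissemination are entirely illustrative assumptions on my part, though the iptables/iproute2 fwmark technique for steering traffic toward a middlebox is standard.

```python
# Rough sketch of a "distributed netfilter": one administrative source
# pushes per-host netfilter rules to each router and middlebox.
# Hosts, addresses, and the use of ssh are illustrative assumptions.
import subprocess

RULES = {
    "router1": [
        # Steer port-80 traffic toward a firewall middlebox using
        # fwmark-based policy routing (standard iptables/iproute2 usage).
        "iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark 1",
        "ip rule add fwmark 1 table 100",
        "ip route add default via 10.0.0.2 table 100",
    ],
    "firewall1": [
        "iptables -A FORWARD -p tcp --dport 80 -j ACCEPT",
    ],
}

def push_rules():
    """Disseminate the centrally maintained configuration to each host."""
    for host, rules in RULES.items():
        for rule in rules:
            subprocess.run(["ssh", host, rule], check=True)

if __name__ == "__main__":
    push_rules()
```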

The authors of [LATE] discuss an even more modern use of data centers, commonly dubbed cloud computing. These authors specifically wanted to explain why the open-source MapReduce variant Hadoop performs poorly when the cloud is heterogeneous. They propose and justify a new speculative-execution scheduler, Longest Approximate Time to End (LATE), to be used within Hadoop, which outperforms the native Hadoop scheduler in most of their tests.
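For concreteness, the core of the LATE heuristic, as I read the paper, boils down to estimating each running task's remaining time from its progress rate and speculatively re-executing the task expected to finish furthest in the future. Here is a toy sketch; the field names and the threshold value are mine, not the paper's:

```python
# Back-of-the-envelope sketch of the LATE heuristic: estimate each
# running task's time to completion from its progress rate, and
# speculate on the one with the longest approximate time to end.

def time_left(progress_score, elapsed):
    """Estimated seconds remaining: (1 - progress) / progress_rate,
    where progress_rate = progress_score / elapsed."""
    rate = progress_score / elapsed
    return (1.0 - progress_score) / rate

def pick_speculative_task(tasks, slow_task_threshold=0.25):
    """tasks: list of dicts with 'progress_score' (0..1), 'elapsed'
    (seconds), and 'rate_percentile' (progress rate relative to peers)."""
    candidates = [
        t for t in tasks
        # Only consider tasks progressing slowly relative to their peers...
        if t["rate_percentile"] < slow_task_threshold
    ]
    if not candidates:
        return None
    # ...and among those, pick the longest approximate time to end.
    return max(candidates,
               key=lambda t: time_left(t["progress_score"], t["elapsed"]))
```

The real scheduler also caps the number of speculative copies in flight and only launches them on fast nodes, but the ranking above is the heart of it.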

For the most part, I like the work behind [LATE]. The authors investigated a novel piece of cloud computing, found some deficiencies, and proposed good, well-founded solutions. Perhaps my only criticism of the [LATE] work is that I don't share the authors' view that the Hadoop maintainers made implicit assumptions about the environments in which Hadoop would be used. Rather, I'm guessing something more practical occurred: Hadoop was implemented on a homogeneous cluster and fine-tuned in that setting. Of course, I'm not sure about this, but I doubt that the Hadoop maintainers would have made the same scheduling decisions had they developed on a cloud computing architecture like Amazon's EC2. In fact, I would be very interested to see what Jeff Dean and others have to say about this work ... I'd be curious whether they have experienced any such heterogeneous effects themselves.
