Remembering How to Hack

September 3, 2009

I’m currently on a European travel excursion (post on that should come shortly) and right now I find myself in Berlin. In talking with Jan Lehnhardt last night after dinner I mentioned that I had been away from coding for a while. He joked, “What? A week?”.

I paused… “More like three months”. Really, though, it’s more like six.

Upper level Computer Science classes don’t always make you code that much. A seminar on distributed systems is lovely, but it’s not hacking. Hacking, as I understand it (and as Christopher Kelty’s Two Bits has been explaining to me throughout my trip) has always placed a lot of value on Working Code. You can talk all you want (as I did in my last post), but ultimately it’s all a lot of hot air until you have working code. Even when you have to write code for class, the projects are often highly specified, you work in a (relatively) closed environment (in terms of code sharing) and it’s likely your project has been done hundreds of times by other students before. That doesn’t feel like hacking.

I felt that my last post had an overly academic tone (no surprise as it was initially drafted as an e-mail to a classmate). It was a nice mental exercise.

Some database systems do put clustering at the storage layer. CouchDB seems to be going the another way, with projects like lounge and rumored efforts to port similar functionality to erlang. (FWIW I’ve decided I like this approach better, in retrospect, based on the idea that the core system should be simple and layers provide good separation of concerns.) I could sit and debate in my head all I want about which way is better. I could make blog posts and I could chat on IRC (I have). None of this is working code. It’s all just academic.

Now that I’ve recently graduated I’m looking forward to remembering what it’s like to write code. Hopefully, I’ll write lots of it. Maybe soon my github account will actually have some activity on it.

I’ve been falling asleep and waking up with these thoughts on my mind for almost 6 months now, so I think it’s about time to share them and get some feedback. To this end I’ve (finally) started a blog.

First, definitions:
Client – a device participating in the distributed system.
Node – a B+tree node.

A database (such as CouchDB)  is commonly indexed using some form of a B+tree. The task is one of effectively distributing this tree. If it can be distributed in a dynamically balanced way: success. The term ‘dynamic’ here refers to variability in the number of participating clients.

I claim that a (mostly) consistent tree is not fatally expensive.

Traversals cost network round trips because pointers between nodes are potentially pointers between clients. Therefore, caching inner nodes so that most operations do not incur network latency when traversing is essential. Forgo global consistency by having clients act on stale representations of the tree, using per-node MVCC to verify traversals at the leaf [1].

Another optimization not yet well thought out: Allow the client which ‘owns’ a node to decide when to split it. It could be split locally but advertised to other clients as merely overfilled. This rule creates ‘hidden depth’ beneath (what appears to most clients to be) a leaf node, deferring the impact of insertion until such time as the depth is publicly revealed and repaired. Open question of possible concerns/consequences.

It is possible to route on a statically configured network or on an overlay network.
The overlay network is much more interesting and is the motivation for a dynamic solution. If the overlay can account for and deal with client churn the result is a database that scales without reconfiguration. The real task is to assign tree nodes to overlay clients in a dynamically balanced way.

Leaf nodes are all always at the same tree depth, a feature not unique to B+trees. Therefore, reasonable approximations of client ordering make slicing the leaf nodes a trivial task. There exists at least one gossip-based protocol which claims to solve distributed slicing with fast convergence [2]. However, the consequences of disagreement might be vast if sufficient care is not taken (slicing many leaf nodes to k vs. k+1 buckets could produce vastly different results), but increasing the number of buckets decreases the error.

(Note: The problem of distributing inner nodes is not terribly worrisome, since they should not be accessed for most operations.)

It’s possible to enforce an upper bound on the ratio log_b(n)/k of leaf nodes to buckets (n = number of nodes, k = number of slices, b = order of the tree) by using virtual buckets that have a many-to-1 correspondence with clients, but this appears only to defer the problem.

So where do we go from here? I’m going to fall asleep and wake up to this problem until I get some answers, as I’ve been doing for months now.  Any help is appreciated. Please – for my sanity.

  1. Marcos K. Aguilera, Wojciech Golab, and Mehul A. Shah, “A practical scalable distributed B-tree,” Proc. VLDB Endow. 1, no. 1 (2008): 598-609.
  2. V. Gramoli et al., Sliver, a fast distributed slicing algorithm (Technical report, Cornell University, 2007. http://www.cs.cornell.edu/projects/quicksilver/public_pdfs/sliver_final.pdf).
Reblog this post [with Zemanta]