The chunk store in arrow is essentially a content-addressable hash table. This means that it maps a hash to a block of data, and the hash is the concatenation of a simple checksum (which is a well-known rolling checksum) and an MD5 digest of the block. These checksums are used in the file layer to figure out the contents of files, and find blocks of data that are redundant across files or file versions.
This hash table can grow extremely large. It might overflow the size of a reasonable file or even overflow a single disk, and we’ll have no idea how large it will be at the beginning. So, it needs to grow gracefully, and still remain as fast as possible — ideally, staying near O(1) complexity.
The solution arrow uses has been around for awhile, first proposed in 1980. It’s called linear hashing, and it’s a pretty neat technique, since it’s so simple. In linear hashing, say you have a strong hashing function that maps values to some large, but fixed-size (and smaller than your data) key space. In the case of arrow, this is a 20-byte value, the checksum pair, which maps each data block to one value in 2160. So, we can expect to store about 280 blocks before a collision, which is a lot. We’re likely to never need anywhere near 264 blocks, even.
The next part is where key/value pairs are stored. Instead of mapping a single key/value pair to a single file, we’ll map small collections of keys into fixed-size buckets, so instead of n files, we’ll have n/m files, where each file stores about m values. What we want is a way to map a large key to one of these files, and we want to be able to add files as the number of entries increases, so that if a certain file contains too many entries (say, it’s load factor grows beyond a threshold), we’ll split that file into two, copying about half the entries from the old file to the new one.
If we have f files, we can just compute the remainder of the key divided by f, and get the file to store that key in. If the number of files is of the form 2i, this is just extracting the lower i bits of the key, much better than a large-integer division. So, the question is, how can we guarantee that we just need to extract the low-order bits each time?
Here’s the algorithm:
x = hash(data) & (2^i)-1
if x < n
x = hash(data) & (2^(i+1))-1
The variable n is the next store to be split, and starts at 0 and increments until it hits 2i-1. Then, we reset n to 0 and increment i. So if n=0, and i=2, then a key that was previously mapped to file 0 might be mapped to store 4. If our hash function is pseudo-random, then splitting a file will move about half the entries to the new file. This lets us split the files one at a time, and the store we split is very likely to be the one that needs to be split next, since it's the oldest file, and thus the one with the most entries.
Arrow implements this for its storage back-end, limiting each file to 256 entries of 1024 bytes (chunks are variable-sized, though, so the limit is a little fuzzy), and each file's name is just its number, encoded in base-64. Files are split when they become 70% full (currently by key, but we could do it by data used, too). So we get file A, B, C, etc. My first tests went extremely well; I instrumented inserting about 2MB, and each file split into about halves — it always was between 45% and 55%.
Linear hashing is a pretty neat technique, and is astoundingly simple to implement. I'll keep testing it, to see how well it holds up to, say, gigabytes of data, but I'm very happy with this implementation.

Loading...
Post a Comment