I’m working on a project for my Master’s degree — or at least, I’ve been procrastinating and batting around ideas for the past quarter, and now I’m trying to get something finished.
My plan right now was suggested by my advisor, and it fit really nicely with my experience to date.
I’m calling it “arrow” right now, for no particular reason. The idea is that it is a file backup service, meant to be efficient and recoverable.
Files are stored first as linked lists of metadata info: the head of the list is the most recent version, and subsequent versions are stored in the rest of the list. Each entry contains the usual information about the file, it’s name, times, permissions, etc., and the file data as a list of chunk references. Chunk references are just hash pairs, and chunk lengths (though, I may support putting chunks directly into the chunk list, if they are small enough). The hash pair consists of a simple checksum based on Adler-32 (and we’ll see why this is important below) and a strongly collision-resistant checksum, probably MD5 (we don’t need it to be cryptographically secure here, so MD5, which is very fast, is sufficient). Directories are just stored as lists of pointers to files.
Adding a new version of a file consists of fetching the most-recent revision of the metadata file, and performing a checksum search on the new version, to determine what chunks of the file have changed, and which need to be sent to the backup server. The checksum searching is, as you may have guessed, derived from the rsync algorithm. We can get a double savings with this scheme — delta compression in sending new copies of the file, and delta compression on the storage server, since redundant blocks will only be saved once. Also, we don’t even need any computation on the server side: the rsync checksum list is the file,
Other systems exist that do similar things, but I’m experimenting here, and want to add some features that will both verify and fix storage errors.
Chunk storage will be rather unique; chunks will be stored in small blocks of data — probably around a megabyte or so — and will store in a header all the identifiers (checksum pairs) of chunks stored in the block, along with an offset into the block where the chunk data may be found. Since each block is small, we can use a linear search and be fine. The tricky part is mapping an identifier to a block, for which we’ll use a linear hashing technique: we compute the chunk identifier (it’s hash code) modulo 2i or 2i+1, depending on the last block that was split. This gives the block identifier, and since the divisor is a power of 2, this just translates into extracting the lower i or i+1 bits of the block identifier, so it’s very efficient.
Lastly, each block has appended to it a list of parity bytes, computed over fixed-size parts of the block (about 256 bytes, which won’t necessarily overlap the boundaries of chunks stored in the block). We can verify the contents of the block by computing the checksums again, and comparing them to the identifiers stored in the header. If a checksum fails to verify, we can locate very quickly which sub-block was corrupt, since only a few sub-blocks will overlap that chunk. If two adjacent chunks fail to verify, we know that it’s at least the sub-block that overlaps both. If all blocks fail to verify, or perhaps a significant number of them, we know that a sub-block in the header was corrupt. Given parity information such as a Reed-Solomon code, we can correct the corrupt sub-blocks, finally verifying that we’ve properly corrected it once the checksums agree.
I’m implementing this system in C, and am going for an approach similar to what rsync (the program) uses: there will be a single executable, which can be run in client or server mode, and which can be tunneled over a SSH connection. I’m looking at using Tom’s quagmire for the build system, even, and while that’s still not very mature (I have some changes I’ll make into patches soon), it’s already good enough to build arrow, and very easy to modify.
The current status is that I have the block storage layer mostly implemented. I’ll try to blog more about the progress, and it will be released as free software when its done. Right now I’m using some private code for Reed-Solomon* computation, so I don’t want to make it public yet.
* Also, if you know of a good, free Reed Solomon or other error-correction code library, let me know. I haven’t found any that don’t completely suck, aside from some code being privately developed at UCSC.

Loading...
Seo Sanghyeon | 27-Apr-08 at 5:33 pm | Permalink
Is GFLIB any good?
http://www.cs.utk.edu/~plank/plank/gflib/