loading twitter...

Arrow’s source

The source code for my master’s project is now available via a read-only git repository at http://git.metastatic.org/readonly/arrow.git. If you’re interested in it, you can run:

git clone http://git.metastatic.org/readonly/arrow.git

and start playing with it. If you just want to browse the code, you can try out the gitweb interface via http://git.metastatic.org/. That lets you download a tarball of the current code, if you don’t want to clone the project.

Note that I’ve removed parts of the code that compute the Reed-Solomon erasure codes, since the licensing terms for that library aren’t clear right now.

It’s somewhat unstable right now, in that you really need to know what you’re doing to produce results, instead of breaking things. The source is available under the MIT/X11 license, though parts of it does use code licensed under the LGPL.

Uncategorized

Comments (0)

Permalink

Efficient and safe data backup with Arrow

My Master’s project report, aka Technical Report UCSC-SSRC-08-02.

Uncategorized

Comments (1)

Permalink

Arrow: first results

I’ve completed enough code with Arrow to get some results for local backups. Eventually, I want to be able to do remote backups over an SSH tunnel, similar to how rsync works, but this is a good start on trying out the idea. To run this test I wrote a simple program that backs up single files, specified by commands read from stdin, and a Python harness that drives the backup.

This first test used all Linux kernel versions in the 2.6.23 series; that is, every kernel version from 2.6.23 to 2.6.23.17. This table shows the results:


Patch Patch Size Source Size Backup Size Time Files Delta
0 0 252,065,213 283,107,750 451.08 22,530 0
1 3,218 252,065,491 283,119,871 91.83 2 12,121
2 34,402 252,066,473 283,208,456 89.34 18 88,585
3 65,218 252,072,178 283,362,188 102.47 53 153,732
4 94,870 252,073,453 283,591,942 143.06 70 229,754
5 127,389 252,077,945 283,912,484 181.44 81 320,542
6 182,957 252,082,973 284,389,157 193.47 115 476,673
7 187,496 252,084,315 284,831,665 220.82 113 442,508
8 188,895 252,084,342 285,277,506 229.20 107 445,841
9 221,604 252,087,238 285,811,410 238.36 132 533,904
10 284,301 252,091,147 286,524,145 240.94 181 712,735
11 283,078 252,091,071 287,204,874 254.57 175 680,729
12 281,040 252,090,640 287,865,620 256.07 158 660,746
13 283,099 252,090,838 288,510,953 255.74 254 645,333
14 283,830 252,090,842 289,146,161 257.06 146 635,208
15 541,078 252,147,239 290,212,238 265.04 228 1,066,077
16 541,304 252,147,258 291,108,027 267.25 217 895,789
17 555,389 252,150,368 292,006,750 277.29 218 898,723

I’m fairly happy with the way it performs, and with the savings between versions. Of course, the savings aren’t optimal, and there is a significant amount of overhead with all the hashes and error-correction codes stored along with the data, but I’m satisfied that it doesn’t inflate the space used significantly more than the plain text-based diff. Also, the actual disk space used by the chunks is significantly larger than the space above, since each block is pre-allocated, and many of the slots are empty (this could be optimized, by allocating the keys and headers ahead of time, but allocating the space for the chunks on an as-needed basis).

Next I’m going to try implementing the SSH tunneling, at least, a prototype of it. Counting the network bandwidth used in a network backup is the next test.

Uncategorized

Comments (0)

Permalink

Arrow: Base-64 on the file system

Arrow stores most everything as files. Blocks, which can store multiple chunks, identified by the chunk’s hash code, are stored in sequentially-numbered files, starting from zero and going to whatever (internally, this is a 64-bit unsigned integer, so it can’t be more than 264, but I don’t think that will be an issue ;-) . Version files, lists of chunks that comprise the underlying file, are identified by a random GUID.

I initially made all these file names base-64, with a minor modification: instead of ‘/’, I used ‘*’. This gave me something I can store on sensible file systems, it doesn’t waste too much space with the encoding, and it’s really simple to encode and decode. The trouble happened when I tried running arrow on Mac OS X, which by default has a case-insensitive file system [1]. Arrow failed pretty spectacularly when it tried writing more than 26 blocks into the store.

I could have rewritten the storage parts to use something like hexadecimal, but instead I just replaced the lower-case letters with other letters from the ASCII alphabet. It just so happens that if you avoid control and space characters, and characters that can’t be a part of file paths (/ and .), you have just enough characters to cover the 26 you can’t use! I know I could have used space characters, or characters between 128 and 255, but I opted to stay in ASCII, and came across an alphabet that works:

ABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'(),-:;<>?@[~]^_`{|}0123456789+*

I thought it was pretty neat that ASCII had just enough to cover what I needed.

I’ve put some of the code up on-line. I don’t have permission to publish all of it, but I’ll see if I can get it.

1. You can format HFS+ as case-sensitive, but this tends to not work well in practice. There’s still lots of legacy issues.

Uncategorized

Comments (0)

Permalink

Arrow: File Metadata

I’ve been struggling a little with Arrow recently, trying to make progress. Since the chunk storage layer is nearly complete, the next part is the file metadata layer, which we will use to store the actual information about files backed up.

Continue Reading »

Uncategorized

Comments (0)

Permalink