Arrow: first results

I’ve written enough of Arrow to get some results for local backups. Eventually, I want to be able to do remote backups over an SSH tunnel, similar to how rsync works, but this is a good start on trying out the idea. To run this test I wrote a simple program that backs up single files, specified by commands read from stdin, and a Python harness that drives the backup.
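As a rough illustration of that arrangement, here is a minimal sketch of a harness that walks a directory tree and feeds one command per file to a child process on stdin. This is not Arrow's actual harness: the `backup <path>` command format and the `drive_backup` and `iter_files` names are assumptions made up for the example.

```python
import os
import subprocess
import sys


def iter_files(root):
    """Yield every file path under root in a stable, sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so os.walk descends in a fixed order
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def drive_backup(command, root):
    """Run `command` and send it one line per file found under `root`.

    The "backup <path>" line format is hypothetical; a real backup
    program would define its own command syntax. Paths containing
    newlines are not handled by this sketch.
    Returns (number of commands sent, exit status of the child).
    """
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, text=True)
    count = 0
    for path in iter_files(root):
        proc.stdin.write(f"backup {path}\n")
        count += 1
    proc.stdin.close()  # end of input tells the child there is no more work
    return count, proc.wait()
```

Calling `drive_backup(["./backup-tool"], "linux-2.6.23")` would send one command per source file, where `./backup-tool` stands in for whatever the real program is called.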

This first test backed up every Linux kernel release in the 2.6.23 series, from 2.6.23 through 2.6.23.17. The table shows the results; Delta is the increase in Backup Size over the previous row:


Patch  Patch Size  Source Size  Backup Size    Time   Files      Delta
0               0  252,065,213  283,107,750  451.08  22,530          0
1           3,218  252,065,491  283,119,871   91.83       2     12,121
2          34,402  252,066,473  283,208,456   89.34      18     88,585
3          65,218  252,072,178  283,362,188  102.47      53    153,732
4          94,870  252,073,453  283,591,942  143.06      70    229,754
5         127,389  252,077,945  283,912,484  181.44      81    320,542
6         182,957  252,082,973  284,389,157  193.47     115    476,673
7         187,496  252,084,315  284,831,665  220.82     113    442,508
8         188,895  252,084,342  285,277,506  229.20     107    445,841
9         221,604  252,087,238  285,811,410  238.36     132    533,904
10        284,301  252,091,147  286,524,145  240.94     181    712,735
11        283,078  252,091,071  287,204,874  254.57     175    680,729
12        281,040  252,090,640  287,865,620  256.07     158    660,746
13        283,099  252,090,838  288,510,953  255.74     254    645,333
14        283,830  252,090,842  289,146,161  257.06     146    635,208
15        541,078  252,147,239  290,212,238  265.04     228  1,066,077
16        541,304  252,147,258  291,108,027  267.25     217    895,789
17        555,389  252,150,368  292,006,750  277.29     218    898,723
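The arithmetic behind the table can be checked with a few lines of Python. The sketch below recomputes the Delta column as the difference between consecutive Backup Size values, and works out how much larger the first full backup is than the source tree. The variable names are mine, and it assumes Source Size and Backup Size are in the same unit.

```python
# Backup Size column from the table, for patches 0 through 17.
backup_sizes = [
    283_107_750, 283_119_871, 283_208_456, 283_362_188, 283_591_942,
    283_912_484, 284_389_157, 284_831_665, 285_277_506, 285_811_410,
    286_524_145, 287_204_874, 287_865_620, 288_510_953, 289_146_161,
    290_212_238, 291_108_027, 292_006_750,
]

# Source Size of the first row (plain 2.6.23).
initial_source_size = 252_065_213

# Each delta is the growth of the backup after adding one more version.
deltas = [after - before for before, after in zip(backup_sizes, backup_sizes[1:])]

# Overhead of the first full backup relative to the raw source tree.
initial_overhead = backup_sizes[0] / initial_source_size - 1

# Total growth from storing the 17 later versions on top of the first.
total_growth = backup_sizes[-1] - backup_sizes[0]

print(deltas)
print(f"initial overhead: {initial_overhead:.1%}")
print(f"total growth for 17 versions: {total_growth:,}")
```

By this calculation the first backup is about 12.3% larger than the source, and the 17 later versions together add 8,899,000 to the backup size.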

I’m fairly happy with the way it performs, and with the savings between versions. The savings aren’t optimal, and the hashes and error-correction codes stored along with the data add a significant amount of overhead, but I’m satisfied that the backup doesn’t inflate the space used significantly more than a plain text-based diff would. Note that the actual disk space used by the chunks is significantly larger than the sizes shown above, since each block is pre-allocated and many of its slots are empty. This could be optimized by allocating the keys and headers ahead of time, but allocating the space for the chunks only as needed.
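To make the pre-allocation trade-off concrete, here is a small back-of-the-envelope model. Every number and name in it is hypothetical (slot count, key size, header size, and chunk capacity are not Arrow's real values); it only shows why reserving chunk space for empty slots dominates the on-disk size of a sparsely filled block.

```python
def preallocated_bytes(slots, key_size, header_size, chunk_capacity):
    """Block size when every slot reserves its key, header, and chunk space."""
    return slots * (key_size + header_size + chunk_capacity)


def on_demand_bytes(slots, used, key_size, header_size, chunk_capacity):
    """Block size when keys and headers are reserved for every slot,
    but chunk space is allocated only for the slots actually in use."""
    return slots * (key_size + header_size) + used * chunk_capacity


# Hypothetical block: 1024 slots, 100 of them filled.
full = preallocated_bytes(1024, key_size=20, header_size=12, chunk_capacity=4096)
lazy = on_demand_bytes(1024, 100, key_size=20, header_size=12, chunk_capacity=4096)
print(full, lazy)
```

With these made-up values the fully pre-allocated block takes 4,227,072 bytes, against 442,368 bytes when chunk space is allocated on demand.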

Next I’m going to try implementing the SSH tunneling, or at least a prototype of it. The next test will be to measure the network bandwidth used by a backup over the network.