Arrow: first results
I’ve written enough of Arrow to get some results for local backups. Eventually I want to support remote backups over an SSH tunnel, similar to how rsync works, but this is a good start on trying out the idea. To run this test I wrote a simple program that backs up single files, named by commands read from stdin, and a Python harness that drives it.
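To make the setup concrete, here is a minimal sketch of what such a harness could look like. The real backup program and its command syntax are not shown in this post, so the `backup <path>` command, the `drive_backup` helper, and the stand-in child process are all assumptions made for illustration.

```python
import subprocess
import sys

# Stand-in for the backup program: it reads one command per line from
# stdin and acknowledges each one. The real tool and its command syntax
# are not shown in the post, so "backup <path>" is hypothetical.
FAKE_BACKUP_TOOL = [
    sys.executable, "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('ok ' + line.strip(), flush=True)\n",
]


def drive_backup(tool_argv, paths):
    """Send one hypothetical 'backup <path>' command per file; return the replies."""
    commands = "".join(f"backup {p}\n" for p in paths)
    result = subprocess.run(
        tool_argv, input=commands, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for reply in drive_backup(FAKE_BACKUP_TOOL, ["Makefile", "kernel/sched.c"]):
        print(reply)
```

Feeding commands over stdin keeps a single backup process alive for the whole run, so the harness does not pay process start-up cost for each of the thousands of files in a kernel tree.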
This first test used every Linux kernel version in the 2.6.23 series, from 2.6.23 to 2.6.23.17. The table below shows the results. Sizes are in bytes, and Delta is the growth in Backup Size over the previous row:
| Patch | Patch Size | Source Size | Backup Size | Time | Files | Delta |
|---|---|---|---|---|---|---|
| 0 | 0 | 252,065,213 | 283,107,750 | 451.08 | 22,530 | 0 |
| 1 | 3,218 | 252,065,491 | 283,119,871 | 91.83 | 2 | 12,121 |
| 2 | 34,402 | 252,066,473 | 283,208,456 | 89.34 | 18 | 88,585 |
| 3 | 65,218 | 252,072,178 | 283,362,188 | 102.47 | 53 | 153,732 |
| 4 | 94,870 | 252,073,453 | 283,591,942 | 143.06 | 70 | 229,754 |
| 5 | 127,389 | 252,077,945 | 283,912,484 | 181.44 | 81 | 320,542 |
| 6 | 182,957 | 252,082,973 | 284,389,157 | 193.47 | 115 | 476,673 |
| 7 | 187,496 | 252,084,315 | 284,831,665 | 220.82 | 113 | 442,508 |
| 8 | 188,895 | 252,084,342 | 285,277,506 | 229.20 | 107 | 445,841 |
| 9 | 221,604 | 252,087,238 | 285,811,410 | 238.36 | 132 | 533,904 |
| 10 | 284,301 | 252,091,147 | 286,524,145 | 240.94 | 181 | 712,735 |
| 11 | 283,078 | 252,091,071 | 287,204,874 | 254.57 | 175 | 680,729 |
| 12 | 281,040 | 252,090,640 | 287,865,620 | 256.07 | 158 | 660,746 |
| 13 | 283,099 | 252,090,838 | 288,510,953 | 255.74 | 254 | 645,333 |
| 14 | 283,830 | 252,090,842 | 289,146,161 | 257.06 | 146 | 635,208 |
| 15 | 541,078 | 252,147,239 | 290,212,238 | 265.04 | 228 | 1,066,077 |
| 16 | 541,304 | 252,147,258 | 291,108,027 | 267.25 | 217 | 895,789 |
| 17 | 555,389 | 252,150,368 | 292,006,750 | 277.29 | 218 | 898,723 |
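The Delta column and the storage overhead can be rechecked from the table with a few lines of Python. This is only arithmetic on the published numbers; the variable names are mine, not part of Arrow.

```python
# Backup Size column from the table, patch levels 0 through 17.
BACKUP_SIZES = [
    283_107_750, 283_119_871, 283_208_456, 283_362_188, 283_591_942,
    283_912_484, 284_389_157, 284_831_665, 285_277_506, 285_811_410,
    286_524_145, 287_204_874, 287_865_620, 288_510_953, 289_146_161,
    290_212_238, 291_108_027, 292_006_750,
]
INITIAL_SOURCE_SIZE = 252_065_213  # Source Size at patch level 0

# Delta for patch N is the backup growth relative to patch N - 1.
deltas = [new - old for old, new in zip(BACKUP_SIZES, BACKUP_SIZES[1:])]

# Overhead of the first full backup relative to the raw source tree.
initial_overhead = BACKUP_SIZES[0] / INITIAL_SOURCE_SIZE

# Total space added by the 17 incremental backups.
total_growth = BACKUP_SIZES[-1] - BACKUP_SIZES[0]

if __name__ == "__main__":
    print("deltas:", deltas)
    print(f"initial backup / source: {initial_overhead:.4f}")
    print("total growth over 17 patches:", total_growth)
```

The first backup is roughly 12% larger than the source tree, and the 17 later versions together add 8,899,000 bytes on top of it.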
I’m fairly happy with the way it performs, and with the savings between versions. The savings aren’t optimal, and the hashes and error-correction codes stored along with the data add significant overhead, but I’m satisfied that it doesn’t inflate the space used significantly more than a plain text-based diff would. The actual disk space used by the chunks is also significantly larger than the backup sizes in the table, since each block is pre-allocated and many of its slots are empty. This could be optimized by allocating the keys and headers ahead of time but allocating the space for the chunks only as needed.
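The optimization mentioned above can be sketched in a few lines. This is not Arrow's actual on-disk format, which the post does not describe; the slot count, header size, and maximum chunk size are hypothetical values chosen only to show how reserving headers up front while appending chunk data on demand shrinks a mostly empty block.

```python
from dataclasses import dataclass, field

SLOTS_PER_BLOCK = 4    # hypothetical; the post does not give the real layout
HEADER_SIZE = 32       # hypothetical bytes of key + header per slot
MAX_CHUNK_SIZE = 1024  # hypothetical upper bound on one chunk


def preallocated_size():
    """Size of a block that reserves full chunk space for every slot."""
    return SLOTS_PER_BLOCK * (HEADER_SIZE + MAX_CHUNK_SIZE)


@dataclass
class LazyBlock:
    """Headers are reserved up front; chunk bytes are appended only when stored."""
    headers: list = field(default_factory=lambda: [None] * SLOTS_PER_BLOCK)
    data: bytearray = field(default_factory=bytearray)

    def put(self, slot, chunk):
        if len(chunk) > MAX_CHUNK_SIZE:
            raise ValueError("chunk too large")
        if self.headers[slot] is not None:
            raise ValueError("slot already in use")
        # The header records where the chunk lives in the data area.
        self.headers[slot] = (len(self.data), len(chunk))
        self.data += chunk

    def get(self, slot):
        offset, length = self.headers[slot]
        return bytes(self.data[offset:offset + length])

    def size_on_disk(self):
        return SLOTS_PER_BLOCK * HEADER_SIZE + len(self.data)


if __name__ == "__main__":
    block = LazyBlock()
    block.put(2, b"hello")
    print(block.get(2), block.size_on_disk(), preallocated_size())
```

With these made-up numbers, a block holding a single five-byte chunk takes 133 bytes instead of 4,224. The trade-off is that a chunk's position is no longer implied by its slot number, so every read goes through the header.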
Next I’m going to implement the SSH tunneling, or at least a prototype of it. The next test will then measure the network bandwidth a remote backup uses.
