Arrow: File Metadata

I’ve been struggling a little with Arrow recently, trying to make progress. Since the chunk storage layer is nearly complete, the next part is the file metadata layer, which we will use to store the actual information about files backed up.

For the past week or so, I’ve been batting around ideas for this, and I think I’ve finally hit on the way to implement it. Files are, of course, only lists of chunk references. Each chunk is identified by two hashes of its contents: a simple rolling checksum and an MD5 hash. (For very small runs of data, shorter than a hash identifier, we’ll store the bytes directly in the file.) We’ll use these hashes to perform an rsync-like upload for new file versions: the backup client gets this list of hashes, figures out which blocks the old and new file versions have in common, and uploads only the chunks that aren’t in the old file.
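To make that concrete, here is a minimal sketch in Python of how a file could be described as a list of chunk references and how a client might decide what to upload. It is an illustration under several assumptions, not Arrow's actual code: the names (ChunkRef, describe, plan_upload) are made up, Adler-32 stands in for the rolling checksum, the block size is tiny, a hash identifier is assumed to be 20 bytes (4 for the weak checksum plus 16 for MD5), and the weak checksum is recomputed for every window rather than rolled incrementally as a real implementation would.

```python
import hashlib
import zlib
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple, Union

BLOCK_SIZE = 32      # assumed block size, kept small for illustration
ID_SIZE = 4 + 16     # assumed identifier size: weak checksum + MD5 digest


@dataclass(frozen=True)
class ChunkRef:
    weak: int        # cheap checksum (Adler-32 stands in for the rolling one)
    strong: bytes    # MD5 digest of the chunk


# A file is a list of entries: chunk references, or short runs stored inline.
Entry = Union[ChunkRef, bytes]


def chunk_ref(data: bytes) -> ChunkRef:
    return ChunkRef(zlib.adler32(data), hashlib.md5(data).digest())


def describe(data: bytes, block_size: int = BLOCK_SIZE) -> List[Entry]:
    """Split data into fixed-size blocks; runs shorter than an ID stay inline."""
    entries: List[Entry] = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        entries.append(block if len(block) < ID_SIZE else chunk_ref(block))
    return entries


def plan_upload(old: List[Entry], new_data: bytes,
                block_size: int = BLOCK_SIZE) -> List[Tuple[str, Entry]]:
    """Return ("reuse", ref) for blocks already stored, ("upload", bytes) otherwise."""
    # Index the old chunks by weak checksum so most windows are rejected cheaply.
    known: Dict[int, Set[ChunkRef]] = {}
    for entry in old:
        if isinstance(entry, ChunkRef):
            known.setdefault(entry.weak, set()).add(entry)

    plan: List[Tuple[str, Entry]] = []
    pending = bytearray()   # bytes that matched nothing in the old version
    i = 0
    while i < len(new_data):
        window = new_data[i:i + block_size]
        match = None
        if len(window) == block_size and zlib.adler32(window) in known:
            candidate = chunk_ref(window)   # confirm with the strong hash
            if candidate in known[candidate.weak]:
                match = candidate
        if match is not None:
            if pending:
                plan.append(("upload", bytes(pending)))
                pending.clear()
            plan.append(("reuse", match))
            i += block_size
        else:
            pending.append(new_data[i])
            i += 1
    if pending:
        plan.append(("upload", bytes(pending)))
    return plan
```

Calling describe on the old version's contents yields the list the client would fetch, and plan_upload then slides over the new contents one byte at a time, so a block still matches even if an insertion earlier in the file has shifted it.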

So, we need a way to preserve the metadata of the backed-up files, which means file names and hierarchies. My first thought was to just have two kinds of files: metadata files, which store the hash references, and directories, which just store lists of files. Every file or directory is identified by a unique identifier (a UUID), and the root directory of the backup is simply identified by the null UUID. Directory entries would contain two values: the UUID of the file and the file’s name. This would reimplement all the work the file system already does, but in exchange we could have versioned files and directories. That seems like a benefit: if you do a periodic backup, you capture both the changes in files and the changes in the directories those files are in, and since files in the same directory are likely related, reverting to a snapshot of an entire directory might make more sense than reverting file-by-file.
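Roughly, that first design would look something like the following Python sketch. The class and function names are hypothetical, the store is just an in-memory dict, and versioning is left out entirely; it only shows directory entries as (UUID, name) pairs and path lookup starting from the null UUID.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Union

ROOT_ID = uuid.UUID(int=0)   # the null UUID identifies the backup root


@dataclass
class FileMeta:
    # Stand-in for the real chunk references (hypothetical field name).
    chunk_hashes: List[str] = field(default_factory=list)


@dataclass
class DirEntry:
    file_id: uuid.UUID   # UUID of the file or subdirectory
    name: str            # its name within this directory


@dataclass
class Directory:
    entries: List[DirEntry] = field(default_factory=list)


Store = Dict[uuid.UUID, Union[FileMeta, Directory]]


def add(store: Store, parent_id: uuid.UUID, name: str,
        node: Union[FileMeta, Directory]) -> uuid.UUID:
    """Store node under a fresh random UUID and link it into its parent."""
    parent = store[parent_id]
    if not isinstance(parent, Directory):
        raise NotADirectoryError(str(parent_id))
    node_id = uuid.uuid4()
    store[node_id] = node
    parent.entries.append(DirEntry(node_id, name))
    return node_id


def resolve(store: Store, path: str) -> uuid.UUID:
    """Walk a slash-separated path from the root, one directory at a time."""
    current = ROOT_ID
    for part in path.strip("/").split("/"):
        if not part:
            continue
        directory = store[current]
        if not isinstance(directory, Directory):
            raise NotADirectoryError(path)
        for entry in directory.entries:
            if entry.name == part:
                current = entry.file_id
                break
        else:
            raise FileNotFoundError(path)
    return current
```

Even this stripped-down version has to redo name lookup by hand, which is exactly the work the file system already does for free.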

Anyway, that’s the idea, but it’s a pain to implement. I’m quickly running out of time, and I need to get something working in the next couple of weeks. So, instead, I’m going for a different option.

The idea is that I’ll just use directories as directories, so the metadata backup will mirror the source file hierarchy exactly. The difference is that each regular file will be stored as a symbolic link to the “head” version of that file’s metadata, which is named by a random UUID. All the file names will be identical, and we’ll use the existing logic the system provides for looking up files by name. We’ll run into issues if you, say, delete a file and then make a directory with the same name, but we’ll punt on hard issues like that for now. I want to be able to demonstrate the idea, and test it, so I’ll have some results to show.
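As a rough sketch of what that could look like, the Python below mirrors a source tree into a metadata tree and replaces each regular file with a symlink to a UUID-named metadata file. The function name, the separate objects directory, and the empty placeholder metadata file are all assumptions for illustration; it also assumes a platform where creating symlinks is allowed, and that the metadata and objects directories live outside the source tree.

```python
import os
import uuid


def mirror_metadata(source_root: str, meta_root: str, objects_dir: str) -> None:
    """Mirror source_root's directories under meta_root, one symlink per file."""
    os.makedirs(objects_dir, exist_ok=True)
    for dirpath, _dirnames, filenames in os.walk(source_root):
        rel = os.path.relpath(dirpath, source_root)
        target_dir = os.path.normpath(os.path.join(meta_root, rel))
        os.makedirs(target_dir, exist_ok=True)   # directories stay directories

        for name in filenames:
            src = os.path.join(dirpath, name)
            # Only regular files get a metadata link; skip symlinks and the rest.
            if os.path.islink(src) or not os.path.isfile(src):
                continue
            link = os.path.join(target_dir, name)
            if os.path.islink(link):
                continue   # this file already has a head version

            # The head metadata file is named by a random UUID. Here it is an
            # empty placeholder; the real one would hold the chunk references.
            head = os.path.join(objects_dir, str(uuid.uuid4()))
            with open(head, "w"):
                pass
            os.symlink(os.path.abspath(head), link)
```

Looking up a file's metadata is then just a matter of opening the same relative path under the metadata root, and publishing a new version would mean pointing the link at a new UUID. The delete-then-mkdir case mentioned above is not handled here either.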

One other thing I’m going to do to save time is punt on network backups. The initial program will just copy files to a local repository, so I won’t have to implement any network transport, even the simple transport I had planned to tunnel over SSH.