Mon, 22 Dec 2008

Backups

I had a dream last night that the apartment beneath ours caught on fire, we had to rush out of the building, and my computer and all of its data were destroyed.

I've been pondering a formal backup system for a while now. (My current system involves making sure important files are in a version control system and exist on at least my laptop and desktop. This is pretty ad hoc, inconsistently updated, and not entirely comprehensive.) I'm taking my dream as impetus to actually set something up. This post is to help me organize my thoughts and see if anyone has any comments or suggestions.

My Requirements

I want to have a full rotating versioned backup system, where I have complete daily backups for a recent time span (say a week or so) and more sporadic backups back to as much as a year in the past. Ideally, the backups should be stored in a space-efficient manner so unchanged files don't take up more space than a single copy would require. The backups should have off-site redundancy. They should be relatively easy to use; they should be fully automated on a day-to-day basis, with notification when things go wrong. Ease of setup would be nice but not necessary.
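As a rough illustration of the rotation described above, here is a minimal Python sketch of a pruning rule. The specific policy is an assumption made for the example, not something settled in this post: keep every snapshot from the last 7 days, plus the earliest snapshot of each calendar month for up to 365 days. The function name and parameters are hypothetical.

```python
from datetime import date, timedelta


def backups_to_keep(snapshots, today, daily_days=7, max_age_days=365):
    """Return the set of snapshot dates worth keeping.

    Assumed policy (illustrative only): keep everything from the last
    `daily_days` days, and the earliest snapshot of each calendar month
    for anything older, back to `max_age_days` days ago.
    """
    daily_cutoff = today - timedelta(days=daily_days)
    oldest_allowed = today - timedelta(days=max_age_days)

    keep = set()
    first_in_month = {}
    for snap in sorted(snapshots):
        if snap > today or snap < oldest_allowed:
            continue  # too old (or dated in the future): not kept
        if snap > daily_cutoff:
            keep.add(snap)  # recent: keep every daily snapshot
        else:
            # older: remember only the first snapshot seen in each month
            first_in_month.setdefault((snap.year, snap.month), snap)

    keep.update(first_in_month.values())
    return keep
```

Anything not in the returned set would be a candidate for deletion. A weekly tier between the daily and monthly ones would be an easy extension.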

My Data

I currently have about 720 GB of data in my home directory, plus a few hundred MB elsewhere on the computer that I'd want to back up. I also have about 11 GB in a bzr repository, but all of that should remain duplicated in my home directory. Most of the data in my home directory is in media files that I can either replace (rerip CDs, etc.) or live without; only 25 GB of it is stuff that I must back up. (A further 130 GB is stuff that would be nice to back up, but I can just burn it to DVD and consider those my backups; the data is essentially static.)

JWZ Backups

The easiest approach is the JWZ backup solution. For all of my data, that would be two 1 TB external hard drives, for about $220. If I restrict myself to the "must backup" data, I could make do with two 60 GB external hard drives for about $80. In either case, I'd keep one drive at the office and swap them periodically.

The advantage of this approach is that I control everything. I can put encrypted volumes on the drives, so if they get lost or stolen, my data isn't usable to other people. I can use rsync with hardlinks between datestamped directories to get versioned backups with efficient disk usage. The drawbacks are a modest initial monetary outlay and the need to coordinate shuttling drives back and forth.
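To make the hardlink idea concrete, here is a small pure-Python sketch of what a datestamped snapshot with hardlinks does. It is not rsync, only an imitation of the behavior rsync provides through its --link-dest option: files that look unchanged (same size and modification time) are hard-linked to the previous snapshot, so they take no extra space, and everything else is copied. The function names are hypothetical, and the sketch ignores symlinks, ownership, and error handling.

```python
import os
import shutil


def _unchanged(src, old):
    """True if `old` exists and matches `src` in size and mtime (to the second)."""
    if not os.path.isfile(old):
        return False
    a, b = os.stat(src), os.stat(old)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)


def snapshot(source, dest, previous=None):
    """Copy the tree at `source` into `dest`.

    Files that are unchanged relative to the `previous` snapshot are
    hard-linked instead of copied, so they share disk blocks with it.
    """
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        target_dir = os.path.normpath(os.path.join(dest, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and _unchanged(src, old):
                os.link(old, dst)  # new directory entry, same data on disk
            else:
                shutil.copy2(src, dst)  # copy2 keeps the mtime for later comparisons
```

Each night's run would pass yesterday's directory as `previous` and a new datestamped directory as `dest`. Deleting an old snapshot directory only frees the blocks that no other snapshot still links to. Hard links require both snapshots to live on the same filesystem.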

Amazon S3

Another approach is to use Amazon S3 to store my data. It's offsite by definition (and stored across multiple data centers; if I write data to it, I can reasonably trust that I'll get that data back). It's not too expensive: at $0.15/GB-month, my minimal backup will cost about $3.85/month. Throw in transfer costs and churn, and I doubt I'd exceed $6/month. (The initial upload would be $2.56. A full restore would cost me $4.36.) With S3, I would only back up the minimal data; the 130 GB of optional backups would cost an additional $20/month, which would exceed the cost of the full do-it-myself hard drive backups in one year.

The complication to S3 is that it's just a web-based data storage service; you need additional software to make a reasonable backup solution.

Jungle Disk

From everything I've read, Jungle Disk is currently the best software for storing filesystem data on S3. It runs on Windows, Mac OS X, and Linux, and exports your S3 buckets as a WebDAV disk, which you can then mount and treat like an ordinary (unlimited capacity) disk drive. All data is encrypted before it's sent to S3.

I like this approach. Since it looks like a disk, I can use the same rsync setup I would with my own disks, and since the data is encrypted, I don't need to worry too much about it being transported over the Internet and stored on someone else's servers. The main drawback is that it's proprietary software. In addition to my principled preference for open source software over proprietary, there's also the issue that, especially because the data's encrypted, this software would be my only access to my backups. If something went wrong and I couldn't get support from the company (e.g. they went out of business), I'd be out of luck.

The software costs $20. Assuming $5/month on S3, it would take one year for this approach to cost more than the minimal get-my-own-disks approach.
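The break-even comparisons in this post are simple enough to check with a few lines of Python. This is only a sketch of the arithmetic using the dollar figures quoted above; the function name is hypothetical.

```python
import math


def months_to_reach(target_cost, monthly_cost, upfront_cost=0.0):
    """Months of service until cumulative spending reaches `target_cost`."""
    remaining = target_cost - upfront_cost
    if remaining <= 0:
        return 0  # the upfront cost alone already matches the target
    return math.ceil(remaining / monthly_cost)


# Jungle Disk ($20) plus about $5/month on S3, versus $80 for two small drives.
jungle_disk_months = months_to_reach(80, 5, upfront_cost=20)

# The optional 130 GB at about $20/month, versus $220 for two 1 TB drives.
optional_data_months = months_to_reach(220, 20)
```

The first comparison comes out to 12 months and the second to 11, which matches the "one year" estimates above. After that point the hosted option keeps costing money while the drives do not.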

Other S3 software

I haven't seen anything else that will let me back up to S3 and keep versioned backups in a space-efficient manner. Most of the S3 backup software I've seen doesn't do versions, and the few that do don't appear to do it space-efficiently. As always, I have the option of writing my own, but that would take a fair amount of time and effort, and I'd be likely to give up partway through, continuing to leave myself without good backups.

Conclusion

Barring any better suggestions from others, I'm leaning towards the two smallish hard drives. They'd pay for themselves after a year of use, and I get complete control of my data (for better or worse). I like the idea of using S3, but it's more expensive in the long run, and I'm not completely happy with any of the software I've found to use with it.


Phil! Gold