Between photos of Annie, other photography and music, I have quite a bit of data. Probably over 300GB now and growing fast.
Backing up this much data is a challenge. I set some requirements for what I needed in a backup system:
This is a pretty tall order. Here are some things that I considered but decided not to do:
I ended up going with a more brute force solution. I bought and set up two Linux servers. The first is a home server. The second is in a datacenter in downtown Seattle.
The home Linux server is a dual-core Pentium running Ubuntu 5.10 server. I have a HW/SW RAID card driving five 250GB drives in RAID 5. It serves a data share over Samba. I back up that share to another directory every night. That archive directory is also shared out via Samba, but as read-only. This makes the situation fairly virus proof. Since I have two copies of the data (one that is r/w and the other that is a backup) I might have to add more drives in the future. I got a big honking case so I'll have room. I'm running slimserver on this, among other things. I also use it to do long-running batch enblend stitching jobs for panoramas.
The data center machine is a 1U Tyan server with 4x320GB drives in Linux SW RAID 5. It is also running Ubuntu 5.10 server. (Getting Ubuntu installed on a RAID5 drive array was a challenge. I don't remember the exact steps I took or I would document them here.) The cost of doing this can be high unless you have a friend that can get you hooked up. Even if you can't, it might be worthwhile. I'm backing up to this every night over my cable modem. The cable modem upload speed is okay since I upgraded to Comcast's higher level of service. I now have ~768kbps up.
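To put that upload speed in perspective, here is a rough back-of-the-envelope sketch. It assumes 1 kbps is 1000 bits per second and 1 GB is 10^9 bytes, and it ignores protocol overhead and compression, so treat the numbers as ballpark only. The function name is just for illustration.

```python
def upload_seconds(size_gb: float, uplink_kbps: float = 768.0) -> float:
    """Estimate seconds needed to push size_gb over an uplink of uplink_kbps.

    Assumes decimal units (1 GB = 1e9 bytes, 1 kbps = 1000 bit/s) and no
    protocol overhead, so real transfers will take somewhat longer.
    """
    bits = size_gb * 1e9 * 8
    return bits / (uplink_kbps * 1000)


if __name__ == "__main__":
    for gb in (1, 10, 300):
        hours = upload_seconds(gb) / 3600
        print(f"{gb:>4} GB: {hours:8.1f} hours ({hours / 24:.1f} days)")
```

By this estimate a single gigabyte takes close to three hours and the full 300GB would take over a month of continuous uploading, which is why sending only the diffs matters so much.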
There are a couple of choices for software for doing the backups. The key is that it has to be bandwidth smart (only upload diffs), handle versioning gracefully and be able to do partial copies based on a timer. The last requirement is so that if I have a ton of data to upload it can go over a couple of nights. The most obvious candidate for this is rsync. rsync is an amazing tool for these types of things. My MS friends think that robocopy is cool; it can't hold a candle to rsync. It generally operates over ssh, so it is secure. It can also build version trees where unchanged files are hardlinked to previous versions. The only two things it can't do are handle files that have moved but haven't changed, and stop itself after a certain number of minutes.
To solve this problem, Scott Ludwig (a coworker of mine who has a similar setup) developed a Python script that does much of what rsync does but solves these last two problems. It is called "link backup" and you can get it here. Every night starting at 11:30pm, I run this backup script to back up my working directory to the backup directory on my home server. This is usually pretty quick. I then back up the latest snapshot of that backup set to the server in the datacenter. This can take a little longer but at least I don't have to think about it.
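To make the idea concrete, here is a minimal sketch of a hardlinked snapshot with a time budget. This is not Scott's script and not how rsync works internally. The names `snapshot` and `unchanged` are made up for illustration, "unchanged" is assumed to mean same size and same modification time, and it does not try to detect files that have moved.

```python
import os
import shutil
import time


def unchanged(a, b):
    """Treat two files as identical if size and whole-second mtime match."""
    if not os.path.isfile(b):
        return False
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)


def snapshot(src, prev, dest, budget_seconds):
    """Build a snapshot of src in dest, hardlinking files unchanged since prev.

    prev is the previous snapshot directory, or None for the first run.
    Returns True if the whole tree was processed, False if the time budget
    ran out first. Calling it again with the same dest picks up where it
    left off, because files already present in dest are skipped.
    """
    deadline = time.monotonic() + budget_seconds
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        os.makedirs(os.path.join(dest, rel), exist_ok=True)
        for name in files:
            if time.monotonic() >= deadline:
                return False  # out of time; finish on a later night
            src_path = os.path.join(root, name)
            dest_path = os.path.join(dest, rel, name)
            if os.path.exists(dest_path):
                continue  # already handled by an earlier partial run
            prev_path = os.path.join(prev, rel, name) if prev else None
            if prev_path and unchanged(src_path, prev_path):
                os.link(prev_path, dest_path)  # no extra disk space used
            else:
                shutil.copy2(src_path, dest_path)  # copy2 keeps the mtime
    return True
```

Each snapshot directory looks like a full copy of the data, but a file that never changes only takes up disk space once. Hardlinks need both snapshots to live on the same filesystem.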
Since I implemented this, Amazon's S3 has come on the scene. While it might be interesting to back up using S3, I'm not sure how the economics work out. At my current level (300GB) and Amazon's pricing model ($0.15 per GB per month) I would be paying $45 per month. If I grow to 600GB, I'm up to $90 per month. Bandwidth is extra, but I don't use much of that. It should be easy to find colocation hosting for that amount of coin. There is also the one-time cost of the hardware and the work of keeping it up to date (I think my server came in around $1600 but I probably overbought). The advantage is that you can run other things on your own server in a datacenter.
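The arithmetic is simple enough to sketch. The storage rate and the $1600 hardware figure are the ones quoted above; the colocation fee is a made-up parameter you would have to fill in yourself, and bandwidth charges are left out entirely.

```python
S3_RATE_PER_GB_MONTH = 0.15  # storage price quoted above; bandwidth not included


def s3_monthly_cost(size_gb: float) -> float:
    """Monthly S3 storage bill in dollars for size_gb of data."""
    return size_gb * S3_RATE_PER_GB_MONTH


def months_to_recoup(hardware_cost: float, size_gb: float, colo_monthly: float) -> float:
    """Months until owning the server beats paying for S3 storage.

    colo_monthly is a hypothetical colocation fee. Returns infinity if
    colocation costs as much as or more than S3 every month.
    """
    saving = s3_monthly_cost(size_gb) - colo_monthly
    if saving <= 0:
        return float("inf")
    return hardware_cost / saving


if __name__ == "__main__":
    print(f"300 GB: ${s3_monthly_cost(300):.2f}/month")
    print(f"600 GB: ${s3_monthly_cost(600):.2f}/month")
```

The more the data grows, the faster the hardware pays for itself, as long as the colocation fee stays below the S3 bill.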
[April 16, 2006: Edited to fix Ubuntu version number.]