Backing up data

Sat, Apr 15, 2006

Between photos of Annie, other photography and music, I have quite a bit of data.  Probably over 300GB now and growing fast.

Backing up this much data is a challenge.  I've set some requirements for what I need in a backup system:

  • Automatic.  A backup system is useless if you don't actually use it.  I know that if I'm required to do anything manually it probably won't get done.
  • Off site.  Too many things can go wrong if your only backup is on site.  Your house can burn down, you can get robbed, etc.
  • Virus resistant. If somehow a virus were to try and wipe everything out, I'd like my data to survive.
  • Versioned.  In case I delete something and don't realize right away, I'd like to keep versions of my data.
  • Big. I have lots of data and I don't imagine that I'm going to stop accumulating it.  I'd like to plan for 800GB-1TB.
  • On line.  I want to be able to get to my backup without mounting/moving anything.

This is a pretty tall order.  Here are some things that I considered but decided not to do:

  • RAID. Mirroring is pretty secure in the face of hardware failure.  RAID 5 sounds great, but the stories I hear say that if one drive goes there is a serious chance another will fail during the rebuild.  Simply doing RAID also fails to satisfy many of the requirements above.
  • DVD.  I started backing some stuff up on DVDs.  I got read errors a month later.  Too manual also.
  • External hard drives. I thought about having a couple of really big external drives and rotating them.  This solves a lot of the problems above but it is pretty manual.  A friend of mine is going to do this and keep a copy at the office.  He said he planned to back up and swap drives every couple of months.  That is just too long for me.
  • NT Volume Shadow Copy. This is a cool technology that can keep snapshots through time on a drive.  It looks like the Linux Logical Volume Manager (LVM) can do some similar things.  This isn't really a backup solution as much as a versioning solution.  This plus RAID 1 (mirroring) is probably pretty good, except it isn't off site.

I ended up going with a more brute force solution.  I bought and set up two Linux servers.  The first is a home server.  The second is in a datacenter in downtown Seattle.

The home Linux server is a dual-core Pentium running Ubuntu 5.10 server.  I have a hardware/software RAID card driving five 250GB drives in RAID 5.  It serves a data share over Samba.  I back up that share to another directory every night.  That archive directory is also shared out via Samba, but as read-only.  This makes the situation fairly virus proof.  Since I have two copies of the data (one that is r/w and the other that is a backup) I might have to add more drives in the future.  I got a big honking case so I'll have room.  I'm running SlimServer on this, among other things.  I also use it to do long-running batch enblend stitching jobs for panoramas.
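
For illustration only, here is a minimal sketch of what that nightly local copy could look like as a cron-driven Python wrapper around rsync (my actual job uses the script described further down).  The /srv/data and /srv/archive paths are made-up examples; --link-dest makes each night's snapshot hardlink back to the previous one so unchanged files take no extra space:

    #!/usr/bin/env python
    # Sketch of a nightly snapshot from the r/w share into the read-only
    # archive directory.  Paths are hypothetical examples.
    import os
    import subprocess
    import time

    SRC = "/srv/data/"          # the read/write Samba share
    ARCHIVE = "/srv/archive"    # exported read-only over Samba

    def snapshot():
        today = time.strftime("%Y-%m-%d")
        dest = os.path.join(ARCHIVE, today)
        prev = os.path.join(ARCHIVE, "latest")
        cmd = ["rsync", "-a", SRC, dest]
        if os.path.isdir(prev):
            # Hardlink unchanged files against the previous snapshot.
            cmd.insert(2, "--link-dest=" + os.path.realpath(prev))
        subprocess.run(cmd, check=True)
        # Repoint "latest" at the snapshot we just made.
        tmp = prev + ".new"
        os.symlink(dest, tmp)
        os.replace(tmp, prev)

    if __name__ == "__main__":
        snapshot()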

The data center machine is a 1U Tyan server with 4x320GB drives in Linux SW RAID 5.  It is also running Ubuntu 5.10 server.  (Getting Ubuntu installed on a RAID 5 drive array was a challenge.  I don't remember the exact steps I took or I would document them here.)  The cost of doing this can be high unless you have a friend who can get you hooked up.  Even if you don't, it might be worthwhile.  I'm backing up to this every night over my cable modem.  The cable modem upload speed is okay since I upgraded to Comcast's higher level of service.  I now have ~768kbps up.
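
For a sense of scale, ~768kbps upstream comes out to a little under 3GB in an 8-hour overnight window, which is why only sending diffs (and being able to spread a big upload over a couple of nights) matters so much.  A back-of-the-envelope check, ignoring protocol overhead:

    # Rough upstream capacity at ~768kbps, assuming the link is saturated
    # for an 8-hour overnight window (protocol overhead ignored).
    upstream_bps = 768_000            # bits per second
    bytes_per_sec = upstream_bps / 8  # = 96,000 bytes/sec
    window_hours = 8
    total_bytes = bytes_per_sec * 3600 * window_hours
    print(total_bytes / 1e9)          # ~2.8 GB per night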

There are a couple of choices for software for doing the backups.  The key is that it has to be bandwidth smart (only update diffs), handle versioning gracefully, and be able to do partial copies based on a timer.  The last requirement is so that if I have a ton of data to upload it can go over a couple of nights.  The most obvious candidate for this is rsync.  rsync is an amazing tool for these types of things.  My MS friends think that robocopy is cool, but it can't hold a candle to rsync.  It generally operates over ssh, so it is also secure.  It can also build version trees where unchanged files are hardlinked to previous versions.  The two things it can't do are handle files that have moved but haven't changed, and stop itself after a certain number of minutes.
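
To make the hardlinked version tree idea concrete, here is a stripped-down sketch of the mechanism: each snapshot is a full directory tree, but any file whose size and mtime match the previous snapshot is hardlinked instead of copied, so it costs no extra disk space.  This is roughly what rsync's --link-dest option does; the helper names and the size/mtime test are my own simplifications:

    import os
    import shutil

    def same_file(a, b):
        """Treat files as unchanged if size and mtime match (a simplification)."""
        try:
            sa, sb = os.stat(a), os.stat(b)
        except OSError:
            return False
        return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)

    def snapshot_tree(src, prev_snap, new_snap):
        """Copy src into new_snap, hardlinking unchanged files from prev_snap."""
        for root, dirs, files in os.walk(src):
            rel = os.path.relpath(root, src)
            os.makedirs(os.path.join(new_snap, rel), exist_ok=True)
            for name in files:
                s = os.path.join(root, name)
                p = os.path.join(prev_snap, rel, name)
                d = os.path.join(new_snap, rel, name)
                if same_file(s, p):
                    os.link(p, d)          # unchanged: share the inode
                else:
                    shutil.copy2(s, d)     # changed or new: real copy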

To solve this problem, Scott Ludwig (a coworker of mine who has a similar setup) developed a Python script that does much of what rsync does but solves those last two problems.  It is called "link backup" and you can get it here.  Every night starting at 11:30pm, I run this backup script to back up my working directory to the backup directory on my home server.  This is usually pretty quick.  I then back up the latest snapshot of that backup set to the server in the datacenter.  This can take a little longer, but at least I don't have to think about it.
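
Sketched as a simple driver, the nightly flow looks something like this: run the local backup first, then push the newest snapshot to the datacenter box, cutting the remote stage off at a deadline so it never runs into the morning (the next night's run just picks up whatever is still missing).  The host name, paths, commands, and the six-hour budget are all hypothetical; the real work is done by the script above:

    #!/usr/bin/env python
    # Hypothetical nightly driver: local snapshot, then remote push with a cutoff.
    import subprocess

    # Stage 1: local snapshot (in reality the "link backup" script does this).
    LOCAL_CMD = ["/usr/local/bin/make-local-snapshot"]   # hypothetical placeholder
    # Stage 2: push the newest snapshot to the colo box over ssh.
    REMOTE_CMD = ["rsync", "-a", "-e", "ssh", "/srv/archive/latest/",
                  "backup@colo.example.com:/srv/archive/latest/"]
    REMOTE_BUDGET_SECS = 6 * 3600   # stop the upload after ~6 hours

    def run_with_deadline(cmd, budget):
        proc = subprocess.Popen(cmd)
        try:
            proc.wait(timeout=budget)
        except subprocess.TimeoutExpired:
            proc.terminate()        # unfinished files get retried tomorrow night
            proc.wait()

    if __name__ == "__main__":
        subprocess.run(LOCAL_CMD, check=True)            # fast: local disks
        run_with_deadline(REMOTE_CMD, REMOTE_BUDGET_SECS)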

Since I implemented this, Amazon's S3 has come on the scene.  While it might be interesting to back up using S3, I'm not sure how the economics work out.  At my current level (300GB) and Amazon's pricing model ($0.15 per GB per month), I would be paying $45 per month.  If I grow to 600GB, I'm up to $90 per month.  Bandwidth is extra, but I don't use much of that.  It should be easy to find colocation hosting for that amount of coin.  There is also the one-time cost of the hardware and the work of keeping it up to date (I think my server came in around $1600, but I probably overbought).  The advantage of having your own server in a datacenter is that you can run other things on it too.
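
The quick math behind those numbers (storage only; S3 bandwidth charges and colo fees are left out, and the three-year hardware amortization is just an illustrative assumption):

    S3_PER_GB_MONTH = 0.15        # Amazon's 2006 storage price, $/GB-month
    for gb in (300, 600):
        print(gb, "GB ->", gb * S3_PER_GB_MONTH, "$/month")   # 45.0 and 90.0
    # For comparison, a ~$1600 server spread over an assumed three years of use:
    print(1600 / 36, "$/month of hardware")                   # ~44 $/month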

[April 16, 2006: Edited to fix Ubuntu version number.]