Robust Linux Backup Scheme
This is a description of my backup system for Linux. My goals are:
- Maximum one day lost for important data, even if my house burns down
- Maximum one hour lost for important data, as long as the house doesn't burn down, but even if the server machine fries everything inside
- Recovery of all data should be possible, regardless of the cause of corruption
- Corruption of user or system files should be detected as soon as possible to avoid their spread
I use a combination of tools and methods to achieve my goals.
Main System Considerations
There are three changes I made to the main system in this regard:
- Create a Linux software RAID. If you're not familiar with RAID, one of its modes (RAID-1) uses two disks as one: you see a system device that looks like a single disk, but everything written to it is stored identically on both drives. The obvious advantage is that if one of the hard drives dies, you still have everything on the second one. If things work right, you just put in a new drive, reboot, and the new drive is synchronized from the remaining good one. It's important to realize that this only protects you against hard drive failure: any other kind of corruption will be happily mirrored to both devices in the RAID. This is probably the least important component of a data loss prevention strategy. Get a good backup strategy working first, then think about RAID.
- Use Linux LVM. LVM is an abstraction layer between physical devices like hard drives and partitions. LVM is extremely powerful, but for robust backup, the important thing is that it allows you to take "snapshots" of your disk frozen in time, so that you don't get backup corruption due to files changing in the middle of the backup.
- Use an old laptop as a reserve server. My phone service runs through my Linux server using Asterisk, so if the server dies I need a stand-in until I can get a replacement. I have an old laptop with a 40 gig drive that has /home, /etc, and /var mirrored from the server once an hour with rsync. If the main server fails, this old machine can take its place (although much more slowly). Even 300 MHz is plenty for an Asterisk server, so old and slow is fine. Call it poor man's high availability.
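The three changes above can be sketched as a command transcript. All names here are hypothetical (partitions /dev/sda1 and /dev/sdb1, volume group vg0, a reserve machine reachable as "laptop"), everything needs root and real hardware, and the RAID step destroys existing data on those partitions, so adapt before running anything:

```shell
# 1. RAID-1: two partitions appear as one device, /dev/md0.
#    (Destroys existing data on /dev/sda1 and /dev/sdb1!)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # so it assembles at boot
cat /proc/mdstat                                 # watch the initial sync

# 2. LVM snapshot: back up from a frozen image instead of the live tree.
lvcreate --snapshot --size 1G --name home-snap /dev/vg0/home
mount -o ro /dev/vg0/home-snap /mnt/snapshot
# ... run the backup against /mnt/snapshot instead of /home ...
umount /mnt/snapshot
lvremove -f /dev/vg0/home-snap

# 3. Hourly mirror to the reserve server, as a cron.d entry:
#    0 * * * *  root  rsync -a --delete /home /etc /var laptop:/srv/mirror/
```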
I split my data into three groups, each of which has a different on-line backup strategy:
- Important user files, such as my research software, publications, correspondence
- Data store: music, videos, books, software
- Less important user files, system configuration files
Important User Files
All important user files are stored in Subversion, a revision control system. If you're not familiar with them, revision control systems are commonly used to keep track of source code or documentation for large projects. Among many other things, you can see every past revision of a file and compare any two revisions of it.
Subversion is built around a database, so you can create a "dump" of it: a plain-text representation from which the database can be recreated. Dumps are nice because, unlike the normal database files, they are very robust to corruption.
Subversion dumps are then backed up along with the files in the Main User Files section. Because this is my most important data, and I don't feel confident that Subversion would rebuild cleanly from dumps that had been corrupted and then repaired, I also keep a mirror of the current revision of all Subversion repositories, and this mirror too is backed up as described in the Main User Files section.
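As a sketch, with hypothetical repository and backup paths, the dump and the current-revision mirror might look like this:

```shell
# Dump the repository to robust plain text:
svnadmin dump /var/svn/research > /backup/svn/research.dump
# Rebuilding from the dump later:
#   svnadmin create /var/svn/research
#   svnadmin load /var/svn/research < /backup/svn/research.dump
# Keep a plain checkout of the current revision as the mirror:
svn checkout file:///var/svn/research /backup/svn/research-current
```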
Data Store
The disadvantage of revision control systems and differential backups is that they use several times the space of the data being backed up. Some files, such as music, videos, and photos, rarely if ever change. For these, a better approach is to keep a mirror using a mirroring tool such as Unison. Because the files don't change, there is no need to be able to recreate past versions. If you run the mirroring tool in interactive mode, you can detect and stop corruption before it spreads to the backup just by paying attention: for example, if the January 1988 photos directory suddenly shows changes that need to be propagated to the mirror, you can stop and investigate what's going on.
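A minimal sketch, with hypothetical paths and partner host; run with no extra flags, unison prompts interactively before propagating each change:

```shell
unison /data/photos ssh://partner.example.org//backup/photos
```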
Of course, to have a remote mirror you have to have an account and storage somewhere. I found a partner who also runs a Linux server, and we each have the other's USB hard drive attached to our server. If you aren't using encryption (which is impossible with rdiff-backup, but possible with other tools), you obviously have to trust the person running the remote computer.
Main User Files
I back up all files that are important to me but don't fit into the other two categories using a mirroring tool called rdiff-backup. It is a mirroring tool, meaning all the files are kept identical on a remote computer, but in addition it keeps "reverse diffs", so you can go back to any past version of a file. It also records file ownership, group, and permissions and restores them properly even if the remote side is an unprivileged user account. The effect is similar to using a revision control system, but the implementation is much simpler and more resistant to corruption, a mirror is created, and restoring ownership, group, and permissions works correctly. The main advantage over a plain mirroring tool like rsync is this: suppose a file is corrupted today, and tonight rsync automatically mirrors your files. Now the backup and the file are both corrupt, and you are up the creek.
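A sketch of the workflow, with hypothetical paths, user, and host names:

```shell
# Nightly: update the mirror and record a reverse diff.
rdiff-backup /home/user user@remote::/backup/home-user
# Restore a file as it was three days ago:
rdiff-backup --restore-as-of 3D \
    user@remote::/backup/home-user/notes.txt /tmp/notes.txt
# Reclaim space by dropping increments older than a year:
rdiff-backup --remove-older-than 1Y user@remote::/backup/home-user
```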
Another interesting solution to this problem is rsync snapshots using rsync and cp -al. It has the advantage of very easy access to past snapshots: they are just directories containing the old files. The effect is similar to rdiff-backup. The disadvantage, I think, is that if a small part of a large file (such as a revision control system database) changes regularly, the entire file is copied again into every snapshot. If you don't care about that, this looks like a very nice method.
Off-line Backups
My experience with tapes over the years is that they are extremely reliable... until you need them, at which point they have somehow ended up completely destroyed. They are also extremely expensive for what you get, so I will only talk about using recordable CDs or DVDs for off-line backup.
Burning an archive to CD has the advantage that there's no worrying about whether you will have a good snapshot back in time, because every snapshot is self-contained. Then the question becomes how to make a CD backup as robust as possible.
First, erase both tar and gzip from your memory. They are the cause of more backup misery than should be allowed. Because of the way gzip works, a single corrupt bit near the beginning of a file destroys everything after it. If you use tar -z to create your backups, you can lose an entire backup to a single corrupt bit. Why people do this anyway, I'll never understand. dump is the old-fashioned way to create a reliable backup, but its huge disadvantage is that it has problems with mounted filesystems. That can be resolved using LVM snapshots, but there is a much better way: the cpio file format. It uses uncompressed ASCII text headers, turning a filesystem, which is a complex data structure, into a clean linear stream that is easy to recover when things go wrong.
afio is an archiver that uses the cpio format, but has the advantage that it can transparently compress files. The difference between afio and tar with compression is critical: tar -j creates an archive and then bzip2s the whole thing, while afio bzip2s each file and then adds it to the archive. With tar -j you're likely to lose the whole archive past the first corruption, but with afio you lose only the files that are actually corrupted. It is best not to use compression at all, because compression always amplifies corruption loss. If you must use compression, use bzip2 combined with afio, and give bzip2 the -1 option: then you will only lose about 100K of your archive around each area of corruption. bzip2 also has recovery tools (bzip2recover) in the standard distribution, while gzip does not.
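A one-line sketch with hypothetical paths; the -0 null-separator option is an assumption here, so check that your afio build accepts it:

```shell
# -Z compresses each file individually; -P/-Q select bzip2 at level 1,
# so corruption costs at most ~100K around each damaged spot.
find /home -type f -print0 | afio -o -0 -Z -P bzip2 -Q -1 /backup/home.afio
```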
When you burn the cpio/afio archive, write it directly to the disc without a filesystem. In other words, just use cdrecord; don't pipe mkisofs into cdrecord. For a backup, a filesystem gains you nothing. When reading a damaged CD, dd_rescue can recover everything except the corrupt areas, and it tells you which parts of the image are bad, so you can chop the affected files out of the cpio archive. Without a filesystem in the way, rescuing a corrupt CD full of cpio data is very straightforward.
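A sketch with hypothetical device and file names:

```shell
# Burn the raw archive; no mkisofs, no filesystem:
cdrecord -v dev=/dev/sr0 /backup/home.afio
# Verify: compare the disc against the archive, ignoring trailing pad:
cmp -n "$(stat -c %s /backup/home.afio)" /dev/sr0 /backup/home.afio
# Years later, salvage whatever is readable from a failing disc:
dd_rescue /dev/sr0 /tmp/recovered.afio
```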
A script to simplify creating afio backups is in the scripts section.
Verification
When you update a mirror or burn a CD backup, how do you know the data on the other end really arrived unadulterated? Especially in the early days of recordable CDs, I often saw discs that burned without errors, yet compared to the original files were corrupt everywhere. Also, most mirroring tools by default don't actually check the contents of the files to make sure they match; they just check timestamps to see if there are new modifications. Normally this is good, because it is much faster, but the disadvantage is that if one side of the mirror starts to degrade, you won't know about it.
I created a simple script called mirror-verify around rsync that reports files that exist on one side of the mirror and not the other, and also verifies that all files that exist on both sides of the mirror match using their checksums. It is in the scripts section.
A note on file selection for backups: every single *#@$ utility uses its own unique file and directory pattern syntax. The way around this is to use good old Unix find to generate the list of files you want to back up. It will never let you down, and all archive utilities worth anything will support a null-terminated filelist generated by find -print0. I have switched all my scripts to use this method.
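A runnable demo of the idea: one find expression does all the selection, and every tool consumes the same null-terminated list.

```shell
#!/bin/sh
# Select files with find, excluding one directory, into a null-terminated list.
set -e
D=/tmp/filelistdemo
rm -rf "$D"; mkdir -p "$D/keep" "$D/cache"
echo a > "$D/keep/a.txt"
echo b > "$D/cache/junk.tmp"
# Prune the cache directory, keep every other regular file:
find "$D" -path "$D/cache" -prune -o -type f -print0 > "$D/filelist0"
# Any archiver can consume the list, e.g.: xargs -0 ls -l < "$D/filelist0"
tr '\0' '\n' < "$D/filelist0" > "$D/filelist.txt"   # human-readable view
```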
Scripts
General-purpose scripts:
- backup-afio - a wrapper for afio to simplify offline afio backups
- mirror-verify - verifies backups using checksums
- backup-tree - creates an rdiff-backup of a directory tree
- mount-snapshot - mounts or unmounts a LVM snapshot
My personal backup scripts (with a few incriminating details removed):
- cron.d-backups is the file in cron.d to run the online backups, verification, freshness checks, and offline backups
- backup-daily is the daily online backup script, with the backups being performed by rdiff-backup
- backup-offline is the offline backup script, with the backups being performed by backup-afio
- backup-homedir.filelist is the shell script that generates the filelist used by backup-tree in backup-daily above
Compared to a normal user system with a single hard drive, my scheme has these additional material costs:
| Item | Cost |
|------|------|
| Extra hard drive for RAID-1 | $125 |
| USB hard drive enclosure plus another hard drive | $175 |
| Old laptop with large hard drive | $250 |
| Total | $550 |
While this is a lot of money, it is well worth it in my opinion. Over the years I have had my share of lost-work horror stories, and heard plenty more from other people. With a system like this, you can stop worrying about data loss and move on to worrying about more important things you can't do anything about, like fast-moving enormous meteors. I'm heading out to watch for them now.