Backups, A Home-Grown Solution
filed in Server on Jun.17, 2007
For my new position at The Planet, I am responsible for the backups of our internal systems. For this, we use a powerful, easy-to-manage, reliable, redundant, super-slick product that does the job very well. But for my personal machines, I have to take a slightly different approach. You see, I don’t have a big budget with which to plan a backup strategy. I have only my wits and an offsite file server.
Requirements
Inspired by the professional backup systems I deal with at work, here are the requirements for my home-grown solution:
- Remote file dump is encrypted
- Remote files are small enough to burn on a standard CD-ROM
- Files never live on disk unencrypted - ever
- Backup files should live ONLY on the remote server
- The whole system should be backed up
- Only commonly available tools should be used
- Backups should be quick and automated
- Reconstruction of a backup should take less than an hour
- Support full, differential, and incremental backups
- Simple to redeploy on another host
While to me this list represents only the bare minimum, other aspects of a backup solution are just as important. I list them below even though my script does not take them into consideration. (After all, I have to leave something for myself to do the next time I get bored.)
- Automatic failure notification
- Logging successful completion of all backups
- Gracefully handle missed backups
- Support multiple simultaneous destinations
Alternatives
After taking a good, long look at the state of backups in the free software world, I came to the conclusion that NOBODY wants to do all the things I want to do in a single package. The closest thing I found was duplicity. Unfortunately, duplicity’s storage format (cool as it was) turned out to be the fatal shot. Duplicity keeps backups in split tar archives with an internal directory structure that maintains the metadata required to support duplicity’s features. What I was looking for were simple tarballs that I knew I could dissect easily with standard tools on a typical LiveCD.
Other solutions emerged as well, not the least of which was the product offered by my own employer. However, with such limited funds and no real business incentive to protect this data with a commercial solution, I decided that the commercial route was not for me. (This is, after all, just a collection of personal servers - data loss would be merely “inconvenient”, not “devastating”.)
It seems that the majority of free software backup solutions either centered around a local backup that I could then rsync back to the file server, provided no facility for encryption, or used obscure formats that I either couldn’t understand or couldn’t trust myself to reconstruct from a LiveCD. I didn’t want to store a local copy of the backup and rsync it over because local storage on my wimpy server is precious and I don’t want to waste it on backups. I didn’t want to simply pipe a full backup to the fileserver every hour because I have neither the time nor the bandwidth to push that much data that often! And of course there is no point in a backup system if you can’t reconstruct the backups easily.
What I wanted was some sort of rsync based solution that cached the signatures locally and encrypted everything before shipping it off to the remote host. All the while, nothing should be written to the local server in the process. This solution was simply not available, as far as I could tell.
So finally, I resolved to build this thing from scratch. Using the rsync library as a base, I wrote a rather bulky python script to do the job. It was so ugly though, using so many completely different modules, that the glue holding it together confused me enough to re-write it three times. After I gave up on python, it turned out that the simplest solution is typically the best. Who knew? A small bash script would fit the bill perfectly.
The Solution
Here’s the overview. The main backup command, tar, takes a backup of the whole system according to the options given in the script. The resulting tarball gets passed over standard output to tee, which splits it two ways. One stream goes to rdiff (a command-line interface to the rsync library, librsync), which generates a signature based on the backup. This signature can be used later to generate differentials. The second stream is encrypted by gpg, then passed over ssh to split, which breaks the final backup into chunks small enough to fit on a regular CD-ROM.
If a differential is needed instead of a full backup, the tarball is passed through rdiff again (along with a previously generated signature) to filter out the unchanged bits before being passed to gpg, ssh, and split.
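The fan-out described above can be tried on a small directory without involving gpg, ssh, or rdiff. What follows is only a sketch: the names demo_pipeline and demo_list, the 4096-byte chunk size, and the use of cksum as a stand-in for the rdiff signature step are all assumptions made for illustration, and it assumes GNU tar and GNU split.

```bash
#!/bin/bash
# demo_pipeline SRC_DIR OUT_DIR  (hypothetical helper, not part of smartback)
# tar SRC_DIR to stdout; tee feeds one copy to a stand-in "signature"
# command (cksum) and passes the other copy on to split.
demo_pipeline() {
    local src=$1 out=$2
    tar --create --file - --directory "$src" . \
        | tee >(cksum > "$out/stream.cksum") \
        | split -d -b 4096 - "$out/backup.tar.split_"
}

# demo_list OUT_DIR
# Reassemble the chunks in name order and list the archive contents.
demo_list() {
    local out=$1
    cat "$out"/backup.tar.split_* | tar --list --file -
}
```

Running demo_pipeline on a scratch directory and then demo_list on the output directory should show the same files that went in, which is the property the real pipeline depends on when the chunks are later joined with cat.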
Scheduling
Here are the three scheduling scripts I use. The weekly, daily, and hourly scripts are executed by root via the crontab.
/usr/local/sbin/smartback-weekly
#!/bin/bash
/usr/local/sbin/smartback /backup/secret \
new /backup/full.tar.sig \
backup/`hostname -s`/`date +%Y%m%d_%H`_full.tar.gpg
cp /backup/{full,hourly}.tar.sig
/usr/local/sbin/smartback-daily
#!/bin/bash
/usr/local/sbin/smartback /backup/secret \
/backup/{full,hourly}.tar.sig \
backup/`hostname -s`/`date +%Y%m%d_%H`_diff.tar.gpg
/usr/local/sbin/smartback-hourly
#!/bin/bash
/usr/local/sbin/smartback /backup/secret \
/backup/{hourly,hourly}.tar.sig \
backup/`hostname -s`/`date +%Y%m%d_%H`_incr.tar.gpg
Let me break this down. First, we’ll take a look at the scripts executed by the crontab. The basic format for the smartback script is:
smartback secret_file old_sig new_sig remote/path/for/backup.tar.gpg
secret_file holds the plaintext copy of the secret symmetric passphrase used to encrypt the backups via gpg. Since this passphrase is what unlocks the backups, it’s absolutely critical that you never, ever lose it. Since it’s a passphrase, not a private key (which you could use instead), you either memorize it or write it down. That’s up to you.
old_sig is the signature you want to base this differential backup on. If you want a primary backup instead, simply put “new” here.
new_sig is the new signature that will be generated off of the current state of your system during the execution of this backup. You can use /dev/null here if you don’t want to keep the signature, but most backup strategies will be keeping these signatures.
And finally, the remote path to save the files to will (in the current version of this script) be passed to split as the basename of the backup files on the remote host.
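The daily and hourly scripts rely on bash brace expansion to supply old_sig and new_sig in a single word. The sketch below makes that expansion visible; show_args is a hypothetical helper introduced only for this illustration.

```bash
#!/bin/bash
# show_args: print each argument on its own numbered line,
# so the result of brace expansion can be inspected.
show_args() {
    local i=1 arg
    for arg in "$@"; do
        printf '%d: %s\n' "$i" "$arg"
        i=$((i + 1))
    done
}

# Daily: old_sig is the full signature, new_sig is the hourly one.
show_args /backup/{full,hourly}.tar.sig

# Hourly: old_sig and new_sig are the same file.
show_args /backup/{hourly,hourly}.tar.sig
```

Because the hourly run reads and writes the same signature file, the main script has to write the new signature somewhere else first, which is what its temporary signature file is for.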
My particular schedule for these scripts looks like this in my crontab.
# Full every week
# Diff against full every day
# Incremental every hour
0 0 * * 0 root /usr/local/sbin/smartback-weekly
0 0 * * 1-6 root /usr/local/sbin/smartback-daily
0 1-23 * * * root /usr/local/sbin/smartback-hourly
This runs a full backup at midnight every Sunday. This full backup generally takes about an hour and a half. The daily (or differential) script runs at midnight on the other six days. This is a diff based on the most recent full backup. The hourly script, of course, runs every hour in between. These are incrementals based on whichever backup ran most recently, since every run refreshes hourly.tar.sig.
So to restore, I would retrieve the most recent full backup, the most recent daily (differential) backup, and all of the hourly (incremental) backups after that. Then, I just patch them together in order and extract the final result.
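Choosing those files by hand gets tedious, so the selection can be sketched as a small filter. This is an illustration under assumptions, not part of the original scripts: restore_chain is a hypothetical name, and it assumes the file naming produced by the scheduling scripts above (a sortable timestamp followed by _full, _diff, or _incr, with split adding a .split_NN suffix).

```bash
#!/bin/bash
# restore_chain: read chunk file names on stdin and print, in patch order,
# the backups needed to restore to the newest state in the listing.
restore_chain() {
    sed 's/\.split_[0-9]*$//' | sort -u | awk '
        # A new full backup starts the chain over.
        /_full\.tar\.gpg$/ { n = 0; chain[n++] = $0; next }
        # A diff is taken against the full, so it replaces everything after it.
        /_diff\.tar\.gpg$/ { if (n > 0) { n = 1; chain[n++] = $0 }; next }
        # An incremental builds on whatever came immediately before it.
        /_incr\.tar\.gpg$/ { if (n > 0) chain[n++] = $0 }
        END { for (i = 0; i < n; i++) print chain[i] }'
}
```

Feeding it a directory listing (for example, ls piped into restore_chain) yields the full backup first, then the latest differential, then the incrementals that follow it. It does not notice a missing or failed backup in the middle of the chain, so the result still deserves a sanity check.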
Local Files
The supporting files that this requires are all maintained (on my system) in /backup. This directory won’t be particularly large. Here are the files I have living in /backup.
/backup/full.tar.sig
This file contains the signature generated by the weekly backups. It is the basis for comparison by the daily backups and is only updated at the beginning of the week.
/backup/hourly.tar.sig
This file contains the signatures generated by the daily and hourly backups. It is used as the basis for the hourly incrementals.
/backup/most-recent-index.log
This is a log of the files touched by tar during the last backup run. I find it useful to keep around.
/backup/secret
This is a plaintext file containing the secret key used by gpg as a passphrase for symmetric encryption of the backups before they are sent to the remote host.
/backup/.ssh/id_rsa
This file is used by ssh as the RSA private key for authentication to the remote host. It allows the backup script to log in without manual user intervention. If you don’t know what ssh private keys are, you really need to learn that before trying to use this script.
The /backup directory on my server is readable only by root. Take extra-special care of that secret file and the ssh private key.
The Script
/usr/local/sbin/smartback
#!/bin/bash
# parameters
SECRET=$1;OLDSIG=$2;NEWSIG=$3;BACKFILE=$4

# basic options
BDIR='/backup'
ME='smartback'

# commands
TNEWSIG="$BDIR/.tmp_$ME.sig"
TARCMD='tar --create --preserve --verbose '
TARCMD="$TARCMD --ignore-failed-read --one-file-system "
TARCMD="$TARCMD --exclude backup "
TARCMD="$TARCMD --index-file=$BDIR/most-recent-index.log "
TARCMD="$TARCMD --directory / . "
SIGCMD="rdiff -- signature - $TNEWSIG "
DIFCMD="rdiff -- delta $OLDSIG - - "
ENCCMD="gpg --no-tty --passphrase-fd 9 -o - -c - 9< <(cat $SECRET) "
SSH="ssh -T -i $BDIR/.ssh/id_rsa -o StrictHostKeyChecking=no "
SSH="$SSH backupuser@remotehost.mysite.org "
SSHCMD_SPLIT="$SSH 'split -d -b 629145600 - $BACKFILE.split_'"

# check parameters
if [ "$4" = "" ]; then
    echo "Bake a primary: $ME <secretfile> new <newsig> <backupfname>"
    echo "Make a diff: $ME <secretfile> <oldsig> <newsig> <backupfname>"
    exit 1
fi

# one instance of this at a time
LOCKFILE="$BDIR/.$ME.pid"
if [ -e $LOCKFILE ]; then
    LOCKID=`cat $LOCKFILE`
    # match the whole PID column; ps pads it with leading spaces
    PS=`ps ax|grep -E "^ *$LOCKID "`
    if [ "$PS" = "" ]; then
        echo "Removing stale lockfile."
        rm -f $LOCKFILE
    else
        echo "$ME process $LOCKID is already running!"
        echo "If this is false, please run the following command:"
        echo " rm $LOCKFILE"
        exit 2
    fi
fi
echo $$ >$LOCKFILE

# setup command chains
RUN_NEW_CMD="$TARCMD|tee >($SIGCMD) |$ENCCMD|$SSHCMD_SPLIT"
RUN_DIF_CMD="$TARCMD|tee >($SIGCMD) |$DIFCMD|$ENCCMD|$SSHCMD_SPLIT"

# run either a primary (new signature) or
# run a diff against an existing signature
if [ "$OLDSIG" = "new" ]; then
    bash <(echo "$RUN_NEW_CMD")
else
    bash <(echo "$RUN_DIF_CMD")
fi

# mv the new signature out of a temp file to specified location
cat <$TNEWSIG >$NEWSIG;rm -f $TNEWSIG
rm -f $LOCKFILE

# summary
echo "Backup complete!"
echo "Based on: $OLDSIG"
echo "Signature: $NEWSIG"
echo "Remote file: $BACKFILE"
Now for the part you programmers have been waiting for. Here comes your opportunity to sit around the proverbial campfire and discuss my sketchy scripting and how you would have done it better.
Variables
The first significant part of this script defines the variables that will be used later.
SECRET is the path to the plaintext file holding the secret passphrase needed to perform the symmetric encryption with gpg.
OLDSIG is the path to the signature we will be using to generate a differential backup. If this variable is “new” however, we will generate a full backup instead.
NEWSIG is the path where the signature generated by this execution of the backup will be stored. This signature can be used to perform future differential backups.
BACKFILE is the path on the remote server we will be passing to split to place the final backup files.
TNEWSIG defines the filename used to temporarily store the signature based off of this current run. This file will be moved on top of NEWSIG at the end.
TARCMD, of course, is the command we will be using to generate the backup in the first place. This is just your normal tar command. Don’t use compression here! Compression negatively affects the differential calculations, and gpg will compress your backups anyway before encrypting them.
SIGCMD is the rdiff command used to generate the signature for this execution of the script.
DIFCMD is the rdiff command used to filter out the unchanged portion of the backup based on a previously generated signature. This variable is not used for full backups.
ENCCMD is the gpg command used to compress and encrypt the backups before being sent off to the remote host. Notice how carefully file descriptor redirection is used to keep the passphrase out of the process listings at runtime.
SSH stores the general options for connecting to your chosen remote host.
SSHCMD_SPLIT takes the SSH variable and expands it by adding on the split command, which will be executed by the remote host and will chop up the stream of data into manageable sizes.
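The file descriptor trick in ENCCMD can be seen in isolation with a stand-in for gpg. This is only a sketch: consume_secret_fd9 and the sample passphrase are made up for the illustration, and the stand-in reports nothing but the length of what it read.

```bash
#!/bin/bash
# Stand-in for "gpg --passphrase-fd 9": reads one line from file
# descriptor 9 and prints only its length, never the secret itself.
consume_secret_fd9() {
    local secret
    IFS= read -r secret <&9
    printf '%s\n' "${#secret}"
}

secret_file=$(mktemp)
printf 'correct horse\n' > "$secret_file"

# The secret travels over descriptor 9, so it never appears in the
# command's arguments and therefore not in "ps" output either.
consume_secret_fd9 9< "$secret_file"

rm -f "$secret_file"
```

The script opens descriptor 9 with a process substitution around cat. For a regular file, redirecting straight from the file as shown here has the same effect.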
Lockfile
The next significant portion of this script is the lockfile handling. This is just your basic lockfile check, but it protects us from having hourly backups clobber the full backups if the fulls take longer than an hour to complete.
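An alternative way to test whether the recorded process is still alive is kill -0, which sends no signal and only reports whether the PID exists. The sketch below is not what the script above does; acquire_lock is a hypothetical helper, and it assumes the caller runs as root (as the cron jobs here do), since kill -0 fails for other users’ processes when run unprivileged.

```bash
#!/bin/bash
# acquire_lock LOCKFILE
# Returns 0 after writing our PID to LOCKFILE, or 2 if the PID already
# recorded there belongs to a running process.
acquire_lock() {
    local lockfile=$1 pid
    if [ -e "$lockfile" ]; then
        pid=$(cat "$lockfile")
        # kill -0 delivers no signal; it only checks that the PID exists.
        if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
            return 2
        fi
        # The recorded process is gone, so the lockfile is stale.
        rm -f "$lockfile"
    fi
    echo $$ > "$lockfile"
}
```

Like the original check, this leaves a small window between testing for the lockfile and writing it, which is acceptable for jobs that start an hour apart.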
Execution
Finally, we check whether this execution is for a full backup or differential and execute the appropriate commands. Basically, the rdiff delta is left out of the mix if it’s a full backup. If it’s a differential, the rdiff delta is included. Simple.
Recovery
As you can see, this is a very basic script with a little bit of bash magic. Something could easily go wrong. Since I haven’t built in any good reporting, error handling, or verification (shame on me) I have to manually inspect the backups from time to time. My habit, over the past few years (especially before this script) has been to move data off of the fileserver and onto an external hard drive once per week. Occasionally, when I visit my family on the other side of Dallas/FtWorth, I will exchange this external hard drive with one stored there. Roughly once a month or any time I modify the script, I test a full restore.
There is no substitute for a full test of your recovery strategy. For anyone who actually decides to use a script similar to this one to protect their systems - here is an outline of how to recover your data.
The backup directory will look something like this after a few backups:
20070617_00_full.tar.gpg.split_00
20070617_00_full.tar.gpg.split_01
20070617_00_full.tar.gpg.split_02
20070617_00_full.tar.gpg.split_03
20070617_02_incr.tar.gpg.split_00
20070617_03_incr.tar.gpg.split_00
…
20070620_23_incr.tar.gpg.split_00
20070621_00_diff.tar.gpg.split_00
20070621_01_incr.tar.gpg.split_00
20070621_02_incr.tar.gpg.split_00
20070621_03_incr.tar.gpg.split_00
20070621_04_incr.tar.gpg.split_00
20070621_05_incr.tar.gpg.split_00
20070621_06_incr.tar.gpg.split_00
20070621_07_incr.tar.gpg.split_00
If the failure occurred at 7:30 on June 21, then this is the basic form for a recovery. (Have plenty of hard drive space available.)
cp 20070617_00_full.tar.gpg.split_* 20070621_* /target/
cd /target/
cat 20070617_00_full.tar.gpg.split_{00,01,02,03} >full.tar.gpg
gpg -d full.tar.gpg >full.tar
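for f in 00 01 02 03 04 05 06 07; do
    # ${f} needs braces, otherwise bash looks for a variable named f_
    # cat joins the chunks first in case a partial backup was split
    cat 20070621_${f}_* | gpg -d >$f.tar
    rdiff patch full.tar $f.tar new.tar
    mv new.tar full.tar
done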
tar tf full.tar
While this is not the most drivespace-friendly demonstration of recovery, it is much easier to read than the obscure layers of pipes one might alternatively use. Basically, we just grab a copy of the most recent full backup, the most recent differential, and any incremental backups we need to bring us from the differential to now. The first thing on our to-do list is to reconstruct the files split pulled apart. The cat command works just fine for that. Then we decrypt the full backup and name it full.tar. Next we simply loop through each of the partial backups, in order, decrypting them one by one and using rdiff to patch them with full.tar into a newly updated file, new.tar. Then we move new.tar on top of full.tar, because new.tar is now our most up-to-date full backup, and loop on to the next partial backup, repeating the process. At the end, we have a fully up-to-date full.tar and can list the contents with tar.
Don’t Sue Me
Of course, I’m sure there are some problems in these scripts; not everything will have been considered. For example, I don’t know how these scripts handle paths with a space in them. I never bothered with them. Also, if the connection to the remote host breaks, your hourly signature gets written anyway, making your incrementals up until the next differential irrelevant. A rolling hash locally and a comparison on the remote end would help to identify these situations, but as far as I can tell, they don’t happen often. If you are worried about that, don’t use incrementals - just fulls and differentials. That will help to limit the damage caused by network drops.
Also, please never forget that a tarball alone is seldom sufficient to fully recover a machine. Metadata such as partitioning schemes, hardware, and installed packages will often be needed for a full reconstruction. Databases and some email solutions may need dumps of their information as well to guarantee a consistent backup. Applications with backup requirements related to open files will often have some form of online procedure for securing a consistent backup. I typically run these a few minutes before the hourly backup.
This information is provided without warranty or endorsement of suitability for any particular purpose. I use this script every day and it doesn’t cause me any trouble - but if your data gets corrupted, don’t blame me for it. Also, I may have introduced typos or bugs while modifying this script for presentation in this entry. For all I know, there may be a typo in there that keeps it from running at all. How you customize, reimplement, or fix this script is up to you. More than anything, I just wanted to share my solution to what I could only imagine is a relatively common problem. The contents of the script above are donated to the public and may be downloaded, copied, modified, or used in any other way without attribution, compensation, or my permission.
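One way to soften the broken-connection problem is to install the new signature only when the whole pipeline exits cleanly. The following is a sketch under assumptions rather than a tested change to the script: commit_sig_on_success is a hypothetical helper, and it relies on bash’s pipefail option so that a failing ssh (which exits non-zero when the connection or the remote command fails) makes the pipeline as a whole fail.

```bash
#!/bin/bash
# commit_sig_on_success PIPELINE TMPSIG NEWSIG
# Runs the PIPELINE string in a child bash with pipefail enabled.
# TMPSIG is copied over NEWSIG only if every stage exits with status 0.
commit_sig_on_success() {
    local pipeline=$1 tmpsig=$2 newsig=$3
    if bash -o pipefail -c "$pipeline"; then
        cat "$tmpsig" > "$newsig" && rm -f "$tmpsig"
    else
        echo "pipeline failed; keeping the old signature" >&2
        rm -f "$tmpsig"
        return 1
    fi
}
```

In the script, the pipeline string would be RUN_NEW_CMD or RUN_DIF_CMD. Note that pipefail does not see failures inside the process substitution that runs the rdiff signature step, so this catches dropped connections but not every possible error.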
If you have any suggestions, corrections, flames, or comments, my contact info is clearly indicated on this site. However, at this time, I have been getting well over 15,000 spam messages per month. So, on the off chance that your message winds up deleted or ignored in my junk mail - don’t get your feelings hurt. I didn’t do it on purpose.
Enjoy.