Cumulus: Efficient Filesystem Backup to the Cloud

How to Build
------------

Dependencies:
  - libuuid (sometimes part of e2fsprogs)
  - sqlite3
  - Python (2.5 or later)
  - boto, the python interface to Amazon's Web Services (for S3 storage)
    http://code.google.com/p/boto
  - paramiko, SSH2 protocol for python (for sftp storage)
    http://www.lag.net/paramiko/

Building should be a simple matter of running "make".  This will produce
an executable called "cumulus".


Setting up Backups
------------------

Two directories are needed for backups: one for storing the backup
snapshots themselves, and one for storing bookkeeping information to go
with the backups.  In this example, the first will be "/cumulus", and
the second "/cumulus.db", but any directories will do.  Only the first
directory, /cumulus, needs to be stored somewhere safe.  The second is
only used when creating new snapshots, and is not needed when restoring.

1. Create the snapshot directory and the local database directory:
        $ mkdir /cumulus /cumulus.db

2. Initialize the local database using the provided script schema.sql
   from the source:
        $ sqlite3 /cumulus.db/localdb.sqlite
        sqlite> .read schema.sql
        sqlite> .exit

3. If encrypting or signing backups with gpg, generate appropriate
   keypairs.  The keys can be kept in a user keyring or in a separate
   keyring just for backups; this example does the latter.
        $ mkdir /cumulus.db/gpg; chmod 700 /cumulus.db/gpg
        $ gpg --homedir /cumulus.db/gpg --gen-key
            (generate a keypair for encryption; enter a passphrase for
            the secret key)
        $ gpg --homedir /cumulus.db/gpg --gen-key
            (generate a second keypair for signing; for automatic
            signing do not use a passphrase to protect the secret key)
   Be sure to store the secret key needed for decryption somewhere
   safe, perhaps with the backup itself (the key protected with an
   appropriate passphrase).
   The secret signing key need not be stored with the backups (since in
   the event of data loss, it probably isn't necessary to create future
   backups that are signed with the same key).

   To achieve better compression, the encryption key can be edited to
   alter the preferred compression algorithms to list bzip2 before
   zlib.  Run the following, appending the ID of the encryption key:
        $ gpg --homedir /cumulus.db/gpg --edit-key
        Command> pref
            (prints a terse listing of preferences associated with the
            key)
        Command> setpref
            (allows preferences to be changed; copy the same preferences
            list printed out by the previous command, but change the
            order of the compression algorithms, which start with "Z",
            to be "Z3 Z2 Z1" which stands for "BZIP2, ZLIB, ZIP")
        Command> save

   Copy the provided encryption filter program, cumulus-filter-gpg,
   somewhere it may be run from.

4. Create a script for launching the Cumulus backup process.  A simple
   version is:

        #!/bin/sh
        export LBS_GPG_HOME=/cumulus.db/gpg
        export LBS_GPG_ENC_KEY=
        export LBS_GPG_SIGN_KEY=
        cumulus --dest=/cumulus --localdb=/cumulus.db --scheme=test \
            --filter="cumulus-filter-gpg --encrypt" --filter-extension=.gpg \
            --signature-filter="cumulus-filter-gpg --clearsign" \
            /etc /home /other/paths/to/store

   Set LBS_GPG_ENC_KEY and LBS_GPG_SIGN_KEY to the IDs of the encryption
   and signing keys, and adjust any relevant paths.

   Here "--scheme=test" gives a descriptive name ("test") to this
   collection of snapshots.  It is possible to store multiple sets of
   backups in the same directory, using different scheme names to
   distinguish them.  The --scheme option can also be left out entirely.


Backup Maintenance
------------------

Segment cleaning must periodically be done to identify backup segments
that are mostly unused, but are storing a small amount of useful data.
Data in these segments will be rewritten into new segments in future
backups to eliminate the dependence on the almost-empty old segments.

The provided cumulus-util tool can perform the necessary cleaning.
Run it with
    $ cumulus-util --localdb=/cumulus.db clean
Cleaning is still under development, and so may be improved in the
future, but this version is intended to be functional.

Old backup snapshots can be pruned from the snapshot directory
(/cumulus) to recover space.  A snapshot which is still referenced by
the local database should not be deleted, however.  Deleting an old
backup snapshot is a simple matter of deleting the appropriate snapshot
descriptor file (snapshot-*.lbs) and any associated checksums
(snapshot-*.sha1sums).  Segments used by that snapshot, but not any
other snapshots, can be identified by running the clean-segments.pl
script from the /cumulus directory--this will perform a scan of the
current directory to identify unreferenced segments, and will print a
list to stdout.  Assuming the list looks reasonable, the segments can be
quickly deleted with
    $ rm `./clean-segments.pl`
A tool to make this easier will be implemented later.

The clean-segments.pl script will also print out a warning message if
any snapshots appear to depend upon segments which are not present; this
is a serious error which indicates that some of the data needed to
recover a snapshot appears to be lost.


Listing and Restoring Snapshots
-------------------------------

A listing of all currently-stored snapshots (and their sizes) can be
produced with
    $ cumulus-util --store=/cumulus list-snapshot-sizes

If data from a snapshot needs to be restored, this can be done with
    $ cumulus-util --store=/cumulus restore-snapshot \
        test-20080101T121500 /dest/dir
Here, "test-20080101T121500" is the name of the snapshot (consisting of
the scheme name and a timestamp; this can be found from the output of
list-snapshot-sizes) and "/dest/dir" is the path under which files
should be restored (this directory should initially be empty).  An
optional list of files or directories to restore may also be given on
the command line.  If none are specified, the entire snapshot is
restored.
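As a rough illustration of the snapshot naming convention described
above, the following Python sketch splits a snapshot name into its
scheme and timestamp.  It assumes the timestamp always has the form
shown in the example (YYYYMMDDTHHMMSS) and that a name without a scheme
is just the bare timestamp; this README confirms neither, and
parse_snapshot_name is a hypothetical helper, not part of cumulus-util.

```python
from datetime import datetime

# Assumed timestamp layout, inferred from the example "20080101T121500".
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%S"


def parse_snapshot_name(name):
    """Split a snapshot name into (scheme, timestamp).

    Hypothetical helper: the scheme is everything before the last "-",
    and is None when the name holds only a timestamp.
    """
    scheme, sep, stamp = name.rpartition("-")
    timestamp = datetime.strptime(stamp, TIMESTAMP_FORMAT)
    return (scheme if sep else None), timestamp


if __name__ == "__main__":
    scheme, when = parse_snapshot_name("test-20080101T121500")
    print(scheme, when.isoformat())
```

Splitting on the last "-" rather than the first keeps scheme names that
themselves contain a hyphen intact.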


Remote Backups
--------------

The cumulus-util command can operate directly on remote backups.  The
--store parameter accepts, in addition to a raw disk path, a URL.
Supported URL forms are
    file:///path        Equivalent to /path
    s3://bucket/path    Storage in Amazon S3
        (Expects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
        environment variables to be set appropriately)
    sftp://server/path  Storage on sftp server
        (note that password authentication and password-protected
        authorization keys are not currently supported; options such as
        the port or individual authorization keys must be configured in
        ~/.ssh/config, and the public key of the server has to be in
        ~/.ssh/known_hosts)

To copy backup snapshots from one storage area to another, the
cumulus-sync command can be used, as in
    $ cumulus-sync file:///cumulus s3://my-bucket/cumulus

Support for directly writing backups to a remote location (without using
a local staging directory and cumulus-sync) is slightly more
experimental, but can be achieved by replacing
    --dest=/cumulus
with
    --upload-script="cumulus-store s3://my-bucket/cumulus"


Alternate Restore Tool
----------------------

The contrib/restore.pl script is a simple program for restoring the
contents of a Cumulus snapshot.  It is not as full-featured as the
restore functionality in cumulus-util, but it is far more compact.  It
could be stored with the backup files so a tool for restores is
available even if all other data is lost.

The restore.pl script does not know how to decompress segments, so this
step must be performed manually.  Create a temporary directory for
holding all decompressed objects.  Copy the snapshot descriptor file
(*.lbs) for the snapshot to be restored to this temporary directory.
The snapshot descriptor includes a list of all segments which are needed
for the snapshot.  For each of these segments, decompress the segment
file (with gpg or the appropriate program based on whatever filter was
used), then pipe the resulting data through "tar -xf -" to extract.
Do this from the temporary directory; the temporary directory should be
filled with one directory for each segment decompressed.

Run restore.pl giving two arguments: the snapshot descriptor file
(*.lbs) in the temporary directory, and a directory where the restored
files should be written.
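
The manual decompression step described in this section can be sketched
as a small Python helper that prints one shell pipeline per segment.
This is only an illustration under stated assumptions: the mapping from
file extension to decompression program and the segment file names are
hypothetical, and extract_command is not part of Cumulus.  Only the
trailing "tar -xf -" comes from the instructions above.

```python
import os
import shlex

# Assumed mapping from segment file extension to a command that writes
# the decoded data to stdout; adjust to match the --filter actually used.
DECODERS = {
    ".gpg": ["gpg", "--decrypt"],
    ".bz2": ["bzip2", "-dc"],
    ".gz": ["gzip", "-dc"],
}


def extract_command(segment_path):
    """Return a shell pipeline that decodes one segment and untars it."""
    extension = os.path.splitext(segment_path)[1]
    decoder = DECODERS[extension]  # raises KeyError for unknown extensions
    decode = " ".join(shlex.quote(arg) for arg in decoder + [segment_path])
    return decode + " | tar -xf -"


if __name__ == "__main__":
    # Hypothetical segment file names, for illustration only.
    for segment in ["/cumulus/segment-a.tar.gpg", "/cumulus/segment-b.tar.bz2"]:
        print(extract_command(segment))
```

The printed commands are meant to be run from the temporary directory,
before invoking restore.pl.  If the keys live in a separate keyring, gpg
will also need the matching --homedir option.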