From 5260c952146895e6fd522777a41c48db84535964 Mon Sep 17 00:00:00 2001
From: Michael Vrable <mvrable@cs.ucsd.edu>
Date: Fri, 24 Aug 2007 10:24:09 -0700
Subject: [PATCH] Documentation improvements.

Highlights are a README file with instructions for getting started, and a
description of some implementation details, starting with the purpose and
format of the local database.
---
 README             | 136 +++++++++++++++++++++++++++++++++++++++++++++
 format.txt         |   4 ++
 implementation.txt |  91 ++++++++++++++++++++++++++++++
 3 files changed, 231 insertions(+)
 create mode 100644 README
 create mode 100644 implementation.txt
diff --git a/README b/README
new file mode 100644
index 0000000..eaf33fd
--- /dev/null
+++ b/README
@@ -0,0 +1,136 @@
+                  LBS: An LFS-Inspired Backup Solution
+
+How to Build
+------------
+
+Dependencies:
+  - libuuid
+  - sqlite3
+
+Building should be a simple matter of running "make".  This will produce
+an executable called "lbs".
+
+
+Setting up Backups
+------------------
+
+Two directories are needed for backups: one for storing the backup
+snapshots themselves, and one for storing bookkeeping information to go
+with the backups.  In this example, the first will be "/lbs", and the
+second "/lbs.db", but any directories will do.  Only the first
+directory, /lbs, needs to be stored somewhere safe.  The second is only
+used when creating new snapshots, and is not needed when restoring.
+
+  1. Create the snapshot directory and the local database directory:
+        $ mkdir /lbs /lbs.db
+
+  2. Initialize the local database using the provided script schema.sql
+     from the source:
+        $ sqlite3 /lbs.db/localdb.sqlite
+        sqlite> .read schema.sql
+        sqlite> .exit
+
+  3. If encrypting or signing backups with gpg, generate appropriate
+     keypairs.  The keys can be kept in a user keyring or in a separate
+     keyring just for backups; this example does the latter.
+        $ mkdir /lbs.db/gpg; chmod 700 /lbs.db/gpg
+        $ gpg --homedir /lbs.db/gpg --gen-key
+            (generate a keypair for encryption; enter a passphrase for
+            the secret key)
+        $ gpg --homedir /lbs.db/gpg --gen-key
+            (generate a second keypair for signing; for automatic
+            signing do not use a passphrase to protect the secret key)
+     Be sure to store the secret key needed for decryption somewhere
+     safe, perhaps with the backup itself (the key protected with an
+     appropriate passphrase).  The secret signing key need not be stored
+     with the backups (since in the event of data loss, it probably
+     isn't necessary to create future backups that are signed with the
+     same key).
+
+     To achieve better compression, the encryption key can be edited to
+     alter the preferred compression algorithms to list bzip2 before
+     zlib.  Run
+        $ gpg --homedir /lbs.db/gpg --edit-key <encryption key>
+        Command> pref
+            (prints a terse listing of preferences associated with the
+            key)
+        Command> setpref
+            (allows preferences to be changed; copy the same preferences
+            list printed out by the previous command, but change the
+            order of the compression algorithms, which start with "Z",
+            to be "Z3 Z2 Z1" which stands for "BZIP2, ZLIB, ZIP")
+        Command> save
+
+     Copy the provided encryption filter program, lbs-filter-gpg,
+     somewhere it may be run from.
+
+  4. Create a script for launching the LBS backup process.  A simple
+     version is:
+
+        #!/bin/sh
+        export LBS_GPG_HOME=/lbs.db/gpg
+        export LBS_GPG_ENC_KEY=<encryption key>
+        export LBS_GPG_SIGN_KEY=<signing key>
+        lbs --dest=/lbs --localdb=/lbs.db \
+            --filter="lbs-filter-gpg --encrypt" --filter-extension=.gpg \
+            --signature-filter="lbs-filter-gpg --clearsign" \
+            /etc /home /other/paths/to/store
+
+     Make appropriate substitutions for the key IDs and any relevant
+     paths.  If desired, insert an option "--scheme=<name>" to specify a
+     name for this backup scheme which will be included in the snapshot
+     file names (for example, use a name based on the hostname or
+     descriptive of the files backed up).
+
+
+Backup Maintenance
+------------------
+
+Segment cleaning must periodically be done to identify backup segments
+that are mostly unused, but are storing a small amount of useful data.
+Data in these segments will be rewritten into new segments in future
+backups to eliminate the dependence on the almost-empty old segments.
+
+Segment cleaning is currently a mostly manual process.  An automatic
+tool for performing segment cleaning will be available in the future.
+
+Old backup snapshots can be pruned from the snapshot directory (/lbs) to
+recover space.  Deleting an old backup snapshot is a simple matter of
+deleting the appropriate snapshot descriptor file (snapshot-*.lbs) and
+any associated checksums (snapshot-*.sha1sums).  Segments used by that
+snapshot, but not any other snapshots, can be identified by running the
+clean-segments.pl script from the /lbs directory--this will perform a
+scan of the current directory to identify unreferenced segments, and
+will print a list to stdout.  Assuming the list looks reasonable, the
+segments can be quickly deleted with
+    $ rm `./clean-segments.pl`
+
+The clean-segments.pl script will also print out a warning message if
+any snapshots appear to depend upon segments which are not present; this
+is a serious error which indicates that some of the data needed to
+recover a snapshot appears to be lost.
+
+
+Restoring a Snapshot
+--------------------
+
+The restore.pl script is a simple (proof-of-concept, really) program for
+restoring the contents of an LBS snapshot.  Ideally, it should be stored
+with the backup files so it is available if it is needed.
+
+The restore.pl script does not know how to decompress segments, so this
+step must be performed manually.  Create a temporary directory for
+holding all decompressed objects.  Copy the snapshot descriptor file
+(*.lbs) for the snapshot to be restored to this temporary directory.
+The snapshot descriptor includes a list of all segments which are needed
+for the snapshot.  For each of these segments, decompress the segment
+file (with gpg or the appropriate program based on whatever filter was
+used), then pipe the resulting data through "tar -xf -" to extract.  Do
+this from the temporary directory; the temporary directory should be
+filled with one directory for each segment decompressed.
+
+Run restore.pl giving two arguments: the snapshot descriptor file
+(*.lbs) in the temporary directory, and a directory where the restored
+files should be written.
+
+A better recovery tool will be provided in the future.
diff --git a/format.txt b/format.txt
index 54d9f8c..78f32c3 100644
--- a/format.txt
+++ b/format.txt
@@ -248,3 +248,7 @@ like the metadata listing).  The fields which are defined are:
         completely reconstruct a snapshot, etc.
     Root: A single object reference which points to the metadata
         listing for the snapshot.
+    Checksums: A checksum file may be produced (with the same name as
+        the snapshot descriptor file, but with extension .sha1sums
+        instead of .lbs) containing SHA-1 checksums of all segments.
+        This field contains a checksum of that file.
diff --git a/implementation.txt b/implementation.txt
new file mode 100644
index 0000000..1ba78ac
--- /dev/null
+++ b/implementation.txt
@@ -0,0 +1,91 @@
+                  LBS: An LFS-Inspired Backup Solution
+                        Implementation Overview
+
+HIGH-LEVEL OVERVIEW
+===================
+
+There are two different classes of data stored, typically in different
+directories:
+
+The SNAPSHOT directory contains the actual backup contents.  It consists
+of segment data (typically in compressed/encrypted form, one segment per
+file) as well as various small per-snapshot files such as the snapshot
+descriptor files (which name each snapshot and tell where to locate
+the data for it) and checksum files (which list checksums of segments
+for quick integrity checking).  The snapshot directory may be stored on
+a remote server.  It is write-only, in the sense that data does not need
+to be read from the snapshot directory to create a new snapshot, and
+files in it are immutable once created (they may be deleted if they are
+no longer needed, but file contents are never changed).
+
+The LOCAL DATABASE contains indexes used during the backup process.
+Files here keep track of what information is known to be stored in the
+snapshot directory, so that new snapshots can appropriately re-use data.
+The local database, as its name implies, should be stored somewhere
+local, since random access (read and write) will be required during the
+backup process.  Unlike the snapshot directory, files here are not
+immutable.
+
+Only the data stored in the snapshot directory is required to restore a
+snapshot.  The local database does not need to be backed up (stored at
+multiple separate locations, etc.).  The contents of the local database
+can be rebuilt (at least in theory) from data in the snapshot directory
+and the local filesystem; it is expected that tools will eventually be
+provided to do so.
+
+The format of data in the snapshot directory is described in format.txt.
+The format of data in the local database is more fluid and may evolve
+over time.  The current structure of the local database is described in
+this document.
+
+
+LOCAL DATABASE FORMAT
+=====================
+
+The local database directory currently contains two files:
+localdb.sqlite and a statcache file.  (Actually, two types of files.  It
+is possible to create snapshots using different schemes, and have them
+share the same local database directory.  In this case, there will still
+be one localdb.sqlite file, but one statcache file for each backup
+scheme.)
+
+Each statcache file is a plain text file, with a format similar to the
+file metadata listing used in the snapshot directory.  The purpose of
+the statcache file is to speed the backup process by making it possible
+to determine if a file has changed since the previous snapshot by
+comparing the results of a stat() system call with the data in the
+statcache file, and if the file is unchanged, providing the checksum and
+list of data blocks used to previously store the file.  The statcache
+file is rewritten each time a snapshot is taken, and can safely be
+deleted (with the only major side effect being that the first backups
+after doing so will progress much more slowly).
+
+localdb.sqlite is an SQLite database file, which is used for indexing
+objects stored in the snapshot directory and various other purposes.
+The database schema is contained in the file schema.sql in the LBS
+source.  Among the data tracked by localdb.sqlite:
+
+  - A list of segments stored in the snapshot directory.  This might not
+    include all segments (segments belonging to old snapshots might be
+    removed), but for correctness all segments listed in the local
+    database must exist in the snapshot directory.
+
+  - A block index which tracks objects in the snapshot directory used to
+    store file data.  It is indexed by block checksum, and so can be
+    used while generating a snapshot to determine if a just-read block
+    of data is already stored in the snapshot directory, and if so how
+    to name it.
+
+  - A list of recent snapshots, together with a list of the objects from
+    the block index they reference.
+
+The localdb SQL database is central to data sharing and segment
+cleaning.  When creating a new snapshot, information about the new
+snapshot and the blocks it uses (including any new ones) is written to
+the database.  Using the database, separate segment cleaning processes
+can determine how much data in various segments is still live, and
+determine which segments are best candidates for cleaning.  Cleaning is
+performed by updating the database to mark objects in the cleaned
+segments as unavailable for use in future snapshots; when the backup
+process next runs, any files that would use these expired blocks instead
+have a copy of the data written to a new segment.
-- 
2.20.1