Cumulus: Efficient Filesystem Backup to the Cloud
                        Implementation Overview

HIGH-LEVEL OVERVIEW
===================

There are two different classes of data stored, typically in different
directories:

The SNAPSHOT directory contains the actual backup contents.  It consists
of segment data (typically in compressed/encrypted form, one segment per
file) as well as various small per-snapshot files such as the snapshot
descriptor files (which names each snapshot and tells where to locate
the data for it) and checksum files (which list checksums of segments
for quick integrity checking).  The snapshot directory may be stored on
a remote server.  It is write-only, in the sense that data does not need
to be read from the snapshot directory to create a new snapshot, and
files in it are immutable once created (they may be deleted if they are
no longer needed, but file contents are never changed).

The LOCAL DATABASE contains indexes used during the backup process.
Files here keep track of what information is known to be stored in the
snapshot directory, so that new snapshots can appropriate re-use data.
The local database, as its name implies, should be stored somewhere
local, since random access (read and write) will be required during the
backup process.  Unlike the snapshot directory, files here are not
immutable.

Only the data stored in the snapshot directory is required to restore a
snapshot.  The local database does not need to be backed up (stored at
multiple separate locations, etc.).  The contents of the local database
can be rebuilt (at least in theory) from data in the snapshot directory
and the local filesystem; it is expected that tools will eventually be
provided to do so.

The format of data in the snapshot directory is described in format.txt.
The format of data in the local database is more fluid and may evolve
over time.  The current structure of the local database is described in
this document.


LOCAL DATABASE FORMAT
=====================

The local database directory currently contains two files:
localdb.sqlite and a statcache file.  (Actually, two types of files.  It
is possible to create snapshots using different schemes, and have them
share the same local database directory.  In this case, there will still
be one localdb.sqlite file, but one statcache file for each backup
scheme.)

Each statcache file is a plain text file, with a format similar to the
file metadata listing used in the snapshot directory.  The purpose of
the statcache file is to speed the backup process by making it possible
to determine if a file has changed since the previous snapshot by
comparing the results of a stat() system call with the data in the
statcache file, and if the file is unchanged, providing the checksum and
list of data blocks used to previously store the file.  The statcache
file is rewritten each time a snapshot is taken, and can safely be
deleted (with the only major side effect being that the first backups
after doing so will progress much more slowly).

localdb.sqlite is an SQLite database file, which is used for indexing
objects stored in the snapshot directory and various other purposes.
The database schema is contained in the file schema.sql in the Cumulus
source.  Among the data tracked by localdb.sqlite:

  - A list of segments stored in the snapshot directory.  This might not
    include all segments (segments belonging to old snapshots might be
    removed), but for correctness all segments listed in the local
    database must exist in the snapshot directory.

  - A block index which tracks objects in the snapshot directory used to
    store file data.  It is indexed by block checksum, and so can be
    used while generating a snapshot to determine if a just-read block
    of data is already stored in the snapshot directory, and if so how
    to name it.

  - A list of recent snapshots, together with a list of the objects from
    the block index they reference.

The localdb SQL database is central to data sharing and segment
cleaning.  When creating a new snapshot, information about the new
snapshot and the blocks is uses (including any new ones) is written to
the database.  Using the database, separate segment cleaning processes
can determine how much data in various segments is still live, and
determine which segments are best candidates for cleaning.  Cleaning is
performed by updating the database to mark objects in the cleaned
segments as unavailable for use in future snapshots; when the backup
process next runs, any files that would use these expired blocks instead
have a copy of the data written to a new segment.