Replace boost::scoped_ptr with std::unique_ptr.

diff --git a/doc/format.txt b/doc/format.txt

index 5c9d5fa..2c40075 100644
--- a/doc/format.txt
+++ b/doc/format.txt
@@ -1,23 +1,52 @@
                         Backup Format Description
-                  for an LFS-Inspired Backup Solution
-                      Version: "LBS Snapshot v0.6"
+         for Cumulus: Efficient Filesystem Backup to the Cloud
+                   Version: "Cumulus Snapshot v0.11"
  
-NOTE: This format specification is not yet complete.  Right now the code
-provides the best documentation of the format.
+NOTE: This format specification is intended to be mostly stable, but is
+still subject to change before the 1.0 release.  The code may provide
+additional useful documentation on the format.
+
+NOTE2: The name of this project has changed from LBS to Cumulus.  In
+some areas the name "LBS" is still used.
  
  This document simply describes the snapshot format.  It is described
  from the point of view of a decompressor which wishes to restore the
  files from a snapshot.  It does not specify the exact behavior required
-of the backup program writing the snapshot.
+of the backup program writing the snapshot.  For details of the current
+backup program, see implementation.txt.
  
  This document does not explain the rationale behind the format; for
  that, see design.txt.
  
  
+BACKUP REPOSITORY LAYOUT
+========================
+
+Cumulus backups are stored using a relatively simple layout.  Data files
+described below are written into one of several directories on the
+backup server, depending on their purpose:
+    snapshots/
+        Snapshot descriptor files, which quickly summarize each backup
+        snapshot stored.
+    segments0/
+    segments1/
+        Storage of the bulk of the backup data, in compressed/encrypted
+        form.  Technically any segment could be stored in either
+        directory (both directories will be searched when looking for a
+        segment).  However, data in segments0 might be faster to access
+        (but more expensive) depending on the storage backend.  The
+        intent is that segments0 can store filesystem tree metadata and
+        segments1 can store file contents.
+    meta/
+        Snapshot-specific metadata that is not core to the backup.  This
+        can include checksums of segments, some data for rebuilding
+        local database contents, etc.
+
+
  DATA CHECKSUMS
  ==============
  
-In several places in the LBS format, a cryptographic checksum may be
+In several places in the Cumulus format, a cryptographic checksum may be
  used to allow data integrity to be verified.  At the moment, only the
  SHA-1 checksum is supported, but it is expected that other algorithms
  will be supported in the future.
@@ -27,8 +56,10 @@ format.  The general format used is
      <algorithm>=<hexdigits>
  
  <algorithm> identifies the checksum algorithm used, and allows new
-algorithms to be added later.  At the moment, the only permissible value
-is "sha1", indicating a SHA-1 checksum.
+algorithms to be added later.  Permissible values are:
+    "sha1": SHA-1
+    "sha224": SHA-224 (added in version 0.11)
+    "sha256": SHA-256 (added in version 0.11)
  
  <hexdigits> is a sequence of hexadecimal digits which encode the
  checksum value.  For sha1, <hexdigits> should be precisely 40 digits
@@ -41,7 +72,7 @@ A sample checksum string is
  SEGMENTS & OBJECTS: STORAGE AND NAMING
  ======================================
  
-An LBS snapshot consists, at its base, of a collection of /objects/:
+A Cumulus snapshot consists, at its base, of a collection of /objects/:
  binary blobs of data, much like a file.  Higher layers interpret the
  contents of objects in various ways, but the lowest layer is simply
  concerned with storing and naming these objects.
@@ -50,9 +81,9 @@ An object is a sequence of bytes (octets) of arbitrary length.  An
  object may contain as few as zero bytes (though such objects are not
  very useful).  Object sizes are potentially unbounded, but it is
  recommended that the maximum size of objects produced be on the order of
-megabytes.  Files of essentially unlimited size can be stored in an LBS
-snapshot using objects of modest size, so this should not cause any real
-restrictions.
+megabytes.  Files of essentially unlimited size can be stored in a
+Cumulus snapshot using objects of modest size, so this should not cause
+any real restrictions.
  
  For storage purposes, objects are grouped together into /segments/.
  Segments use the TAR format; each object within a segment is stored as a
@@ -64,6 +95,8 @@ fixed points; an example UUID is
  This segment could be stored in the filesystem as a file
      a704eeae-97f2-4f30-91a4-d4473956366b.tar
  The UUID used to name a segment is assigned when the segment is created.
+These files are stored in either the segments0 or segments1 directories
+on the backup server.
  
  Filters can be layered on top of the segment storage to provide
  compression, encryption, or other features.  For example, the example
@@ -94,8 +127,8 @@ object.
  
  NOTE: When naming an object, the segment portion consists of the UUID
  only.  Any extensions appended to the segment when storing it as a file
-in the filesystem (for example, .tar.bz2) are _not_ part of the name of
-the object.
+in the filesystem (for example, .tar.bz2) and path information (for
+example, segments0) are _not_ part of the name of the object.
  
  There are two additional components which may appear in an object name;
  both are optional.
@@ -118,10 +151,35 @@ appended to an object name, for example:
      a704eeae-97f2-4f30-91a4-d4473956366b/000001ad[264+1000]
  selects only bytes 264..1263 from the original object.
  
+The slice syntax
+    [<length>]
+indicates that all bytes of the object are to be used, but
+additionally asserts that the referenced object is exactly <length>
+bytes long.  Older versions of Cumulus can also use the syntax
+    [=<length>]
+as a synonym for length assertions, but this notation is deprecated.
+
+(In older versions of the format, the syntax [<length>] was a shorthand
+for [0+<length>]: that is, select the first <length> bytes of the object
+but make no assertions about the overall size.  The backup tool has not
+generated such slices since v0.8.)
+
  Both a checksum and a slice can be used.  In this case, the checksum is
  given first, followed by the slice.  The checksum is computed over the
  original object contents, before slicing.
  
+Special Objects
+---------------
+
+In addition to the standard syntax for objects described above, the
+special name "zero" may be used instead of segment/sequence number.
+This represents an object consisting entirely of zeroes.  The zero
+object must have a slice specification appended to indicate the size of
+the object.  For example
+    zero[1024]
+represents a block consisting of 1024 null bytes.  A checksum should not
+be given.
+
  
  FILE METADATA LISTING
  =====================
@@ -244,20 +302,29 @@ The name of snapshot descriptor file is
  logically distinct sets of snapshots (such as snapshots for two
  different directory trees) that are being stored in the same location.
  <timestamp> gives the date and time the snapshot was taken; the format
-is %Y%m%dT%H%M%S (20070806T092239 means 2007-08-06 09:22:39).
+is %Y%m%dT%H%M%S (20070806T092239 means 2007-08-06 09:22:39).  It is
+recommended that the timestamp be given in UTC for consistent sorting
+even if the offset from UTC to local time changes; however, the
+authoritative timestamp (including timezone) can be found in the Date
+field.  (In versions v0.10 and earlier the timestamp is given in local
+time; in current versions UTC is used.)
  
  The contents of the descriptor are a set of RFC 822-style headers (much
  like the metadata listing).  The fields which are defined are:
-    Format: The string "LBS Snapshot v0.6" which identifies this file as
-        an LBS backup descriptor.  The version number (v0.6) might
-        change if there are changes to the format.  It is expected that
-        at some point, once the format is stabilized, the version
-        identifier will be changed to v1.0.
+    Format: The string "Cumulus Snapshot v0.11" which identifies this
+        file as a Cumulus backup descriptor.  The version number (v0.11)
+        might change if there are changes to the format.  It is expected
+        that at some point, once the format is stabilized, the version
+        identifier will be changed to v1.0.  (Format versions v0.8 and
+        earlier used the string "LBS Snapshot" instead of
+        "Cumulus Snapshot", reflecting an earlier name for the project.
+        Consumers should be prepared for either name.)
    Producer: An informative string which identifies the program that
          produced the backup.
-    Date: The date the snapshot was produced.  This matches the
-        timestamp encoded in the filename, but is written out in full.
-        A timezone is given.  For example: "2007-08-06 09:22:39 -0700".
+    Date: The date the snapshot was produced, in the local time zone.
+        This matches the timestamp encoded in the filename, but is
+        written out in full.  A timezone (offset from UTC) is given.
+        For example: "2007-08-06 02:22:39 -0700".
      Scheme: The <scheme> field from the descriptor filename.
    Segments: A whitespace-separated list of segment names.  Any segment
          which is referenced by this snapshot must be included in the
@@ -270,3 +337,6 @@ like the metadata listing).  The fields which are defined are:
          the snapshot descriptor file, but with extension .sha1sums
          instead of .lbs) containing SHA-1 checksums of all segments.
          This field contains a checksum of that file.
+    Intent: Informational; records the value of the --intent flag when
+        the snapshot was created, and can be used when determining which
+        snapshots to later delete.
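
The relationship between the descriptor filename timestamp and the Date
field described above can be checked mechanically.  The following is a
minimal Python sketch, assuming the current convention in which the
filename timestamp is in UTC and the Date field carries a local time
plus an offset from UTC.  The function names are hypothetical and are
not part of Cumulus.

```python
from datetime import datetime, timezone


def parse_filename_timestamp(stamp):
    """Parse a %Y%m%dT%H%M%S filename timestamp.

    Assumption: the timestamp is in UTC, as in current versions of the
    format.  Snapshots from v0.10 and earlier used local time, which
    this sketch does not handle.
    """
    naive = datetime.strptime(stamp, "%Y%m%dT%H%M%S")
    return naive.replace(tzinfo=timezone.utc)


def parse_date_field(value):
    """Parse a Date field such as "2007-08-06 02:22:39 -0700"."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S %z")


def same_instant(stamp, date_field):
    """Return True if the filename timestamp and the Date field refer
    to the same moment in time."""
    return parse_filename_timestamp(stamp) == parse_date_field(date_field)
```

With the examples used in this document, same_instant("20070806T092239",
"2007-08-06 02:22:39 -0700") holds, since 02:22:39 at seven hours
behind UTC is 09:22:39 UTC.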
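
The slice forms and the "zero" special object introduced in the object
naming section of this diff can be illustrated in the same way.  This
is a sketch under stated assumptions, not the Cumulus implementation:
the names Slice, parse_slice, apply_slice and zero_object are
hypothetical, [<length>] is given its current meaning (an exact-length
assertion) rather than the pre-v0.8 shorthand for [0+<length>], and
raising ValueError on a failed assertion or an out-of-range slice is a
choice made here, since the text does not specify error behavior.

```python
import re
from typing import NamedTuple


class Slice(NamedTuple):
    start: int
    length: int
    exact: bool  # True: the object must be exactly `length` bytes long


# Matches a trailing [<start>+<length>], [<length>] or [=<length>].
_SLICE_RE = re.compile(r"\[(?:(\d+)\+(\d+)|=?(\d+))\]$")


def parse_slice(ref):
    """Split a reference into (base name, Slice or None).

    Any checksum component is left untouched inside the base name.
    """
    m = _SLICE_RE.search(ref)
    if m is None:
        return ref, None
    base = ref[:m.start()]
    if m.group(1) is not None:
        return base, Slice(int(m.group(1)), int(m.group(2)), False)
    # [<length>] and the deprecated [=<length>] both assert the size.
    return base, Slice(0, int(m.group(3)), True)


def apply_slice(data, sl):
    """Return the bytes of `data` selected by `sl` (all of it if None)."""
    if sl is None:
        return data
    if sl.exact and len(data) != sl.length:
        raise ValueError("object length assertion failed")
    if sl.start + sl.length > len(data):
        raise ValueError("slice extends past end of object")
    return data[sl.start:sl.start + sl.length]


def zero_object(ref):
    """Materialize a reference such as "zero[1024]" as null bytes."""
    base, sl = parse_slice(ref)
    if base != "zero" or sl is None:
        raise ValueError("not a sized zero object reference")
    return bytes(sl.length)
```

For the example reference ending in 000001ad[264+1000], parse_slice
yields a start of 264 and a length of 1000, and apply_slice then
returns bytes 264..1263 of the object.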