1 Backup Format Description
2 for an LFS-Inspired Backup Solution
3 Version: "LBS Snapshot v0.6"
5 NOTE: This format specification is not yet complete. Right now the code
6 provides the best documentation of the format.
8 This document simply describes the snapshot format. It is described
9 from the point of view of a decompressor which wishes to restore the
10 files from a snapshot. It does not specify the exact behavior required
11 of the backup program writing the snapshot.
13 This document does not explain the rationale behind the format; for
20 In several places in the LBS format, a cryptographic checksum may be
21 used to allow data integrity to be verified. At the moment, only the
22 SHA-1 checksum is supported, but it is expected that other algorithms
23 will be supported in the future.
25 When a checksum is called for, the checksum is always stored in a text
26 format. The general format used is
27 <algorithm>=<hexdigits>
29 <algorithm> identifies the checksum algorithm used, and allows new
30 algorithms to be added later. At the moment, the only permissible value
31 is "sha1", indicating a SHA-1 checksum.
33 <hexdigits> is a sequence of hexadecimal digits which encode the
34 checksum value. For sha1, <hexdigits> should be precisely 40 digits
37 A sample checksum string is
38 sha1=67049e7931ad7db37b5c794d6ad146c82e5f3187
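
As an illustration only, a reader of the format might parse and check such a string roughly as follows. The function names and the ALGORITHMS table are hypothetical and not part of the format; accepting uppercase hex digits is likewise an assumption, since the text above does not address case.

```python
import hashlib

# Only "sha1" is currently permitted by the format. The table layout is an
# assumption about how further algorithms might be registered later.
ALGORITHMS = {"sha1": (hashlib.sha1, 40)}

HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_checksum(text):
    """Split '<algorithm>=<hexdigits>' and validate both parts."""
    algorithm, sep, hexdigits = text.partition("=")
    if not sep or algorithm not in ALGORITHMS:
        raise ValueError("unsupported checksum string: %r" % text)
    expected_length = ALGORITHMS[algorithm][1]
    if len(hexdigits) != expected_length or not set(hexdigits) <= HEX_DIGITS:
        raise ValueError("malformed digits in checksum: %r" % text)
    return algorithm, hexdigits.lower()


def verify_checksum(text, data):
    """Return True if the bytes in data match the checksum string."""
    algorithm, hexdigits = parse_checksum(text)
    constructor = ALGORITHMS[algorithm][0]
    return constructor(data).hexdigest() == hexdigits
```

Keeping the algorithm table separate from the parsing logic means a new algorithm would only need one more table entry.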
41 SEGMENTS & OBJECTS: STORAGE AND NAMING
42 ======================================
44 An LBS snapshot consists, at its base, of a collection of /objects/:
45 binary blobs of data, much like a file. Higher layers interpret the
46 contents of objects in various ways, but the lowest layer is simply
47 concerned with storing and naming these objects.
49 An object is a sequence of bytes (octets) of arbitrary length. An
50 object may contain as few as zero bytes (though such objects are not
51 very useful). Object sizes are potentially unbounded, but it is
52 recommended that the maximum size of objects produced be on the order of
53 megabytes. Files of essentially unlimited size can be stored in an LBS
54 snapshot using objects of modest size, so this should not cause any real
57 For storage purposes, objects are grouped together into /segments/.
58 Segments use the TAR format; each object within a segment is stored as a
59 separate file. Segments are named using UUIDs (Universally Unique
60 Identifiers), which are 128-bit numbers. The textual form of a UUID is
61 a sequence of lowercase hexadecimal digits with hyphens inserted at
62 fixed points; an example UUID is
63 a704eeae-97f2-4f30-91a4-d4473956366b
64 This segment could be stored in the filesystem as a file
65 a704eeae-97f2-4f30-91a4-d4473956366b.tar
66 The UUID used to name a segment is assigned when the segment is created.
68 Filters can be layered on top of the segment storage to provide
69 compression, encryption, or other features. For instance, the example
70 segment above might be stored as
71 a704eeae-97f2-4f30-91a4-d4473956366b.tar.bz2
73 a704eeae-97f2-4f30-91a4-d4473956366b.tar.gpg
74 if the file data had been filtered through bzip2 or gpg, respectively,
75 before storage. Filtering of segment data is outside the scope of this
76 format specification, however; it is assumed that if filtering is used,
77 the unfiltered data can be recovered when decompressing (yielding data
80 Objects within a segment are numbered sequentially. This sequence
81 number is then formatted as an 8-digit (zero-padded) hexadecimal
82 (lowercase) value. The fully qualified name of an object consists of
83 the segment name, followed by a slash ("/"), followed by the object
84 sequence number. So, for example
85 a704eeae-97f2-4f30-91a4-d4473956366b/000001ad
88 Within the segment TAR file, the filename used for each object is its
89 fully-qualified name. Thus, when extracted using the standard tar
90 utility, a segment will produce a directory with the same name as the
91 segment itself, and that directory will contain a set of
92 sequentially-numbered files each storing the contents of a single
95 NOTE: When naming an object, the segment portion consists of the UUID
96 only. Any extensions appended to the segment when storing it as a file
97 in the filesystem (for example, .tar.bz2) are _not_ part of the name of
100 There are two additional components which may appear in an object name;
103 First, a checksum may be added to the object name to express an
104 integrity constraint: the referred-to data must match the checksum
105 given. A checksum is enclosed in parentheses and appended to the object
107 a704eeae-97f2-4f30-91a4-d4473956366b/000001ad(sha1=67049e7931ad7db37b5c794d6ad146c82e5f3187)
109 Secondly, an object may be /sliced/: a subset of the bytes actually
110 stored in the object may be selected to be returned. The slice syntax
113 where <start> is the first byte to return (as a decimal offset) and
114 <length> specifies the number of bytes to return (again in decimal). It
115 is invalid to select using the slice syntax a range of bytes that does
116 not fall within the original object. The slice specification should be
117 appended to an object name, for example:
118 a704eeae-97f2-4f30-91a4-d4473956366b/000001ad[264+1000]
119 selects only bytes 264..1263 from the original object.
121 Both a checksum and a slice can be used. In this case, the checksum is
122 given first, followed by the slice. The checksum is computed over the
123 original object contents, before slicing.
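
The naming rules above can be sketched as a small parser plus a function that applies the optional checksum and slice to an object's bytes. This is a minimal sketch, not the reference implementation: ObjectRef, parse_object_ref, and resolve are hypothetical names, the [<start>+<length>] slice syntax is inferred from the example above, and only sha1 checksums are handled.

```python
import hashlib
import re
from collections import namedtuple

# Hypothetical container for the parts of an object reference.
ObjectRef = namedtuple("ObjectRef", "segment sequence checksum start length")

REF_RE = re.compile(
    r"^(?P<segment>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"/(?P<sequence>[0-9a-f]{8})"
    r"(?:\((?P<checksum>[a-z0-9]+=[0-9a-fA-F]+)\))?"    # optional (checksum)
    r"(?:\[(?P<start>[0-9]+)\+(?P<length>[0-9]+)\])?$"  # optional [start+length]
)


def parse_object_ref(text):
    """Parse 'segment/sequence', optionally followed by (checksum) and [slice]."""
    m = REF_RE.match(text)
    if m is None:
        raise ValueError("malformed object reference: %r" % text)
    has_slice = m.group("start") is not None
    return ObjectRef(
        segment=m.group("segment"),
        sequence=int(m.group("sequence"), 16),
        checksum=m.group("checksum"),
        start=int(m.group("start")) if has_slice else None,
        length=int(m.group("length")) if has_slice else None,
    )


def resolve(ref, raw):
    """Apply the checksum and slice of ref to raw, the full stored object."""
    if ref.checksum is not None:
        algorithm, _, hexdigits = ref.checksum.partition("=")
        if algorithm != "sha1":
            raise ValueError("unsupported checksum algorithm: %r" % algorithm)
        # The checksum covers the original contents, before slicing.
        if hashlib.sha1(raw).hexdigest() != hexdigits.lower():
            raise ValueError("checksum mismatch")
    if ref.start is None:
        return raw
    if ref.start + ref.length > len(raw):
        raise ValueError("slice falls outside the object")
    return raw[ref.start:ref.start + ref.length]
```

Reading the object bytes out of the segment file is deliberately left out here; resolve only needs the complete, unfiltered contents of the stored object.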
126 FILE METADATA LISTING
127 =====================
129 A snapshot stores two distinct types of data into the object store
130 described above: data and metadata. Data for a file may be stored as a
131 single object, or the data may be broken apart into blocks which are
132 stored as separate objects. The file /metadata/ log (which may be
133 spread across multiple objects) specifies the names of the files in a
134 snapshot, metadata about them such as ownership and timestamps, and
135 gives the list of objects that contain the data for the file.
137 The metadata log consists of a set of stanzas, each of which is
138 formatted somewhat like RFC 822 (email) headers. An example is:
141 checksum: sha1=11bd6ec140e4ec3110a91e1dd0f02b63b701421f
142 data: 2f46bce9-4554-4a60-a4a2-543637bd3989/000001f7
150 The meanings of all the fields are described later. A blank line
151 separates stanzas with information about different files. In addition
152 to regular stanzas, the metadata listing may contain a line containing
153 an object reference prefixed with "@". Such a line indicates that the
154 contents of the referenced object should be fetched and parsed as a
155 metadata listing at this point, prior to continuing to parse the current
158 Several common encodings are used for various fields. The encoding used
159 for each field is specified in the field listing that follows.
160 encoded string: An arbitrary string (octet sequence), with bytes
161 optionally escaped by replacing a byte with %xx, where "xx" is a
162 hexadecimal representation of the byte replaced. For example,
163 space can be replaced with "%20". This is the same escaping
164 mechanism as used in URLs.
165 integer: An integer, which may be written in decimal, octal, or
166 hexadecimal. Strings starting with 0 are interpreted as octal,
167 and those starting with 0x are interpreted as hexadecimal.
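
The two encodings above might be decoded as in the following sketch. The function names are hypothetical. Reusing Python's URL unquoting for encoded strings is a convenience that relies on the stated equivalence with URL escaping, and accepting an uppercase "0X" prefix is an assumption.

```python
from urllib.parse import unquote_to_bytes


def decode_encoded_string(text):
    """Undo %xx escaping and return the original octet sequence."""
    return unquote_to_bytes(text)


def parse_integer(text):
    """Parse an integer written in decimal, octal (0...), or hex (0x...)."""
    if text.startswith(("0x", "0X")):
        return int(text, 16)
    if text.startswith("0") and len(text) > 1:
        return int(text, 8)
    return int(text, 10)
```

For example, a mode field written as 0644 parses as octal, so parse_integer("0644") yields the same value as the Python literal 0o644.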
169 Common fields (required in all stanzas):
170 path [encoded string]: Full path of the file archived. Note: In
171 previous versions (<= 0.2) the name of this field was "name".
172 user [special]: The user ID of the file, as an integer, optionally
173 followed by a space and the corresponding username, as an
174 encoded string enclosed in parentheses.
175 group [special]: The group ID which owns the file. Encoding is the
176 same as for the user field: an integer, with an optional name in
177 parentheses following.
178 mode [integer]: Unix mode bits for the file.
179 type [special]: A single character which indicates the type of file.
180 The type indicators are meant to be consistent with the
181 characters used with the -type option to find(1), and the file
182 type checks in test(1):
190 Note that previous versions used '-' to indicate a regular file.
191 This character should not be generated in any new snapshots, but
192 may be encountered in old snapshots (those with a format version
194 mtime [integer]: Modification time of the file.
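
As a hedged illustration of the "special" encoding shared by the user and group fields, the sketch below splits such a value into a numeric ID and an optional name. parse_owner is a hypothetical helper, and decoding the name as UTF-8 is an assumption; the format itself only defines the name as an escaped octet sequence.

```python
import re
from urllib.parse import unquote

# Integer ID, optionally followed by a space and a parenthesized name.
OWNER_RE = re.compile(r"^(\S+)(?: \((.*)\))?$")


def parse_owner(value):
    """Return (numeric_id, name or None) for a user or group field value."""
    m = OWNER_RE.match(value)
    if m is None:
        raise ValueError("malformed user/group field: %r" % value)
    digits = m.group(1)
    # Same integer rules as elsewhere: 0x... is hex, a leading 0 is octal.
    if digits.startswith(("0x", "0X")):
        ident = int(digits, 16)
    elif digits.startswith("0") and len(digits) > 1:
        ident = int(digits, 8)
    else:
        ident = int(digits, 10)
    name = m.group(2)
    return ident, (unquote(name) if name is not None else None)
```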
196 Optional common fields:
197 links [integer]: Number of hard links to this file, generally only
198 reported if greater than 1.
199 inode [string]: String specifying the inode number of this file when
200 it was dumped. If "links" is greater than 1, then searching for
201 other files that have an identical "inode" value can be used to
202 determine which files should be hard-linked together when
203 restoring. The inode field should be treated as an opaque
204 string and compared for equality as such; an implementation may
205 choose whatever representation is convenient. The format
206 produced by the standard tool is <major>/<minor>/<inode> (where
207 <major> and <minor> specify the device of the containing
208 filesystem and <inode> is the inode number of the file).
209 ctime [integer]: Change time for the inode.
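
One possible way to use "links" and "inode" when restoring is sketched below. It assumes each stanza has already been parsed into a dict and that "links" has already been converted to an int; hard_link_groups is a hypothetical name.

```python
from collections import defaultdict


def hard_link_groups(stanzas):
    """Group paths that share an inode string and report more than one link.

    Each stanza is assumed to be a dict with a "path" key and, optionally,
    "links" (int) and "inode" (opaque string) keys.
    """
    groups = defaultdict(list)
    for stanza in stanzas:
        if stanza.get("links", 1) > 1 and "inode" in stanza:
            # The inode value is compared for equality only, never parsed.
            groups[stanza["inode"]].append(stanza["path"])
    return [paths for paths in groups.values() if len(paths) > 1]
```

A restore tool could then write the first path in each group normally and create the remaining paths as hard links to it.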
211 Special fields used for regular files:
212 checksum [string]: Checksum of the file contents.
213 size [integer]: Size of the file, in bytes.
214 data [reference list]: Whitespace-separated list of object
215 references. The referenced data, when concatenated in the
216 listed order, will reconstruct the file data. Any reference
217 that begins with a "@" character is an indirect reference--the
218 given object includes a whitespace-separated list of object
219 references which should be parsed in the same manner as the data
222 Special fields used for symbolic links:
223 target [encoded string]: The target of the symlink, as returned by
224 readlink(2). Note: In old versions of the format (<= 0.2), this
225 field was called "contents" instead of "target".
227 Special fields used for block and character device files:
228 device [special]: The major and minor number of the device. Encoded
229 as "major/minor", where major is the major device number encoded
230 as an integer, and minor is the minor device number.
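
The stanza structure of the metadata log could be read along the following lines. This is a sketch under several assumptions: fetch is a hypothetical callback that returns the text of a referenced object, lines beginning with whitespace are treated as RFC 822-style continuation lines, and an "@" line ends any stanza in progress. None of these details is stated explicitly by the format, and field values are left as undecoded strings.

```python
def parse_metadata(text, fetch):
    """Yield one dict of field name -> raw value per stanza."""
    stanza = {}
    last_key = None
    for line in text.splitlines():
        if line.startswith("@"):
            # Include: parse the referenced object at this point.
            if stanza:
                yield stanza
            stanza, last_key = {}, None
            yield from parse_metadata(fetch(line[1:].strip()), fetch)
        elif not line.strip():
            # A blank line separates stanzas.
            if stanza:
                yield stanza
            stanza, last_key = {}, None
        elif line[0] in " \t" and last_key is not None:
            # Continuation of the previous field (assumed convention).
            stanza[last_key] += " " + line.strip()
        else:
            key, sep, value = line.partition(":")
            if not sep:
                raise ValueError("malformed metadata line: %r" % line)
            last_key = key.strip()
            stanza[last_key] = value.strip()
    if stanza:
        yield stanza
```

Because the function is a generator, a large listing spread over many objects can be processed one stanza at a time.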
236 The snapshot descriptor is a small file which describes a single
237 snapshot. It is one of the few files which is not stored as an object
238 in the segment store. It is stored as a separate file, in plain text,
239 but in the same directory as segments are stored.
241 The name of the snapshot descriptor file is
242 snapshot-<scheme>-<timestamp>.lbs
243 <scheme> is descriptive text which can be used to distinguish several
244 logically distinct sets of snapshots (such as snapshots for two
245 different directory trees) that are being stored in the same location.
246 <timestamp> gives the date and time the snapshot was taken; the format
247 is %Y%m%dT%H%M%S (20070806T092239 means 2007-08-06 09:22:39).
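
For illustration, the descriptor filename can be taken apart as shown below. parse_descriptor_name is a hypothetical helper, and splitting on the last hyphen (so that a scheme may itself contain hyphens) is an assumption rather than something the format states.

```python
from datetime import datetime


def parse_descriptor_name(filename):
    """Return (scheme, timestamp) for 'snapshot-<scheme>-<timestamp>.lbs'."""
    prefix, suffix = "snapshot-", ".lbs"
    if not (filename.startswith(prefix) and filename.endswith(suffix)):
        raise ValueError("not a snapshot descriptor name: %r" % filename)
    body = filename[len(prefix):-len(suffix)]
    # Split on the last hyphen; the timestamp itself contains none.
    scheme, sep, stamp = body.rpartition("-")
    if not sep:
        raise ValueError("missing scheme or timestamp: %r" % filename)
    return scheme, datetime.strptime(stamp, "%Y%m%dT%H%M%S")
```

The returned datetime carries no timezone, since the filename timestamp does not include one.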
249 The contents of the descriptor are a set of RFC 822-style headers (much
250 like the metadata listing). The fields which are defined are:
251 Format: The string "LBS Snapshot v0.6" which identifies this file as
252 an LBS backup descriptor. The version number (v0.6) might
253 change if there are changes to the format. It is expected that
254 at some point, once the format is stabilized, the version
255 identifier will be changed to v1.0.
256 Producer: An informative string which identifies the program that
258 Date: The date the snapshot was produced. This matches the
259 timestamp encoded in the filename, but is written out in full.
260 A timezone is given. For example: "2007-08-06 09:22:39 -0700".
261 Scheme: The <scheme> field from the descriptor filename.
262 Segments: A whitespace-separated list of segment names. Any segment
263 which is referenced by this snapshot must be included in the
264 list, since this list can be used in garbage-collecting old
265 segments, determining which segments need to be downloaded to
266 completely reconstruct a snapshot, etc.
267 Root: A single object reference which points to the metadata
268 listing for the snapshot.
269 Checksums: A checksum file may be produced (with the same name as
270 the snapshot descriptor file, but with extension .sha1sums
271 instead of .lbs) containing SHA-1 checksums of all segments.
272 This field contains a checksum of that file.
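
Since the descriptor uses RFC 822-style headers, one way to read it is with Python's standard email parser, as in this sketch. parse_descriptor and the keys of the returned dict are hypothetical, and rejecting any Format string other than "LBS Snapshot v0.6" is a simplification; a real reader might accept older versions as well.

```python
import email
from datetime import datetime

EXPECTED_FORMAT = "LBS Snapshot v0.6"


def parse_descriptor(text):
    """Parse the text of a snapshot descriptor into a plain dict."""
    headers = email.message_from_string(text)
    if headers["Format"] != EXPECTED_FORMAT:
        raise ValueError("unrecognized Format: %r" % headers["Format"])
    if headers["Root"] is None:
        raise ValueError("descriptor has no Root field")
    date = headers["Date"]
    return {
        "producer": headers["Producer"],
        # Example value: "2007-08-06 09:22:39 -0700"
        "date": datetime.strptime(date, "%Y-%m-%d %H:%M:%S %z") if date else None,
        "scheme": headers["Scheme"],
        # split() also copes with a list folded across several lines.
        "segments": (headers["Segments"] or "").split(),
        "root": headers["Root"].strip(),
        "checksums": headers["Checksums"],
    }
```

The "segments" list is what a restore or garbage-collection tool would consult to decide which segment files belong to the snapshot.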