X-Git-Url: http://git.vrable.net/?a=blobdiff_plain;f=format.txt;fp=format.txt;h=0000000000000000000000000000000000000000;hb=3fed0cd9c087ef575777056666c80397dd23530c;hp=78f32c304824cfc2266d7cbcbe1555fb286ae394;hpb=e164399889abd715cdbcd19f66ce22226dc40152;p=cumulus.git diff --git a/format.txt b/format.txt deleted file mode 100644 index 78f32c3..0000000 --- a/format.txt +++ /dev/null @@ -1,254 +0,0 @@ - Backup Format Description - for an LFS-Inspired Backup Solution - -NOTE: This format specification is not yet complete. Right now the code -provides the best documentation of the format. - -This document simply describes the snapshot format. It is described -from the point of view of a decompressor which wishes to restore the -files from a snapshot. It does not specify the exact behavior required -of the backup program writing the snapshot. - -This document does not explain the rationale behind the format; for -that, see design.txt. - - -DATA CHECKSUMS -============== - -In several places in the LBS format, a cryptographic checksum may be -used to allow data integrity to be verified. At the moment, only the -SHA-1 checksum is supported, but it is expected that other algorithms -will be supported in the future. - -When a checksum is called for, the checksum is always stored in a text -format. The general format used is - = - - identifies the checksum algorithm used, and allows new -algorithms to be added later. At the moment, the only permissible value -is "sha1", indicating a SHA-1 checksum. - - is a sequence of hexadecimal digits which encode the -checksum value. For sha1, should be precisely 40 digits -long. - -A sample checksum string is - sha1=67049e7931ad7db37b5c794d6ad146c82e5f3187 - - -SEGMENTS & OBJECTS: STORAGE AND NAMING -====================================== - -An LBS snapshot consists, at its base, of a collection of /objects/: -binary blobs of data, much like a file. Higher layers interpret the -contents of objects in various ways, but the lowest layer is simply -concerned with storing and naming these objects. - -An object is a sequence of bytes (octets) of arbitrary length. An -object may contain as few as zero bytes (though such objects are not -very useful). Object sizes are potentially unbounded, but it is -recommended that the maximum size of objects produced be on the order of -megabytes. Files of essentially unlimited size can be stored in an LBS -snapshot using objects of modest size, so this should not cause any real -restrictions. - -For storage purposes, objects are grouped together into /segments/. -Segments use the TAR format; each object within a segment is stored as a -separate file. Segments are named using UUIDs (Universally Unique -Identifiers), which are 128-bit numbers. The textual form of a UUID is -a sequence of lowercase hexadecimal digits with hyphens inserted at -fixed points; an example UUID is - a704eeae-97f2-4f30-91a4-d4473956366b -This segment could be stored in the filesystem as a file - a704eeae-97f2-4f30-91a4-d4473956366b.tar -The UUID used to name a segment is assigned when the segment is created. - -Filters can be layered on top of the segment storage to provide -compression, encryption, or other features. For example, the example -segment above might be stored as - a704eeae-97f2-4f30-91a4-d4473956366b.tar.bz2 -or - a704eeae-97f2-4f30-91a4-d4473956366b.tar.gpg -if the file data had been filtered through bzip2 or gpg, respectively, -before storage. Filtering of segment data is outside the scope of this -format specification, however; it is assumed that if filtering is used, -when decompressing the unfiltered data can be recovered (yielding data -in the TAR format). - -Objects within a segment are numbered sequentially. This sequence -number is then formatted as an 8-digit (zero-padded) hexadecimal -(lowercase) value. The fully qualified name of an object consists of -the segment name, followed by a slash ("/"), followed by the object -sequence number. So, for example - a704eeae-97f2-4f30-91a4-d4473956366b/000001ad -names an object. - -Within the segment TAR file, the filename used for each object is its -fully-qualified name. Thus, when extracted using the standard tar -utility, a segment will produce a directory with the same name as the -segment itself, and that directory will contain a set of -sequentially-numbered files each storing the contents of a single -object. - -NOTE: When naming an object, the segment portion consists of the UUID -only. Any extensions appended to the segment when storing it as a file -in the filesystem (for example, .tar.bz2) are _not_ part of the name of -the object. - -There are two additional components which may appear in an object name; -both are optional. - -First, a checksum may be added to the object name to express an -integrity constraint: the referred-to data must match the checksum -given. A checksum is enclosed in parentheses and appended to the object -name: - a704eeae-97f2-4f30-91a4-d4473956366b/000001ad(sha1=67049e7931ad7db37b5c794d6ad146c82e5f3187) - -Secondly, an object may be /sliced/: a subset of the bytes actually -stored in the object may be selected to be returned. The slice syntax -is - [+] -where is the first byte to return (as a decimal offset) and - specifies the number of bytes to return (again in decimal). It -is invalid to select using the slice syntax a range of bytes that does -not fall within the original object. The slice specification should be -appended to an object name, for example: - a704eeae-97f2-4f30-91a4-d4473956366b/000001ad[264+1000] -selects only bytes 264..1263 from the original object. - -Both a checksum and a slice can be used. In this case, the checksum is -given first, followed by the slice. The checksum is computed over the -original object contents, before slicing. - - -FILE METADATA LISTING -===================== - -A snapshot stores two distinct types of data into the object store -described above: data and metadata. Data for a file may be stored as a -single object, or the data may be broken apart into blocks which are -stored as separate objects. The file /metadata/ log (which may be -spread across multiple objects) specifies the names of the files in a -snapshot, metadata about them such as ownership and timestamps, and -gives the list of objects that contain the data for the file. - -The metadata log consists of a set of stanzas, each of which are -formatted somewhat like RFC 822 (email) headers. An example is: - - name: etc/fstab - checksum: sha1=11bd6ec140e4ec3110a91e1dd0f02b63b701421f - data: 2f46bce9-4554-4a60-a4a2-543637bd3989/000001f7 - group: 0 (root) - mode: 0644 - mtime: 1177977313 - size: 867 - type: - - user: 0 (root) - -The meanings of all the fields are described later. A blank line -separates stanzas with information about different files. In addition -to regular stanzas, the metadata listing may contain a line containing -an object reference prefixed with "@". Such a line indicates that the -contents of the referenced object should be fetched and parsed as a -metadata listing at this point, prior to continuing to parse the current -object. - -Several common encodings are used for various fields. The encoding used -for each field is specified in the field listing that follows. - encoded string: An arbitrary string (octet sequence), with bytes - optionally escaped by replacing a byte with %xx, where "xx" is a - hexadecimal representation of the byte replaced. For example, - space can be replaced with "%20". This is the same escaping - mechanism as used in URLs. - integer: An integer, which may be written in decimal, octal, or - hexadecimal. Strings starting with 0 are interpreted as octal, - and those starting with 0x are intepreted as hexadecimal. - -Common fields (required in all stanzas): - name [encoded string]: Full path of the file archived. - user [special]: The user ID of the file, as an integer, optionally - followed by a space and the corresponding username, as an - escaped string enclosed in parentheses. - group [special]: The group ID which owns the file. Encoding is the - same as for the user field: an integer, with an optional name in - parentheses following. - mode [integer]: Unix mode bits for the file. - type [special]: A single character which indicates the type of file. - The type indicators are meant to be consistent with the - characters used to indicate file type in a directory listing: - - regular file - b block device - c character device - d directory - l symlink - p pipe - s socket - mtime [integer]: Modification time of the file. - -Optional common fields: - links [integer]: Number of hard links to this file, generally only - reported if greater than 1. - inode [string]: String specifying the inode number of this file when - it was dumped. If "links" is greater than 1, then searching for - other files that have an identical "inode" value can be used to - determine which files should be hard-linked together when - restoring. The inode field should be treated as an opaque - string and compared for equality as such; an implementation may - choose whatever representation is convenient. The format - produced by the standard tool is // (where - and specify the device of the containing - filesystem and is the inode number of the file). - -Special fields used for regular files: - checksum [string]: Checksum of the file contents. - size [integer]: Size of the file, in bytes. - data [reference list]: Whitespace-separated list of object - references. The referenced data, when concatenated in the - listed order, will reconstruct the file data. Any reference - that begins with a "@" character is an indirect reference--the - given object includes a whitespace-separated list of object - references which should be parsed in the same manner as the data - field. - - -SNAPSHOT DESCRIPTOR -=================== - -The snapshot descriptor is a small file which describes a single -snapshot. It is one of the few files which is not stored as an object -in the segment store. It is stored as a separate file, in plain text, -but in the same directory as segments are stored. - -The name of snapshot descriptor file is - snapshot--.lbs - is a descriptive text which can be used to distinguish several -logically distinct sets of snapshots (such as snapshots for two -different directory trees) that are being stored in the same location. - gives the date and time the snapshot was taken; the format -is %Y%m%dT%H%M%S (20070806T092239 means 2007-08-06 09:22:39). - -The contents of the descriptor are a set of RFC 822-style headers (much -like the metadata listing). The fields which are defined are: - Format: The string "LBS Snapshot v0.2" which identifies this file as - an LBS backup descriptor. The version number (v0.2) might - change if there are changes to the format. It is expected that - at some point, once the format is stabilized, the version - identifier will be changed to v1.0. - Producer: A informative string which identifies the program that - produced the backup. - Date: The date the snapshot was produced. This matches the - timestamp encoded in the filename, but is written out in full. - A timezone is given. For example: "2007-08-06 09:22:39 -0700". - Scheme: The field from the descriptor filename. - Segments: A whitespace-seprated list of segment names. Any segment - which is referenced by this snapshot must be included in the - list, since this list can be used in garbage-collecting old - segments, determining which segments need to be downloaded to - completely reconstruct a snapshot, etc. - Root: A single object reference which points to the metadata - listing for the snapshot. - Checksums: A checksum file may be produced (with the same name as - the snapshot descriptor file, but with extension .sha1sums - instead of .lbs) containing SHA-1 checksums of all segments. - This field contains a checksum of that file.