Backup Format Description for an LFS-Inspired Backup Solution

NOTE: This is simply a proposal at this point in time, and not yet
implemented.  Details are subject to change.

========================================================================

Goals:

To provide a stable and extensible data storage format for efficient
remote filesystem backups.

Among the features desired in the format are:
  - Support for grouping unchanging file contents together, and reusing
    them for future backups.
  - Nonetheless allow old backups to be deleted (at least those parts
    that are not also used by newer backups).
  - Support some form of rdiff-style incremental differences within a
    file.

The current plan is to implement compression and encryption separately:
not as part of the base format, but simply by passing the backup data
through filters such as bzip2 or gpg.

Data is organized into a collection of _objects_, which are grouped
together for storage purposes into _segments_.  Objects may refer to
other objects; a snapshot consists of a tree object which in turn refers
to other objects containing file data.  A new snapshot may refer to
existing file-data objects from older snapshots when the corresponding
files have not changed.

========================================================================

Object naming:

  - Each segment is assigned a unique 128-bit identifier (uuid).  Each
    segment is stored as a separate file whose name is based on its
    uuid.
  - Objects within a segment are numbered sequentially, with a 32-bit
    counter.

Thus, each object may be referred to with a unique 160-bit (128 + 32)
identifier.

Segment structure:

There are two main options:
  - Streaming format: Each object is prepended with a header, and then
    all (header, object) pairs are concatenated.  This is inspired by
    the tar file format.  It can be written out in one pass and read
    back in one pass, and it is well-adapted to streaming
    transformations such as compression.
  - Indexed format: Each segment contains a table giving the starting
    position and length of each object.  This is somewhat similar to
    PDF.  Data can still be written out in a single pass, but reading
    will require random access.

File attributes:

Metadata for each file is stored in a dictionary.  Dictionary keys
include:
    type:   uint8_t ('p', 's', 'c', 'b', 'l', 'd', '-')
    mode:   uint16_t
    user:   uint32_t
    group:  uint32_t
    size:   int64_t
    atime:  int64_t
    mtime:  int64_t
    ctime:  int64_t
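
As an illustration only, the object naming scheme and the attribute
dictionary described above might be sketched in Python as follows.  The
proposal does not fix a serialization, so the function names
(pack_object_ref, unpack_object_ref, check_attributes), the big-endian
byte order, and the idea of packing a reference into 20 raw bytes are
all assumptions made for this sketch, not part of the format.

```python
import struct
import uuid

# Hypothetical table mapping each dictionary key to a struct format
# code whose width matches the type listed in the proposal.
ATTR_FORMATS = {
    "type": "B",   # uint8_t, stored as one character code
    "mode": "H",   # uint16_t
    "user": "I",   # uint32_t
    "group": "I",  # uint32_t
    "size": "q",   # int64_t
    "atime": "q",  # int64_t
    "mtime": "q",  # int64_t
    "ctime": "q",  # int64_t
}
FILE_TYPES = set("pscbld-")


def pack_object_ref(segment, index):
    """Pack a (segment uuid, object index) pair into 20 bytes (160 bits).

    Assumed layout: 16 uuid bytes followed by a big-endian 32-bit index.
    """
    if not 0 <= index < 2**32:
        raise ValueError("object index must fit in 32 bits")
    return segment.bytes + struct.pack(">I", index)


def unpack_object_ref(data):
    """Inverse of pack_object_ref under the same assumed layout."""
    if len(data) != 20:
        raise ValueError("object reference must be exactly 20 bytes")
    return uuid.UUID(bytes=data[:16]), struct.unpack(">I", data[16:])[0]


def check_attributes(attrs):
    """Raise ValueError if a key is unknown or a value does not fit."""
    for key, value in attrs.items():
        if key not in ATTR_FORMATS:
            raise ValueError("unknown attribute: " + key)
        if key == "type":
            if value not in FILE_TYPES:
                raise ValueError("unknown file type: " + repr(value))
            value = ord(value)
        try:
            # struct.pack rejects values outside the declared width.
            struct.pack(">" + ATTR_FORMATS[key], value)
        except struct.error as exc:
            raise ValueError(key + " out of range: " + repr(value)) from exc


# Example use: name object 7 of a freshly created segment.
segment_id = uuid.uuid4()
ref = pack_object_ref(segment_id, 7)
```

Packing the reference into a fixed 20 bytes keeps the 128 + 32 bit
split explicit; a textual form (uuid string plus a number) would work
equally well if the final format turns out to be text-based.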