From: Michael Vrable Date: Mon, 14 May 2007 20:32:55 +0000 (-0700) Subject: Start writing up some design notes for LBS. X-Git-Url: https://git.vrable.net/?a=commitdiff_plain;h=136d05867858042ea22e4f8a53df97c731857e9b;p=cumulus.git Start writing up some design notes for LBS. Try to explain the rationale for the chosen design, and explain some of the other designs that were considered and rejected. --- diff --git a/design.txt b/design.txt new file mode 100644 index 0000000..6aa53e0 --- /dev/null +++ b/design.txt @@ -0,0 +1,109 @@ +This document aims to describe the goals and constraints of the LBS +design. + +======================================================================== + +OVERALL GOALS: Efficient creation and storage of multiple backup +snapshots of a filesystem tree. Logically, each snapshot is +self-contained and stores the state of all files at a single point in +time. However, it should be possible for snapshots to share the same +underlying storage, so that data duplicated in many snapshots need not +be stored multiple times. It should be possible to delete old +snapshots, and recover (most of) the storage associated with them. It +must be possible to delete old backups in any order; for example, it +must be possible to delete intermediate backups before long-term +backups. It should be possible to recover the files in a snapshot +without transferring significantly more data than that stored in the +files to be recovered. + +CONSTRAINTS: The system should not rely upon a smart server at the +remote end where backups are stored. It should be possible to create +new backups using a single primitive: StoreFile, which stores a string +of bytes at the backup server using a specified filename. Thus, backups +can be run over any file transfer protocol, without requiring special +software be installed on the storage server. + +======================================================================== + +DESIGN APPROACHES + +STORING INCREMENTAL BACKUPS + +One simple approach is to simply store a copy of every file one the +remote end, and construct a listing which tells where each file in the +source ends up on the remote server. For subsequent backups, if a file +is unchanged, the listing can simply point to the location of the file +from the previous backup. Deleting backups is simple: delete the +listing file for a particular snapshot, then garbage collect all files +which are no longer referenced. + +This approach does not as efficiently handle partial changes to large +files. If a file is changed at all, it needs to be transferred in its +entirety. One approach is to represent intra-file changes by storing +patches. The original file is kept, and a smaller file is transferred +that stores the differences between the original and the new. Some care +is needed, however. A series of small changes could accumulate over +many snapshots. If each snapshot refers to the original file, much data +will be duplicated between the patches in different snapshots. If each +patch can refer to previous patches as well, a long chain of patches can +build up, which complicates removing old backups to reclaim storage. + +An alternative approach is to break files apart into smaller units +(blocks) and to represent files in a snapshot as the concatenation of +(possibly many) blocks. Small change to files can be represented by +replacing a few of the blocks, but referring to most blocks used in the +old file directly. Some care is needed with this approach as +well--there is additional overhead needed to specify even the original +file, since the entire list of blocks must be specified. If the block +size is too small, this can lead to a large overhead, but if the block +size is too large, then sharing of file data may not be achieved. In +this scheme, data blocks do not depend on other data blocks, so chains +of dependencies do not arise as in the incremental patching scheme. +Each snapshot is independent, and so can easily be removed. + +One minor modification to this scheme is to permit the list of blocks to +specify that only a portion of a block should be used to reconstruct a +file; if, say, only the end of a block is changed, then the new backup +can refer to most of the old block, and use a new block for the small +changed part. Doing so does allow the possibility that a block might be +kept around even though a portion of it is being used, leading to wasted +space. + + +DATA STORAGE + +The simplest data storage format would place each file, patch, or block +in a separate file on the storage server. Doing so maximizes the +ability to reclaim storage when deleting old snapshots, and minimizes +the amount of extra data that must be transferred to recover a snapshot. +Any other format which combines data from multiple files/patches/blocks +together risks having needed data grouped with unwanted data. + +However, there are reasons to consider grouping, since there is overhead +associated with storing many small files. In any transfer protocol +which is not pipelined, transferring many small files may be slower than +transferring the same quantity of data in larger files. Small files may +also lead to more wasted storage space due to internal fragmentation. + +Grouping is even more important if the snapshot format breaks files +apart into blocks for storage, since the number of blocks could be far +larger than the number of files being backed up. + +======================================================================== + +SELECTED DESIGN + +At a high level, the selected design stores snapshots by breaking files +into blocks for storage, and does not use patches. These data blocks, +along with the metadata fragments (collectively, the blocks and metadata +are referred to as objects) are grouped together for storage purposes +(each storage group is called a segment). + +TAR is chosen as the format for grouping objects together into segments +rather than inventing a new format. Doing so makes it easy to +manipulate the segments using other tools, if needed. + +Data blocks for files are stored as-is. Metadata is stored in a text +format, to make it more transparent. (This should make debugging +easier, and the hope is that this will make understanding the format +simpler.)