From: Michael Vrable <mvrable@cs.ucsd.edu>
Date: Mon, 14 May 2007 20:32:55 +0000 (-0700)
Subject: Start writing up some design notes for LBS.
X-Git-Url: https://git.vrable.net/?a=commitdiff_plain;h=136d05867858042ea22e4f8a53df97c731857e9b;p=cumulus.git

Start writing up some design notes for LBS.

Try to explain the rationale for the chosen design, and explain some of the
other designs that were considered and rejected.
---

diff --git a/design.txt b/design.txt
new file mode 100644
index 0000000..6aa53e0
--- /dev/null
+++ b/design.txt
@@ -0,0 +1,109 @@
+This document aims to describe the goals and constraints of the LBS
+design.
+
+========================================================================
+
+OVERALL GOALS: Efficient creation and storage of multiple backup
+snapshots of a filesystem tree.  Logically, each snapshot is
+self-contained and stores the state of all files at a single point in
+time.  However, it should be possible for snapshots to share the same
+underlying storage, so that data duplicated in many snapshots need not
+be stored multiple times.  It should be possible to delete old
+snapshots, and recover (most of) the storage associated with them.  It
+must be possible to delete old backups in any order; for example, it
+must be possible to delete intermediate backups before long-term
+backups.  It should be possible to recover the files in a snapshot
+without transferring significantly more data than that stored in the
+files to be recovered.
+
+CONSTRAINTS: The system should not rely upon a smart server at the
+remote end where backups are stored.  It should be possible to create
+new backups using a single primitive: StoreFile, which stores a string
+of bytes at the backup server using a specified filename.  Thus, backups
+can be run over any file transfer protocol, without requiring special
+software be installed on the storage server.
+
+========================================================================
+
+DESIGN APPROACHES
+
+STORING INCREMENTAL BACKUPS
+
+One simple approach is to simply store a copy of every file one the
+remote end, and construct a listing which tells where each file in the
+source ends up on the remote server.  For subsequent backups, if a file
+is unchanged, the listing can simply point to the location of the file
+from the previous backup.  Deleting backups is simple: delete the
+listing file for a particular snapshot, then garbage collect all files
+which are no longer referenced.
+
+This approach does not as efficiently handle partial changes to large
+files.  If a file is changed at all, it needs to be transferred in its
+entirety.  One approach is to represent intra-file changes by storing
+patches.  The original file is kept, and a smaller file is transferred
+that stores the differences between the original and the new.  Some care
+is needed, however.  A series of small changes could accumulate over
+many snapshots.  If each snapshot refers to the original file, much data
+will be duplicated between the patches in different snapshots.  If each
+patch can refer to previous patches as well, a long chain of patches can
+build up, which complicates removing old backups to reclaim storage.
+
+An alternative approach is to break files apart into smaller units
+(blocks) and to represent files in a snapshot as the concatenation of
+(possibly many) blocks.  Small change to files can be represented by
+replacing a few of the blocks, but referring to most blocks used in the
+old file directly.  Some care is needed with this approach as
+well--there is additional overhead needed to specify even the original
+file, since the entire list of blocks must be specified.  If the block
+size is too small, this can lead to a large overhead, but if the block
+size is too large, then sharing of file data may not be achieved.  In
+this scheme, data blocks do not depend on other data blocks, so chains
+of dependencies do not arise as in the incremental patching scheme.
+Each snapshot is independent, and so can easily be removed.
+
+One minor modification to this scheme is to permit the list of blocks to
+specify that only a portion of a block should be used to reconstruct a
+file; if, say, only the end of a block is changed, then the new backup
+can refer to most of the old block, and use a new block for the small
+changed part.  Doing so does allow the possibility that a block might be
+kept around even though a portion of it is being used, leading to wasted
+space.
+
+
+DATA STORAGE
+
+The simplest data storage format would place each file, patch, or block
+in a separate file on the storage server.  Doing so maximizes the
+ability to reclaim storage when deleting old snapshots, and minimizes
+the amount of extra data that must be transferred to recover a snapshot.
+Any other format which combines data from multiple files/patches/blocks
+together risks having needed data grouped with unwanted data.
+
+However, there are reasons to consider grouping, since there is overhead
+associated with storing many small files.  In any transfer protocol
+which is not pipelined, transferring many small files may be slower than
+transferring the same quantity of data in larger files.  Small files may
+also lead to more wasted storage space due to internal fragmentation.
+
+Grouping is even more important if the snapshot format breaks files
+apart into blocks for storage, since the number of blocks could be far
+larger than the number of files being backed up.
+
+========================================================================
+
+SELECTED DESIGN
+
+At a high level, the selected design stores snapshots by breaking files
+into blocks for storage, and does not use patches.  These data blocks,
+along with the metadata fragments (collectively, the blocks and metadata
+are referred to as objects) are grouped together for storage purposes
+(each storage group is called a segment).
+
+TAR is chosen as the format for grouping objects together into segments
+rather than inventing a new format.  Doing so makes it easy to
+manipulate the segments using other tools, if needed.
+
+Data blocks for files are stored as-is.  Metadata is stored in a text
+format, to make it more transparent.  (This should make debugging
+easier, and the hope is that this will make understanding the format
+simpler.)