From: Michael Vrable
Date: Wed, 30 Jun 2010 23:59:14 +0000 (-0700)
Subject: Some new format design notes.
X-Git-Url: http://git.vrable.net/?a=commitdiff_plain;h=ecf88ab2b1c8d590c6db57571ad9d7e16c2df45b;p=bluesky.git

Some new format design notes.
---

diff --git a/bluesky/DESIGN b/bluesky/DESIGN
new file mode 100644
index 0000000..78ad412
--- /dev/null
+++ b/bluesky/DESIGN
@@ -0,0 +1,124 @@

                        BlueSky Storage Design

At a basic level we use a fairly standard filesystem design: the
filesystem consists of a collection of inodes which in turn point to
data blocks.

There are three different levels at which data can be stored/committed:
    1. In memory within the BlueSky proxy
    2. Written to local disk at the proxy
    3. Committed to cloud storage

Updates can be made at level 1 with very low overhead, and the hope is
that consecutive updates are batched together at this point.  Data is
flushed from 1 to 2 by writing out updated data blocks and serializing
new versions of inodes.  In a log-structured design, data is committed
from level 2 to 3 by grouping dependent items together into log segments
and flushing those log segments to the cloud.  Some data might be
garbage-collected if it becomes dead before it is flushed to level 3.
Encryption is likely implemented only at level 3 (or possibly at level
2).

Level 1 is primarily an implementation detail of the BlueSky library and
need not preserve any inter-version compatibility.

For level 2 we have a collection of objects with edges between them
representing two types of dependencies: data dependencies (as between an
inode and the data blocks it references), and ordering dependencies (for
example, a directory should not be committed until any new inodes it
references are also committed, though there is no data dependency, since
the directory only references the inode number, not the inode itself).
Data blocks are unversioned: if we change the data in a file, we write a
new data block and point the inode at the new block name.  Inodes are
versioned (since we can update an inode, but references to it in
directory entries are not updated), so we need a mechanism to keep track
of the most recent version of an inode--an inode map.  Another way of
looking at this is that data block pointers can be dereferenced
directly, but dereferencing an inode number requires a layer of
indirection (the inode map).

Should objects have back references to track where pointers to them
exist?  One simple implementation would be to track which inode each
data block belongs to, though this does not work well with snapshots or
deduplication.

Level 3 consists of objects from level 2, aggregated into log segments.
There are a few complications.  At level 3 we add in encryption, but:
  - We want to be able to fetch individual objects using range requests,
    so the encryption needs to be either per-object or allow decryption
    to start from mid-file.
  - Log cleaning ought to be able to run in the cloud, without the
    encryption key, so some data, such as inter-object references, must
    be outside the encryption wrapper.

A possible level-3 object format (sketched in code below):
    UNPROTECTED
        List of referenced objects and locations
    AUTHENTICATED
        Object identifier (does not change even if the segment is repacked)
        Object identifiers for referenced objects?
    ENCRYPTED
        Data payload
        (References are given as indexes into the list in the unprotected
        section, so a cleaner can rearrange objects without decrypting.)
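
To make the layout concrete, here is a minimal serialization sketch in
Python.  It is illustrative only: the function and field names, the use
of JSON and an HMAC for the headers, and the opaque encrypt callable are
assumptions made for the example, not part of the design; a real
implementation would presumably use a compact binary encoding and an
authenticated cipher keyed with the filesystem key.

    # A minimal sketch of the three-section object layout described above.
    # Standard library only; "encrypt" stands in for whatever cipher is
    # used with the filesystem key, and JSON stands in for a real binary
    # encoding.
    import hashlib, hmac, json, struct

    def pack_object(object_id, referenced, payload, mac_key, encrypt):
        """Serialize one level-3 object.

        object_id  -- stable identifier; unchanged even if the segment is
                      repacked by the cleaner
        referenced -- list of (ref_id, segment, offset, length) tuples for
                      the objects this object points to
        payload    -- plaintext data; pointers inside it are indexes into
                      "referenced", so locations can change without the
                      ciphertext changing
        """
        # UNPROTECTED: reference locations, readable and rewritable by the
        # cleaner.
        unprotected = json.dumps(
            [[seg, off, length] for (_id, seg, off, length) in referenced]
        ).encode()

        # AUTHENTICATED: identifiers bound to the object but not secret.
        authenticated = json.dumps({
            "id": object_id,
            "refs": [ref_id for (ref_id, _seg, _off, _len) in referenced],
        }).encode()
        tag = hmac.new(mac_key, authenticated, hashlib.sha256).digest()

        # ENCRYPTED: the data payload itself.
        ciphertext = encrypt(payload)

        header = struct.pack("<4sIII", b"OBJ1", len(unprotected),
                             len(authenticated), len(ciphertext))
        return header + unprotected + authenticated + tag + ciphertext

Because the reference locations sit ahead of the ciphertext in a section
of known length, a cleaner can rewrite them without touching the
ciphertext, and a single range request covering the object's bytes gives
the proxy everything it needs to authenticate and decrypt it.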

Object Types/Formats

    SUPERBLOCK
        Either stored separately from the log segments at a well-known
        location, or, if log segments are named in a well-known fashion,
        placed at a known location within a log segment.

        Contains pointers to inode maps, and perhaps to old superblocks
        too if we don't want to rewrite all of this information each
        time.

    INODE MAP BLOCK
        Lists the current location of each inode in the logs, for some
        range of the inode number space.

    INODE
        Contains file metadata and pointers to data blocks.  The metadata
        can be encrypted, but the cleaner process needs read/write access
        to the data pointers.

        In addition to the plaintext pointers there should be a way to
        validate that the pointed-to data is correct.  This could be
        either a hash of the data block pointed to, or an ID stored with
        the data block (where the ID does not change even if the data
        block is relocated to another log segment).

    DATA BLOCK
        Encrypted file data, but includes a back reference to the inode
        using this block as plaintext.  (The back reference is simply the
        inode number, and possibly, though it is not needed, the offset
        within the file.)

Determining live data:

    Checking whether an inode is live can be done by comparing against
    the current inode map.

    To check whether data is live, each data block lists the inode it
    belongs to.  The data is live if the most recent version of that
    inode still points to this block.  These back references are also
    used when data is relocated during cleaning.  This does mean,
    however, that each block can only be used in one location (no
    deduplication support), unless we add some other mechanism for
    tracking back references (there is one bit of related work that
    might apply, but it's not worth implementing now).

On-line Cleaning

    Online cleaning is a specialized case of handling concurrent writers
    to the same filesystem, with a few twists.

    The cleaning process should be able to run in EC2 without the
    filesystem encryption key, so pointers between objects must not be
    encrypted.

    The proxy editing the filesystem and the cleaner may run in
    parallel, each writing to a separate log head.  One process or the
    other must then merge any divergent changes together.  This should
    be easy to do in this one specific case, though, since only the
    proxy is actually changing data and the cleaner is only rearranging
    objects in the logs (and updating pointers), so there shouldn't be
    conflicts that can't be handled automatically.
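
The liveness rule and the log-head merge described above can be made
concrete with a small sketch.  Again this is illustrative Python under
stated assumptions, not the actual BlueSky structures: each log head is
assumed to be summarized by an inode map of the form
{inode number: (version, location)}, inodes are assumed to carry a list
of data block locations, and the cleaner is assumed never to change
inode contents, only their locations.  Deleted inodes would need some
explicit tombstone handling, which is ignored here.

    # A minimal sketch, assuming inode maps of the form
    # {inode_number: (version, location)} and inodes that carry a
    # "blocks" list of data block locations.  Names are illustrative.

    def block_is_live(inode_map, fetch_inode, block_location, back_ref_inode):
        """A data block is live if the current version of the inode named
        in its back reference still points at this block's location."""
        if back_ref_inode not in inode_map:
            return False                        # the inode itself is gone
        _version, inode_location = inode_map[back_ref_inode]
        inode = fetch_inode(inode_location)     # read the inode from its segment
        return block_location in inode["blocks"]

    def merge_inode_maps(base, proxy, cleaner):
        """Merge divergent inode maps written by the proxy and the cleaner.

        The proxy may write new versions of inodes; the cleaner only
        relocates existing versions.  So wherever the proxy has a newer
        version it wins, and otherwise the cleaner's (possibly relocated)
        entry is equivalent to the original and can be taken."""
        merged = {}
        for ino in set(base) | set(proxy) | set(cleaner):
            p = proxy.get(ino, base.get(ino))
            c = cleaner.get(ino, base.get(ino))
            if p is None:
                merged[ino] = c            # defensive: cleaner-only entry
            elif c is None or p[0] > c[0]:
                merged[ino] = p            # proxy wrote a newer inode version
            else:
                merged[ino] = c            # same version, possibly relocated
        return merged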