BlueSky Storage Design

At a basic level we use a fairly standard filesystem design: the filesystem consists of a collection of inodes which in turn point to data blocks.

There are three different levels at which data can be stored/committed:

1. In memory within the BlueSky proxy
2. Written to local disk at the proxy
3. Committed to cloud storage

Updates can be made at level 1 with very low overhead, and the hope is that consecutive updates are batched together at this point. Data is flushed from level 1 to level 2 by writing out updated data blocks and serializing new versions of inodes. In a log-structured design, data is committed from level 2 to level 3 by grouping dependent items together into log segments and flushing those segments to the cloud. Some data might be garbage-collected if it becomes dead before it is ever flushed to level 3.

Encryption is likely implemented only at level 3 (or possibly at level 2).

Level 1 is primarily an implementation detail of the BlueSky library and need not preserve any inter-version compatibility.

At level 2 we have a collection of objects with edges between them representing two types of dependencies: data dependencies (as between an inode and the data blocks it references) and ordering dependencies (for example, a directory should not be committed until any new inodes it references are also committed, though there is no data dependency, since the directory references only the inode number, not the inode itself).

Data blocks are unversioned: if we change the data in a file, we write a new data block and point the inode at the new block name. Inodes are versioned (since we can update an inode, but references to it in directory entries are not updated), so we need a mechanism for tracking the most recent version of each inode--an inode map. Another way of looking at this is that data block pointers can be dereferenced directly, but dereferencing an inode number requires a layer of indirection (the inode map).

Should objects have back references to track where pointers to them exist? One simple implementation would be to track which inode each data block belongs to, though this does not work well with snapshots or deduplication.

Level 3 consists of objects from level 2, aggregated together into log segments. There are a few complications:

- At level 3 we add in encryption, but:
  - We want to be able to fetch individual objects using range requests, so the encryption needs to be either per-object or allow decryption to start from mid-file.
  - Log cleaning ought to be able to run in the cloud, without the encryption key, so some data, such as inter-object references, must be outside the encryption wrapper.

A possible level-3 object format (a concrete sketch follows the INODE MAP BLOCK description below):

  UNPROTECTED
    List of referenced objects and their locations
  AUTHENTICATED
    Object identifier (does not change even if the segment is repacked)
    Object identifiers for referenced objects?
  ENCRYPTED
    Data payload (references are given as an index into the list in the unprotected section, so a cleaner can rearrange objects without decrypting)

Object Types/Formats

SUPERBLOCK
  Either stored separately from the log segments at a well-known location, or else name log segments in a well-known fashion and place the superblock at a known location within them. Contains pointers to the inode maps, and perhaps to old superblocks too if we don't want to rewrite all of this information each time.

INODE MAP BLOCK
  Lists the current location of each inode in the logs, for some range of the inode number space.
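To make the pieces above concrete, here is a minimal C sketch of the level-3 object regions and an inode map lookup. All names here (obj_location, obj_header, imap_entry, imap_lookup) and the field layout are illustrative assumptions for this document, not the actual BlueSky serialization; a real implementation would also carry a MAC over the authenticated region and a cipher wrapping for the payload.

    /*
     * Illustrative sketch only: struct names, fields, and sizes are
     * assumptions for exposition, not the actual BlueSky format.
     */
    #include <stdint.h>
    #include <stddef.h>

    /* Location of an object in the cloud logs: enough information to
     * fetch it with a single range request. */
    struct obj_location {
        uint64_t segment;   /* log segment number */
        uint32_t offset;    /* byte offset within the segment */
        uint32_t length;    /* object length in bytes */
    };

    /* A level-3 object, split into the three protection regions. */
    struct obj_header {
        /* UNPROTECTED: the cleaner may rewrite this freely. */
        uint32_t nrefs;
        struct obj_location *refs;  /* referenced objects/locations */

        /* AUTHENTICATED: integrity-protected by a MAC; the object ID
         * is stable even if the segment is repacked. */
        uint64_t object_id;

        /* ENCRYPTED: pointers inside the payload name other objects
         * by index into refs[], so the cleaner can move those objects
         * and update refs[] without decrypting the payload. */
        uint32_t payload_len;
        uint8_t *payload;
    };

    /* One inode map entry: current log location of an inode. */
    struct imap_entry {
        uint64_t inum;
        struct obj_location loc;
    };

    /* Dereferencing an inode number goes through this extra layer of
     * indirection; data block pointers are followed directly. */
    const struct obj_location *
    imap_lookup(const struct imap_entry *map, size_t n, uint64_t inum)
    {
        for (size_t i = 0; i < n; i++)  /* linear scan for clarity; a
                                         * real map block could be
                                         * sorted and binary-searched */
            if (map[i].inum == inum)
                return &map[i].loc;
        return NULL;  /* inode not covered by this map block */
    }

The key point the sketch captures is that refs[] lives outside the encryption wrapper, so a keyless cleaner can relocate referenced objects and rewrite only refs[].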
INODE
  Contains file metadata and pointers to data blocks. The metadata can be encrypted, but the cleaner process needs read/write access to the data pointers. In addition to the plaintext pointers there should be a way to validate that the pointed-to data is correct: either a hash of the data block pointed to, or an ID stored with the data block (where the ID does not change even if the data block is relocated to another log segment).

DATA BLOCK
  Encrypted file data, but includes a plaintext back reference to the inode using this block. (The back reference is simply the inode number, and possibly, though it is not needed, the offset within the file.)

Determining Live Data

Checking whether an inode is live can be done by comparing it against the current inode map. To check whether data is live, each data block lists the inode it belongs to; the data is live if the most current version of that inode still points to this block. These back references are also used when data is relocated during cleaning. This does mean, however, that each block can be used in only one location (no deduplication support), unless we add some other mechanism for tracking back references (there is one bit of related work that might apply, but it is not worth implementing now). A sketch of this liveness check appears after the On-line Cleaning section below.

On-line Cleaning

On-line cleaning is a specialized case of handling concurrent writers to the same filesystem, with a few twists. The cleaning process should be able to run in EC2 without the filesystem encryption key, so pointers between objects must not be encrypted.

The proxy editing the filesystem and the cleaner may run in parallel, each writing to a separate log head. One process or the other must then merge any divergent changes together. This should be easy in this one specific case, though, since only the proxy is actually changing data; the cleaner is only rearranging objects in the logs (and updating pointers), so there should not be conflicts that cannot be handled automatically.
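As a minimal sketch of the liveness rules above (assuming hypothetical names block_ref and inode_summary, and plaintext access to the inode's data pointers as discussed under INODE; this is not actual BlueSky code):

    /*
     * Illustrative sketch only: block_ref and inode_summary are
     * assumed names for the plaintext metadata described above.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Plaintext back reference stored alongside each data block. */
    struct block_ref {
        uint64_t object_id;  /* stable ID; unchanged when the block is
                              * relocated to another log segment */
        uint64_t inum;       /* the inode this block belongs to */
    };

    /* The plaintext portion of an inode that the cleaner can read:
     * its version and the object IDs of its data blocks. */
    struct inode_summary {
        uint64_t inum;
        uint64_t version;
        size_t nblocks;
        const uint64_t *block_ids;
    };

    /* An inode version is live iff the inode map still lists it as
     * the current one. */
    bool inode_is_live(const struct inode_summary *ino,
                       uint64_t version_in_imap)
    {
        return ino->version == version_in_imap;
    }

    /* A data block is live iff the current version of the inode named
     * by its back reference still points at the block's object ID.
     * Dead blocks are simply not copied into the cleaner's output
     * segment. */
    bool block_is_live(const struct block_ref *blk,
                       const struct inode_summary *current_inode)
    {
        if (current_inode == NULL || current_inode->inum != blk->inum)
            return false;
        for (size_t i = 0; i < current_inode->nblocks; i++)
            if (current_inode->block_ids[i] == blk->object_id)
                return true;
        return false;
    }

Since the test touches only the inode map, the inode's plaintext block pointers, and each block's plaintext back reference, the cleaner can run it in EC2 without the filesystem encryption key. The single back reference per block is also what rules out deduplication, as noted above.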