bluesky/DESIGN

   1                          BlueSky Storage Design
   2
   3 At a basic level we use a fairly standard filesystem design: the
   4 filesystem consists of a collection of inodes which in turn point to
   5 data blocks.
   6
   7 There are three different levels at which data can be stored/committed:
   8   1. In memory within the BlueSky proxy
   9   2. Written to local disk at the proxy
  10   3. Committed to cloud storage
  11
  12 Updates can be made at level 1 with very low overhead, and the hope is
  13 that consecutive updates are batched together at this point.  Data is
  14 flushed from 1 to 2 by writing out updated data blocks and serializing
  15 new versions of inodes.  In a log-structured design, data is commited
  16 from level 2 to 3 by grouping together dependent items into log segments
  17 and flushing those log segments to the cloud.  Some data might be
  18 garbage-collected if it becomes dead before it is flushed to level 3.
  19 Encryption is likely only implemented at level 3 (or possibly at level
  20 2).
  21
  22 Level 1 is primarily an implementation detail of the BlueSky library and
  23 need not preseve any inter-version compatibility.
  24
  25 For level 2 we have a collection of objects with edges between them
  26 representing two types of dependencies: data dependencies (as between an
  27 inode and the data blocks it references), and ordering dependencies (for
  28 example, a directory should not be committed until any new inodes it
  29 references are also committed, though there is not a data dependency
  30 since the directory only references the inode number not the inode
  31 itself).  Data blocks are unversioned (if we change the data in a file,
  32 we write a new data block and point the inode at the new block name).
  33 Inodes are versioned (since we can update an inode, but references to it
  34 in directory entries are not updated) and we need a mechanism to keep
  35 track of the most recent version of an inode--an inode map.  Another way
  36 of looking at this is that data block pointers can be dereferenced
  37 directly, but dereferencing an inode number requires a layer of
  38 indirection (the inode map).
  39
  40 Should objects have back references to track where pointers to them
  41 exist?  One simple implementation would be to track which inode each
  42 data block belongs to, though this doesn't work well with snapshots or
  43 deduplication.
  44
  45 Level 3 consists of objects from level 2, aggregated together into log
  46 segments.  There are a few complications:
  47   - At level 3 we add in encryption, but:
  48   - We want to be able to fetch individual objects using range requests,
  49     so the encryption needs to be either per-object or allow decryption
  50     to start from mid-file.
  51   - Log cleaning ought to be able to run in the cloud, without the
  52     encryption key, so some data such as inter-object references must be
  53     outside the encryption wrapper.
  54
  55 A possible level-3 object format:
  56   UNPROTECTED
  57     List of referenced objects and locations
  58   AUTHENTICATED
  59     Object identifier (does not change even if segment is repacked)
  60     Object identifiers for referenced objects?
  61   ENCRYPTED
  62     Data payload
  63       (references are given as an index into the list in the unprotected
  64       section, so a cleaner can rearrange objects without decrypting)
  65
  66
  67 Object Types/Formats
  68   SUPERBLOCK
  69     Either stored separately from log segments at a well-known location,
  70     or have log-segments named in a well-known fashion and place
  71     superblock at a known location in the log segments.
  72
  73     Contains pointers to inode maps, and perhaps to old superblocks too
  74     if we don't want to rewrite all this information each time.
  75
  76   INODE MAP BLOCK
  77     Lists the current location of each inode in the logs, for some range
  78     of the inode number space.
  79
  80   INODE
  81     Contains file metadata and pointers to data blocks.  The metadata
  82     can be encrypted, but cleaner process needs read/write access to the
  83     data pointers.
  84
  85     In addition to the plaintext pointers there should be a way to
  86     validate that the pointed-to data is correct.  This could either be
  87     a hash of the data block pointed to, or an ID stored with the data
  88     block (where the ID does not change even if the data block is
  89     relocated to another log segment).
  90
  91   DATA BLOCK
  92     Encrypted file data, but includes a back reference the inode using
  93     this block as plaintext.  (The back reference is simply the inode
  94     number, and possibly, though it is not needed, the offset within the
  95     file.)
  96
  97 Determining live data:
  98     Checking whether an inode is live can be done by comparing against
  99     the current inode map.
 100
 101     To check whether data is live, each data block lists the inode it
 102     belongs to.  The data is live if the most current version of the
 103     inode still points to this block.  These back references are also
 104     used when data is relocated during cleaning.  This does mean,
 105     however, that each block can only be used in one location (no
 106     de-duplication support), unless we add some other mechanism for
 107     tracking back-references (there is one bit of related work that
 108     might apply, but it's not worth implementing now).
 109
 110 On-line Cleaning
 111     Online cleaning is a specialized case of handling concurrent writers
 112     to the same filesystem, with a few twists.
 113
 114     The cleaning process should be able to run in EC2 without the
 115     filesystem encryption key, so pointers between objects must not be
 116     encrypted.
 117
 118     The proxy editing the filesystem and the cleaner may run in
 119     parallel, each writing to a separate log head.  One process or the
 120     other must then merge any divergent changes together.  This should
 121     be easy to do in this one specific case, though, since on the proxy
 122     is actually changing data and the cleaner is only rearranging
 123     objects in the logs (and updating pointers)--thus there shouldn't be
 124     conflicts that can't be automatically handled.