--- /dev/null
+ BlueSky Storage Design
+
+At a basic level we use a fairly standard filesystem design: the
+filesystem consists of a collection of inodes which in turn point to
+data blocks.
+
+There are three different levels at which data can be stored/committed:
+ 1. In memory within the BlueSky proxy
+ 2. Written to local disk at the proxy
+ 3. Committed to cloud storage
+
+Updates can be made at level 1 with very low overhead, and the hope is
+that consecutive updates are batched together at this point. Data is
+flushed from 1 to 2 by writing out updated data blocks and serializing
+new versions of inodes. In a log-structured design, data is committed
+from level 2 to 3 by grouping together dependent items into log segments
+and flushing those log segments to the cloud. Some data might be
+garbage-collected if it becomes dead before it is flushed to level 3.
+Encryption is likely only implemented at level 3 (or possibly at level
+2).
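+
+As a rough sketch of the write path (all names here are hypothetical,
+not the actual BlueSky interfaces), the proxy might stage dirty objects
+in memory, journal them to local disk, and later pack them into log
+segments for upload:
+
+    # Minimal sketch of the three commit levels; names are illustrative.
+    class Proxy:
+        def __init__(self):
+            self.memory = {}          # level 1: dirty objects in RAM
+            self.local_log = []       # level 2: journaled to local disk
+            self.cloud_segments = []  # level 3: log segments in the cloud
+
+        def update(self, name, data):
+            # Level 1: consecutive updates to the same object overwrite
+            # each other in memory, so only the latest version is flushed.
+            self.memory[name] = data
+
+        def flush_to_disk(self):
+            # Level 2: write out updated objects to the local log.
+            self.local_log.extend(self.memory.items())
+            self.memory.clear()
+
+        def commit_to_cloud(self, segment_size=4):
+            # Level 3: group items into log segments and upload them.
+            while len(self.local_log) >= segment_size:
+                segment = self.local_log[:segment_size]
+                del self.local_log[:segment_size]
+                self.cloud_segments.append(segment)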
+
+Level 1 is primarily an implementation detail of the BlueSky library and
+need not preserve any inter-version compatibility.
+
+For level 2 we have a collection of objects with edges between them
+representing two types of dependencies: data dependencies (as between an
+inode and the data blocks it references), and ordering dependencies (for
+example, a directory should not be committed until any new inodes it
+references are also committed, though there is not a data dependency
+since the directory only references the inode number not the inode
+itself). Data blocks are unversioned (if we change the data in a file,
+we write a new data block and point the inode at the new block name).
+Inodes are versioned (an inode can be updated, but the references to it
+in directory entries are not rewritten), so we need a mechanism to keep
+track of the most recent version of each inode--an inode map. Another way
+of looking at this is that data block pointers can be dereferenced
+directly, but dereferencing an inode number requires a layer of
+indirection (the inode map).
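+
+A toy illustration of the two lookup paths (the names are invented for
+this sketch): a block pointer names its target directly, while an inode
+number must first be resolved through the inode map:
+
+    # Block pointers dereference directly; inode numbers go via the map.
+    blocks = {"blk7": b"file contents"}   # unversioned data blocks
+    inode_map = {42: "inode42.v3"}        # inode number -> latest version
+    inodes = {"inode42.v3": {"block_ptrs": ["blk7"]}}
+
+    def read_inode(inum):
+        return inodes[inode_map[inum]]    # one layer of indirection
+
+    def read_block(ptr):
+        return blocks[ptr]                # direct dereference
+
+    assert read_block(read_inode(42)["block_ptrs"][0]) == b"file contents"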
+
+Should objects have back references to track where pointers to them
+exist? One simple implementation would be to track which inode each
+data block belongs to, though this doesn't work well with snapshots or
+deduplication.
+
+Level 3 consists of objects from level 2, aggregated together into log
+segments. There are a few complications:
+ - At level 3 we add in encryption, but:
+  - We want to be able to fetch individual objects using range requests,
+    so the encryption either needs to be per-object or must allow
+    decryption to start from mid-file.
+ - Log cleaning ought to be able to run in the cloud, without the
+ encryption key, so some data such as inter-object references must be
+ outside the encryption wrapper.
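+
+One way to satisfy both constraints is to encrypt each object
+independently, for example with AES-GCM and a fresh nonce per object, so
+that any single object can be fetched with a range request and decrypted
+on its own. A sketch assuming the Python "cryptography" package; the
+framing here is invented for illustration:
+
+    import os
+    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
+
+    key = AESGCM.generate_key(bit_length=256)
+
+    def encrypt_object(payload, plaintext_header):
+        # A per-object nonce lets each object decrypt independently; the
+        # header is authenticated but stays readable by the cleaner.
+        nonce = os.urandom(12)
+        return nonce + AESGCM(key).encrypt(nonce, payload, plaintext_header)
+
+    def decrypt_object(blob, plaintext_header):
+        nonce, ct = blob[:12], blob[12:]
+        return AESGCM(key).decrypt(nonce, ct, plaintext_header)
+
+    # The bytes returned by a range request for one object suffice:
+    blob = encrypt_object(b"data payload", b"object-id-1")
+    assert decrypt_object(blob, b"object-id-1") == b"data payload"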
+
+A possible level-3 object format:
+ UNPROTECTED
+ List of referenced objects and locations
+ AUTHENTICATED
+ Object identifier (does not change even if segment is repacked)
+ Object identifiers for referenced objects?
+ ENCRYPTED
+ Data payload
+ (references are given as an index into the list in the unprotected
+ section, so a cleaner can rearrange objects without decrypting)
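+
+A sketch of how such an object might be serialized (the length-prefixed
+framing and the use of JSON are invented for illustration): the
+reference list sits outside the encryption wrapper, and the encrypted
+payload names other objects only by index into that list, so a cleaner
+can rewrite locations without the key:
+
+    import json, struct
+
+    def pack_object(ref_locations, object_id, encrypted_payload):
+        # UNPROTECTED: referenced objects and their current locations;
+        # the cleaner rewrites this section when it relocates objects.
+        unprotected = json.dumps(ref_locations).encode()
+        # AUTHENTICATED: stable object identifier (survives repacking).
+        authenticated = object_id.encode()
+        # ENCRYPTED: refers to other objects by index into the
+        # unprotected list, so moving them never touches the ciphertext.
+        return b"".join(struct.pack(">I", len(part)) + part for part in
+                        (unprotected, authenticated, encrypted_payload))
+
+    def unpack_object(buf):
+        parts, off = [], 0
+        for _ in range(3):
+            (n,) = struct.unpack_from(">I", buf, off)
+            parts.append(buf[off + 4:off + 4 + n])
+            off += 4 + n
+        return parts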
+
+
+Object Types/Formats
+ SUPERBLOCK
+    The superblock is either stored separately from the log segments at
+    a well-known location, or log segments are named in a well-known
+    fashion and the superblock is placed at a known location within
+    them.
+
+ Contains pointers to inode maps, and perhaps to old superblocks too
+ if we don't want to rewrite all this information each time.
+
+ INODE MAP BLOCK
+ Lists the current location of each inode in the logs, for some range
+ of the inode number space.
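+
+    A sketch of how the superblock and these map blocks might fit
+    together, partitioning the inode number space into fixed ranges
+    (the range size here is arbitrary):
+
+        RANGE = 1024  # inode numbers covered per map block
+
+        class InodeMapBlock:
+            def __init__(self, start):
+                self.start = start   # first inode number in this range
+                self.locations = {}  # inode number -> location in logs
+
+        class Superblock:
+            def __init__(self):
+                self.map_blocks = {}  # range start -> map block location
+
+            def map_block_for(self, inum):
+                return self.map_blocks[inum - inum % RANGE]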
+
+ INODE
+ Contains file metadata and pointers to data blocks. The metadata
+ can be encrypted, but cleaner process needs read/write access to the
+ data pointers.
+
+ In addition to the plaintext pointers there should be a way to
+ validate that the pointed-to data is correct. This could either be
+ a hash of the data block pointed to, or an ID stored with the data
+ block (where the ID does not change even if the data block is
+ relocated to another log segment).
+
+ DATA BLOCK
+    Encrypted file data, but includes a plaintext back reference to the
+    inode using this block. (The back reference is simply the inode
+    number, and possibly, though it is not needed, the offset within
+    the file.)
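+
+A combined sketch of the INODE and DATA BLOCK layouts above; the field
+choices are illustrative, not a fixed format:
+
+    import hashlib
+    from dataclasses import dataclass, field
+
+    @dataclass
+    class BlockPointer:
+        location: str     # plaintext: the cleaner may rewrite this
+        checksum: bytes   # hash used to validate the pointed-to data
+
+    @dataclass
+    class Inode:
+        inum: int
+        metadata: bytes   # may be encrypted
+        block_ptrs: list = field(default_factory=list)  # plaintext
+
+    @dataclass
+    class DataBlock:
+        inode_backref: int  # plaintext back reference: owning inode
+        payload: bytes      # encrypted file data
+
+    def make_pointer(location, block):
+        # Validate relocated data by hashing the stored block.
+        return BlockPointer(location, hashlib.sha256(block.payload).digest())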
+
+Determining live data:
+ Checking whether an inode is live can be done by comparing against
+ the current inode map.
+
+ To check whether data is live, each data block lists the inode it
+ belongs to. The data is live if the most current version of the
+ inode still points to this block. These back references are also
+ used when data is relocated during cleaning. This does mean,
+ however, that each block can only be used in one location (no
+ de-duplication support), unless we add some other mechanism for
+ tracking back-references (there is one bit of related work that
+ might apply, but it's not worth implementing now).
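+
+    A sketch of both liveness checks, using the structures sketched
+    above (the helpers are hypothetical):
+
+        def inode_is_live(inum, version_location, inode_map):
+            # An inode version is live iff the map still points at it.
+            return inode_map.get(inum) == version_location
+
+        def block_is_live(block_location, backref_inum, inode_map,
+                          load_inode):
+            # Follow the back reference to the current inode version and
+            # check whether it still points at this block.
+            current = load_inode(inode_map[backref_inum])
+            return any(p.location == block_location
+                       for p in current.block_ptrs)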
+
+On-line Cleaning
+ Online cleaning is a specialized case of handling concurrent writers
+ to the same filesystem, with a few twists.
+
+ The cleaning process should be able to run in EC2 without the
+ filesystem encryption key, so pointers between objects must not be
+ encrypted.
+
+    The proxy editing the filesystem and the cleaner may run in
+    parallel, each writing to a separate log head. One process or the
+    other must then merge any divergent changes together. This should
+    be easy to do in this one specific case, though, since only the
+    proxy is actually changing data while the cleaner merely rearranges
+    objects in the logs (and updates pointers)--thus there shouldn't be
+    conflicts that can't be handled automatically.
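+
+    Under those assumptions the merge rule is simple (a sketch over
+    hypothetical inode-map snapshots): entries rewritten by the proxy
+    reflect real data changes and win outright, while the cleaner's
+    entries merely record new locations for unchanged inodes:
+
+        def merge_log_heads(base_map, proxy_map, cleaner_map):
+            # proxy_map: inodes whose data the proxy changed
+            # cleaner_map: inodes the cleaner merely relocated
+            merged = dict(base_map)
+            merged.update(cleaner_map)  # adopt relocations
+            merged.update(proxy_map)    # proxy changes win conflicts
+            return merged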