At a basic level we use a fairly standard filesystem design: the
filesystem consists of a collection of inodes which in turn point to
data blocks.

There are three different levels at which data can be stored/committed:
1. In memory within the BlueSky proxy
2. Written to local disk at the proxy
3. Committed to cloud storage

Updates can be made at level 1 with very low overhead, and the hope is
that consecutive updates are batched together at this point. Data is
flushed from 1 to 2 by writing out updated data blocks and serializing
new versions of inodes. In a log-structured design, data is committed
from level 2 to 3 by grouping together dependent items into log segments
and flushing those log segments to the cloud. Some data might be
garbage-collected if it becomes dead before it is flushed to level 3.
Encryption is likely only implemented at level 3 (or possibly at level
2 as well).
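
As a concrete sketch, each dirty object in the proxy could carry a tag
recording how far it has been committed; the names below are
illustrative, not taken from the BlueSky code:

    /* Hypothetical: how far an object has been committed. */
    typedef enum {
        COMMIT_MEMORY,  /* level 1: dirty in proxy memory only */
        COMMIT_DISK,    /* level 2: serialized to the proxy's local disk */
        COMMIT_CLOUD,   /* level 3: packed into a log segment, uploaded */
    } commit_level;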

Level 1 is primarily an implementation detail of the BlueSky library and
need not preserve any inter-version compatibility.

For level 2 we have a collection of objects with edges between them
representing two types of dependencies: data dependencies (as between an
inode and the data blocks it references), and ordering dependencies (for
example, a directory should not be committed until any new inodes it
references are also committed, though there is not a data dependency
since the directory only references the inode number, not the inode
itself). Data blocks are unversioned (if we change the data in a file,
we write a new data block and point the inode at the new block name).
Inodes are versioned (since we can update an inode, but references to it
in directory entries are not updated) and we need a mechanism to keep
track of the most recent version of an inode--an inode map. Another way
of looking at this is that data block pointers can be dereferenced
directly, but dereferencing an inode number requires a layer of
indirection (the inode map).
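
A minimal sketch of that indirection, with assumed names (struct
location, inode_lookup, and the flat array layout are all made up for
illustration):

    #include <stdint.h>

    struct location { uint64_t segment, offset; };  /* place in the logs */

    /* Maps a range of inode numbers to the latest version of each. */
    struct inode_map {
        uint64_t first_inum;
        uint64_t count;
        struct location *slots;   /* slots[i] covers inode first_inum + i */
    };

    /* Data block pointers are locations and dereference directly; an
     * inode number takes one extra hop through the map. */
    struct location inode_lookup(const struct inode_map *map, uint64_t inum)
    {
        return map->slots[inum - map->first_inum];
    }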

Should objects have back references to track where pointers to them
exist? One simple implementation would be to track which inode each
data block belongs to, though this doesn't work well with snapshots or
de-duplication.

Level 3 consists of objects from level 2, aggregated together into log
segments. There are a few complications:
- At level 3 we add in encryption, but:
  - We want to be able to fetch individual objects using range requests,
    so the encryption needs to be either per-object or allow decryption
    to start from mid-file (see the sketch below).
  - Log cleaning ought to be able to run in the cloud, without the
    encryption key, so some data such as inter-object references must be
    outside the encryption wrapper.
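
One way to satisfy the range-request constraint is to encrypt each
object independently, with a per-object IV in the clear; this is only a
sketch of the framing, not a committed format:

    #include <stdint.h>

    /* Hypothetical framing: each object is encrypted on its own, so a
     * range request that fetches one object can be decrypted without
     * the rest of the segment. */
    struct encrypted_object {
        uint8_t iv[12];        /* cleartext, unique per object */
        uint8_t tag[16];       /* authentication tag (e.g. AES-GCM) */
        uint8_t ciphertext[];  /* the encrypted object body */
    };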

A possible level-3 object format:
  Unprotected header:
    List of referenced objects and locations
    Object identifier (does not change even if segment is repacked)
    Object identifiers for referenced objects?
  Encrypted payload:
    Object data, including any pointers to other objects
    (references are given as an index into the list in the unprotected
    section, so a cleaner can rearrange objects without decrypting)
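
A rough C rendering of that layout, with assumed field names and sizes:

    #include <stdint.h>

    struct object_ref {             /* entry in the unprotected list */
        uint8_t  object_id[16];     /* stable id of the referenced object */
        uint64_t segment, offset;   /* its current location in the logs */
    };

    struct object_header {          /* unprotected header */
        uint8_t  object_id[16];     /* survives repacking by the cleaner */
        uint32_t num_refs;
        struct object_ref refs[];   /* rewritten when targets move */
    };
    /* The encrypted payload follows the header; pointers inside it are
     * stored as indexes into refs[], so the cleaner can relocate targets
     * by editing refs[] alone, without the key. */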

Superblock:
  Either stored separately from log segments at a well-known location,
  or have log segments named in a well-known fashion and place the
  superblock at a known location in the log segments.

  Contains pointers to inode maps, and perhaps to old superblocks too
  if we don't want to rewrite all this information each time.

Inode map:
  Lists the current location of each inode in the logs, for some range
  of the inode number space.

Inode:
  Contains file metadata and pointers to data blocks. The metadata
  can be encrypted, but the cleaner process needs read/write access to
  the block pointers, so those must stay outside the encryption.

  In addition to the plaintext pointers there should be a way to
  validate that the pointed-to data is correct. This could either be
  a hash of the data block pointed to, or an ID stored with the data
  block (where the ID does not change even if the data block is
  relocated to another log segment).

Data block:
  Encrypted file data, but includes a back reference to the inode using
  this block as plaintext. (The back reference is simply the inode
  number, and possibly, though it is not needed, the offset within the
  file.)
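
Putting the inode and data-block details into a sketch (the hash choice
and field names are assumptions):

    #include <stdint.h>

    struct block_pointer {          /* plaintext, rewritable by cleaner */
        uint64_t segment, offset;   /* where the block currently lives */
        uint8_t  hash[32];          /* e.g. SHA-256 of the block, so a
                                       client can validate relocated data */
    };

    struct data_block {
        uint64_t inode_num;         /* plaintext back reference */
        /* encrypted file data follows */
    };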

Determining live data:
  Checking whether an inode is live can be done by comparing against
  the current inode map.

  To check whether data is live, each data block lists the inode it
  belongs to. The data is live if the most current version of the
  inode still points to this block. These back references are also
  used when data is relocated during cleaning. This does mean,
  however, that each block can only be used in one location (no
  de-duplication support), unless we add some other mechanism for
  tracking back references (there is one bit of related work that
  might apply, but it's not worth implementing now).
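
A sketch of the checks the cleaner would run, reusing the hypothetical
types from the sketches above (struct inode and its fields are likewise
assumed names):

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed in-memory form of an inode: an array of block pointers. */
    struct inode {
        uint32_t num_blocks;
        struct block_pointer *blocks;
    };

    /* An inode is live iff the inode map still names this copy. */
    bool inode_is_live(const struct inode_map *map, uint64_t inum,
                       uint64_t segment, uint64_t offset)
    {
        struct location cur = inode_lookup(map, inum);
        return cur.segment == segment && cur.offset == offset;
    }

    /* A data block is live iff the newest version of its owning inode
     * (found via the back reference) still points at this location. */
    bool block_is_live(const struct inode *ino,
                       uint64_t segment, uint64_t offset)
    {
        for (uint32_t i = 0; i < ino->num_blocks; i++)
            if (ino->blocks[i].segment == segment &&
                ino->blocks[i].offset == offset)
                return true;
        return false;
    }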

Cleaning:
  Online cleaning is a specialized case of handling concurrent writers
  to the same filesystem, with a few twists.

  The cleaning process should be able to run in EC2 without the
  filesystem encryption key, so pointers between objects must not be
  encrypted.

  The proxy editing the filesystem and the cleaner may run in
  parallel, each writing to a separate log head. One process or the
  other must then merge any divergent changes together. This should
  be easy to do in this one specific case, though, since only the
  proxy is actually changing data and the cleaner is only rearranging
  objects in the logs (and updating pointers)--thus there shouldn't be
  conflicts that can't be automatically handled.
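
Under those assumptions the merge rule is mechanical; a hypothetical
sketch (all names illustrative):

    #include <stdint.h>

    /* When the two log heads diverge, keep the proxy's copy of any
     * object it rewrote; otherwise adopt the cleaner's relocated copy,
     * since the cleaner never changes contents. */
    struct object_version {
        uint64_t change_seq;        /* bumped when contents change */
        uint64_t segment, offset;   /* where this copy lives */
    };

    struct object_version merge_versions(struct object_version base,
                                         struct object_version proxy,
                                         struct object_version cleaner)
    {
        if (proxy.change_seq != base.change_seq)
            return proxy;           /* proxy modified the object */
        return cleaner;             /* cleaner only moved it */
    }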