Michael Vrable [Mon, 18 Oct 2010 20:03:15 +0000 (13:03 -0700)]
Small utility to use free memory in a system.
For benchmarking, to take memory away from the page cache.
Michael Vrable [Mon, 18 Oct 2010 04:40:44 +0000 (21:40 -0700)]
Support encrypted log items in the cleaner.
Michael Vrable [Sun, 17 Oct 2010 23:18:23 +0000 (16:18 -0700)]
Delete dead code.
Michael Vrable [Sun, 17 Oct 2010 23:17:40 +0000 (16:17 -0700)]
When decrypting a log item also clear out the IV field.
Not really needed, but this way the IV field being zero should be
synonymous with an unencrypted log item.
Michael Vrable [Sun, 17 Oct 2010 20:40:20 +0000 (13:40 -0700)]
Add per-item encryption/authentication to the cloud log storage.
We should generate encrypted data and decrypt it again on read, but we
don't yet enforce only reading data which passes the integrity check.
Michael Vrable [Fri, 15 Oct 2010 18:01:45 +0000 (11:01 -0700)]
Allow S3 bucket used for BlueSky storage to be specified.
Michael Vrable [Mon, 11 Oct 2010 18:35:17 +0000 (11:35 -0700)]
Work on a simple workload generator for benchmarking.
Michael Vrable [Fri, 8 Oct 2010 23:57:35 +0000 (16:57 -0700)]
Start adding in selective encryption of cloud log items.
Not fully hooked in, but some of the logic for encryption is written now.
Michael Vrable [Mon, 27 Sep 2010 23:30:13 +0000 (16:30 -0700)]
Another logging fix.
Michael Vrable [Mon, 27 Sep 2010 20:28:57 +0000 (13:28 -0700)]
Fix for journal committing.
Previously, under load, we could sometimes report that journal items were
committed when they were not. Try to track the uncommitted state more
carefully now.
Michael Vrable [Mon, 27 Sep 2010 17:57:59 +0000 (10:57 -0700)]
Implement handling of unstable data in WRITE/COMMIT NFS procedures.
Previously, all NFS operations were synchronous; now we support asynchronous
commits of file writes, which might improve performance.
Michael Vrable [Mon, 27 Sep 2010 05:55:19 +0000 (22:55 -0700)]
Updated microbenchmarking script.
Michael Vrable [Wed, 22 Sep 2010 20:52:28 +0000 (13:52 -0700)]
Starting work on scripts to automate benchmarking.
Michael Vrable [Wed, 22 Sep 2010 18:47:17 +0000 (11:47 -0700)]
Improve cleaner performance.
When reading an object in, seek to and read just the needed bytes instead
of the entire log segment. Improves performance significantly.
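As a rough illustration of the change described above (not the cleaner's actual code), reading one object from a segment can be sketched as a seek followed by a bounded read. The `read_item` helper, the temporary file, and the item contents are hypothetical stand-ins.

```python
import os
import tempfile

def read_item(path, offset, length):
    """Read `length` bytes starting at `offset` without loading the whole file."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    if len(data) != length:
        raise EOFError("segment is shorter than the requested range")
    return data

# Build a small stand-in "segment" holding three concatenated items.
items = [b"inode-7", b"data-block-42", b"inode-9"]
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"".join(items))
    segment_path = tmp.name

# Fetch only the second item; the rest of the segment is never read.
offset = len(items[0])
fetched = read_item(segment_path, offset, len(items[1]))
os.remove(segment_path)
```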
Michael Vrable [Mon, 20 Sep 2010 22:09:12 +0000 (15:09 -0700)]
Remove obsolete file.
Michael Vrable [Mon, 20 Sep 2010 15:56:01 +0000 (08:56 -0700)]
Use a thread pool for inode fetches, and remove some debugging output.
Michael Vrable [Mon, 20 Sep 2010 15:55:03 +0000 (08:55 -0700)]
Remove an extraneous mutex unlock.
I'm surprised that this didn't cause trouble earlier; it seems that
unlocking an unlocked mutex raises no errors (but under heavy load, when
the mutex is locked by another thread, unlocking it can cause trouble).
Michael Vrable [Mon, 20 Sep 2010 03:07:27 +0000 (20:07 -0700)]
More fixes for memory management.
This should allow memory and cache space to be reclaimed by not keeping
items pinned in memory, finally. Still needs a bit more testing.
Michael Vrable [Sun, 19 Sep 2010 22:22:02 +0000 (15:22 -0700)]
Work on reducing memory pinned by the inode map.
Michael Vrable [Sun, 19 Sep 2010 18:31:15 +0000 (11:31 -0700)]
Add cleaner option to rewrite and compact all inodes (but not all data).
Michael Vrable [Sun, 19 Sep 2010 04:22:15 +0000 (21:22 -0700)]
Allow cloudlog items to be unreferenced in the background.
This is to avoid certain deadlocks, when we don't care if resources are
reclaimed immediately or not.
Michael Vrable [Sun, 19 Sep 2010 00:44:07 +0000 (17:44 -0700)]
Do not hold references to all inode data in inode map.
The inode map should not hold full references to all inode objects and all
corresponding data, since that will lock all such data in memory or the
disk cache. Do some initial work towards just holding weak references to
the data so that it can be expired from the cache.
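The weak-reference idea can be sketched in Python with the standard `weakref` module. This is only an analogy: the `Inode` class and the map layout are assumptions made for illustration, and the real inode map is not implemented this way.

```python
import gc
import weakref

class Inode:
    """Hypothetical stand-in for an in-memory inode object."""
    def __init__(self, inum):
        self.inum = inum

# Map inode number -> inode without keeping the inode alive.
inode_map = weakref.WeakValueDictionary()

inode = Inode(12)
inode_map[12] = inode
present_while_referenced = 12 in inode_map

# Drop the last strong reference; the map entry disappears with the object.
del inode
gc.collect()
present_after_release = 12 in inode_map
```

A map holding strong references would keep entry 12 alive indefinitely, which is the pinning behavior the commit is moving away from.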
Michael Vrable [Wed, 15 Sep 2010 20:29:42 +0000 (13:29 -0700)]
Restart journal sequence numbering properly.
Handle even the case where some old journal files have been deleted.
Michael Vrable [Tue, 14 Sep 2010 21:49:39 +0000 (14:49 -0700)]
Add in header fields for per-object encryption/authentication.
These aren't yet used.
Michael Vrable [Sat, 11 Sep 2010 00:11:05 +0000 (17:11 -0700)]
Add very basic caching to the cleaner S3 backend.
Michael Vrable [Fri, 10 Sep 2010 22:58:28 +0000 (15:58 -0700)]
Add S3 backend for the cleaner.
It does not yet cache files so performance is poor.
Michael Vrable [Fri, 10 Sep 2010 22:58:09 +0000 (15:58 -0700)]
Properly set the starting inode number for allocation after restarting.
Michael Vrable [Fri, 10 Sep 2010 20:53:00 +0000 (13:53 -0700)]
Fix for S3 list operation.
Michael Vrable [Fri, 10 Sep 2010 20:30:52 +0000 (13:30 -0700)]
Implement a list operation for the S3 storage backend.
Michael Vrable [Fri, 10 Sep 2010 00:59:03 +0000 (17:59 -0700)]
Drop encryption from the cloud storage backend.
Encryption should properly be provided at another layer, so in preparation
for that remove it from the storage layer.
Michael Vrable [Thu, 9 Sep 2010 20:30:39 +0000 (13:30 -0700)]
Allow cleaner to delete unused log segments.
It will not delete segments cleaned out in the current pass, just those
that were unreferenced at the start of the cleaning process.
Michael Vrable [Thu, 9 Sep 2010 19:09:52 +0000 (12:09 -0700)]
Improve segment cleaning.
Michael Vrable [Thu, 9 Sep 2010 06:44:49 +0000 (23:44 -0700)]
Improve cleaner: make sure new logs are written after existing ones.
Michael Vrable [Thu, 9 Sep 2010 05:07:17 +0000 (22:07 -0700)]
Extend cleaner with a simple policy for choosing segments to clean.
Michael Vrable [Thu, 9 Sep 2010 04:03:31 +0000 (21:03 -0700)]
Updates to the Python cleaner prototype.
This can now read in the old inode maps, rewrite inode data, and write out
an updated inode map/checkpoint.
Michael Vrable [Wed, 8 Sep 2010 20:24:42 +0000 (13:24 -0700)]
Begin work on a segment cleaner prototype.
Right now this can rebuild an inode map and compute segment utilization,
though it isn't very efficient.
Michael Vrable [Tue, 7 Sep 2010 21:10:50 +0000 (14:10 -0700)]
Add partial journal replay to filesystem recovery.
After loading a cloud checkpoint, replay just the last portion of the
journal that may not have been committed to the cloud.
Michael Vrable [Tue, 7 Sep 2010 20:28:42 +0000 (13:28 -0700)]
Include inode numbers in cloud log items.
Michael Vrable [Tue, 7 Sep 2010 20:26:10 +0000 (13:26 -0700)]
Finish up loading of checkpoints from cloud logs.
This is now working, at least with a very minimal test.
Michael Vrable [Tue, 7 Sep 2010 04:49:07 +0000 (21:49 -0700)]
In-progress work to implement inode map loading at server start.
This will be used to restore filesystem state from the cloud when the
program starts up again. It still needs more optimization, needs journal
replay to be run afterwards, and bugfixing.
Michael Vrable [Mon, 6 Sep 2010 05:27:34 +0000 (22:27 -0700)]
Improve object deserialization: properly parse object headers.
Michael Vrable [Thu, 2 Sep 2010 22:53:07 +0000 (15:53 -0700)]
Start writing out inode maps to cloud storage.
Michael Vrable [Wed, 1 Sep 2010 18:15:29 +0000 (11:15 -0700)]
Fixes for journal replay, and drop the "superblock" cloud item.
That file in the cloud wasn't storing useful information any longer and
wasn't being used much; drop it. Its functionality will be replaced with
some form of commit log in the cloud journal.
Michael Vrable [Tue, 31 Aug 2010 22:00:09 +0000 (15:00 -0700)]
Fix some resource leaks in journal replay.
Michael Vrable [Tue, 31 Aug 2010 21:06:53 +0000 (14:06 -0700)]
Implement basic full log replay.
This still needs some checking over for bugs and minor fixes. It replays
the entire journal from the start to rebuild filesystem state. Still
needed: partial journal replay, starting from a checkpoint in the cloud.
Michael Vrable [Tue, 31 Aug 2010 20:02:38 +0000 (13:02 -0700)]
Update CRC-32 implementation.
Invert the result of the CRC computation at the end. This will catch extra
null bytes at the end of the buffer, but required updating the CRC
validation.
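The effect of the final inversion can be illustrated with a small sketch. It assumes the standard zlib CRC-32 and a little-endian checksum appended to the payload, neither of which is confirmed by this log; `crc_plain` and `crc_inverted` are hypothetical helper names.

```python
import struct
import zlib

def crc_plain(data):
    """CRC-32 register value without the final inversion."""
    return zlib.crc32(data) ^ 0xFFFFFFFF

def crc_inverted(data):
    """Standard zlib CRC-32, which inverts the result at the end."""
    return zlib.crc32(data)

payload = b"journal entry"

# Without inversion, a buffer with its checksum appended checks to zero,
# and it still checks to zero after null bytes are appended.
plain = payload + struct.pack("<I", crc_plain(payload))
plain_ok = crc_plain(plain) == 0
plain_padded_ok = crc_plain(plain + b"\x00\x00") == 0

# With inversion, validation compares against a fixed nonzero residue,
# and trailing null bytes change the result.
RESIDUE = 0x2144DF1C
inverted = payload + struct.pack("<I", crc_inverted(payload))
inverted_ok = crc_inverted(inverted) == RESIDUE
inverted_padded_ok = crc_inverted(inverted + b"\x00\x00") == RESIDUE
```

Under these assumptions the validation check changes from "result is zero" to "result equals the residue constant", which is one way to read the note that CRC validation had to be updated.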
Michael Vrable [Fri, 27 Aug 2010 23:16:17 +0000 (16:16 -0700)]
Add in some support for journal replay.
This isn't all functional yet, but making it fully functional will require
updating the data fields that are written to the journal first...
Michael Vrable [Thu, 26 Aug 2010 00:28:26 +0000 (17:28 -0700)]
Start work on log replay for filesystem recovery.
Right now this implements scanning of one journal segment with consistency
checking to find items that were written out.
Also fix the checksum calculation on log entries so that they will validate
properly (we want to compute the checksum so that on validation, computing
the checksum of the entire object results in a value of zero).
Michael Vrable [Wed, 25 Aug 2010 20:24:34 +0000 (13:24 -0700)]
Add an inode map data structure to track the location of inodes in logs.
Michael Vrable [Tue, 24 Aug 2010 23:53:00 +0000 (16:53 -0700)]
Update logic for flushing data to cloud.
Do not force a commit of the most recent data, and instead just write out
whatever was last written to the journal.
This could be a win or a loss:
- We do not need to force a sync of all data to the journal when we
upload data to the cloud.
- But, we may end up writing out old data, which we'll then need to
overwrite a short time later.
Michael Vrable [Mon, 23 Aug 2010 17:21:37 +0000 (10:21 -0700)]
Make cache size run-time configurable.
Michael Vrable [Sun, 22 Aug 2010 05:42:35 +0000 (22:42 -0700)]
Implement new scheme for retaining needed journal segments.
Write full filesystem snapshots to the cloud, and keep track of the journal
position before the snapshot process starts. When it finishes, the journal
segments before that mark can be reclaimed (if needed).
This could be improved but should at least be safe.
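A minimal sketch of the retention scheme, assuming journal segments are identified by increasing sequence numbers. The `Journal` class and `snapshot` function are hypothetical and only model the bookkeeping, not the actual snapshot code.

```python
class Journal:
    """Hypothetical journal that tracks segment sequence numbers."""
    def __init__(self):
        self.segments = [0]  # segment 0 is open for writing

    def current(self):
        """Sequence number of the segment currently being written."""
        return self.segments[-1]

    def roll(self):
        """Close the current segment and open the next one."""
        self.segments.append(self.current() + 1)

    def reclaim_before(self, mark):
        """Drop every segment older than the mark."""
        self.segments = [s for s in self.segments if s >= mark]

def snapshot(journal, write_snapshot):
    mark = journal.current()      # remember the position before starting
    write_snapshot()              # the journal may keep growing meanwhile
    journal.reclaim_before(mark)  # older segments are covered by the snapshot

journal = Journal()
journal.roll()
journal.roll()                    # segments 0, 1, 2 exist

def write_snapshot():
    journal.roll()                # writes arriving during the snapshot
    journal.roll()

snapshot(journal, write_snapshot)
remaining = list(journal.segments)
```

The segment that was open when the snapshot began is kept, since it may hold entries written after the mark; this errs on the side of safety.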
Michael Vrable [Sun, 22 Aug 2010 04:48:50 +0000 (21:48 -0700)]
Fix a longstanding(?) memory-leak bug when truncating a file.
Michael Vrable [Fri, 20 Aug 2010 20:51:31 +0000 (13:51 -0700)]
Back out dirty reference tracking, as the design was flawed.
Objects can be written to the journal but not to the cloud--for example, if
a data block is written to the journal but overwritten before the file is
flushed to the cloud. This write-combining is good, but the old code for
tracking when a journal segment could be reclaimed couldn't handle this.
So, back out that dirty reference tracking code, in preparation for
replacing it with another approach.
Michael Vrable [Fri, 20 Aug 2010 00:16:53 +0000 (17:16 -0700)]
Make cloud storage more robust.
- Do not consider data committed until we get a reply from the cloud.
- Add retries on write and on read.
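The two points above can be sketched as a retry wrapper that reports success only once the operation returns a reply. The `with_retries` helper, the choice of `OSError` as the transient failure type, and the backoff policy are all assumptions for illustration.

```python
import time

def with_retries(operation, attempts=3, delay=0.0):
    """Call `operation` until it succeeds or `attempts` is exhausted."""
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except OSError as err:
            last_error = err
            time.sleep(delay * (2 ** attempt))  # simple exponential backoff
    raise last_error

attempts_seen = []

def flaky_put():
    """Simulated cloud write that fails twice before succeeding."""
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise OSError("simulated transient failure")
    return "ok"

# The data counts as committed only after a reply has actually arrived.
committed = False
reply = with_retries(flaky_put, attempts=5)
committed = (reply == "ok")
```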
Michael Vrable [Thu, 19 Aug 2010 00:03:29 +0000 (17:03 -0700)]
Add a target size for the cache, and prune the cache when it gets larger.
Michael Vrable [Wed, 18 Aug 2010 20:45:40 +0000 (13:45 -0700)]
Track journal files which contain dirty data and which can be reclaimed.
Michael Vrable [Wed, 18 Aug 2010 01:46:58 +0000 (18:46 -0700)]
Implement a (dumb) cache garbage collector.
This is a proof of concept; it doesn't delete journal files and deletes
cache files nearly as soon as they are unused, so it needs better
algorithms for choosing when to delete files. But it does seem to work.
Michael Vrable [Tue, 17 Aug 2010 22:23:40 +0000 (15:23 -0700)]
Improve journal/cloud cache locking and add access time tracking.
Michael Vrable [Tue, 17 Aug 2010 18:02:56 +0000 (11:02 -0700)]
Debugging/refcount cleanups.
Michael Vrable [Mon, 16 Aug 2010 22:13:53 +0000 (15:13 -0700)]
Minor bugfixes/tweaks.
Michael Vrable [Mon, 16 Aug 2010 19:04:03 +0000 (12:04 -0700)]
First attempt at supporting reading data back from cloud log segments.
There are still some bugs, hacks, race conditions, etc., but this seems to
be doing mostly the right thing and so is a good start.
Michael Vrable [Sat, 14 Aug 2010 22:47:40 +0000 (15:47 -0700)]
Serialized inode data should be dropped from caches, too.
Michael Vrable [Thu, 12 Aug 2010 05:16:35 +0000 (22:16 -0700)]
Attempt at limiting the rate at which memory is dirtied.
Michael Vrable [Wed, 11 Aug 2010 22:58:39 +0000 (15:58 -0700)]
Reference counting bugfix.
Michael Vrable [Wed, 11 Aug 2010 20:15:14 +0000 (13:15 -0700)]
Newly-created inodes should be marked as modified.
The NFS proxy code previously didn't do this, with the result that some
inodes (symlinks were first noticed, but the problem affected other areas
too) would not get entered into the appropriate LRU lists.
Michael Vrable [Wed, 11 Aug 2010 19:43:19 +0000 (12:43 -0700)]
More aggressively use memory-mapped data for cloud log items.
Replace a string with a memory-mapped version as soon as possible when the
item is written out. The intent is that memory-mapped versions rely on the
kernel's memory management, and don't need to be written to swap like a
private copy would, and so should give better overall system memory
management.
Michael Vrable [Wed, 11 Aug 2010 19:29:09 +0000 (12:29 -0700)]
Improve tracking of memory usage in BlueSky.
Most data, except for dirty data blocks not flushed out to the journal yet,
will be in the form of cloud log entries. Create statistics counters to
track how many cloud log items are in each of several states (in memory
only, writeback, on disk, in cloud).
Michael Vrable [Tue, 10 Aug 2010 21:19:49 +0000 (14:19 -0700)]
More fixes to BlueSky cache management.
Michael Vrable [Tue, 10 Aug 2010 00:36:17 +0000 (17:36 -0700)]
Drop old code for flushing data to the cloud.
Michael Vrable [Tue, 10 Aug 2010 00:21:56 +0000 (17:21 -0700)]
Work to unify the cloud segment writing with other cache management.
Michael Vrable [Mon, 9 Aug 2010 23:00:57 +0000 (16:00 -0700)]
Split cloud log segments into modestly-sized chunks.
Michael Vrable [Fri, 6 Aug 2010 21:08:59 +0000 (14:08 -0700)]
Add a null storage implementation.
This simply discards all data written to it. Useful for testing purposes,
if all the data remains in the local log and so we never need to fetch data
from the storage implementation.
Michael Vrable [Thu, 5 Aug 2010 16:48:41 +0000 (09:48 -0700)]
Fix some memory leaks.
Michael Vrable [Thu, 5 Aug 2010 16:38:20 +0000 (09:38 -0700)]
Make links between cloud log entries direct.
Rather than giving the ID provide a direct pointer to the object.
Michael Vrable [Wed, 4 Aug 2010 22:32:00 +0000 (15:32 -0700)]
Rework caching of data blocks to eliminate double-caching.
Previously data could be cached both as a cloud log item and as a string at
the inode level. Now cache as a string for dirty data and a cloud log item
for clean data.
Michael Vrable [Wed, 4 Aug 2010 04:58:51 +0000 (21:58 -0700)]
Fix up reference counting for cloud log items.
Michael Vrable [Wed, 4 Aug 2010 04:21:20 +0000 (21:21 -0700)]
A few attempted bugfixes for log data lifetimes.
A much better fix will depend on reworking this code a bit more.
Michael Vrable [Tue, 3 Aug 2010 21:22:25 +0000 (14:22 -0700)]
Fix up reference counting for memory-mapped journal log segments.
Michael Vrable [Tue, 3 Aug 2010 21:10:21 +0000 (14:10 -0700)]
Improve the reading back of objects committed to the journal.
Implement a cache of memory-mapped log files so that when multiple objects
are requested we can re-use the mapping. Make log files fixed sizes (call
ftruncate when opening the log file) so the entire thing can be memory
mapped at the start.
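A rough Python sketch of the approach, assuming a fixed log size of 4096 bytes and a cache keyed by file name. `LogMapCache` and `LOG_SIZE` are hypothetical names; this is an illustration of the technique, not the code from this commit.

```python
import mmap
import os
import shutil
import tempfile

LOG_SIZE = 4096  # assumed fixed size for every log file

class LogMapCache:
    """Cache of memory-mapped log files, keyed by file name."""
    def __init__(self, directory):
        self.directory = directory
        self.maps = {}

    def get(self, name):
        if name not in self.maps:
            path = os.path.join(self.directory, name)
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
            try:
                # Fix the file size up front so the whole log can be mapped.
                os.ftruncate(fd, LOG_SIZE)
                self.maps[name] = mmap.mmap(fd, LOG_SIZE)
            finally:
                os.close(fd)  # the mapping keeps its own handle
        return self.maps[name]

    def close(self):
        for mapping in self.maps.values():
            mapping.close()
        self.maps.clear()

tmpdir = tempfile.mkdtemp()
cache = LogMapCache(tmpdir)
first = cache.get("journal-0")
first[0:5] = b"hello"
second = cache.get("journal-0")   # reuses the existing mapping
same_mapping = first is second
readback = bytes(second[0:5])
mapped_size = len(first)
cache.close()
shutil.rmtree(tmpdir)
```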
Michael Vrable [Tue, 3 Aug 2010 05:05:28 +0000 (22:05 -0700)]
More cache behavior tweaks.
Michael Vrable [Tue, 3 Aug 2010 03:48:21 +0000 (20:48 -0700)]
Preliminary support for dropping cached file data from memory.
Michael Vrable [Mon, 2 Aug 2010 22:50:37 +0000 (15:50 -0700)]
Work to allow mmap-ed log entries to be used for data blocks.
Michael Vrable [Fri, 30 Jul 2010 23:17:30 +0000 (16:17 -0700)]
Gradually converting code to use cloud logs for storing data.
Michael Vrable [Thu, 29 Jul 2010 22:43:10 +0000 (15:43 -0700)]
Dump cloud location of data items in debug output.
Michael Vrable [Thu, 29 Jul 2010 22:05:37 +0000 (15:05 -0700)]
(Mostly) merge local and cloud logging together.
Michael Vrable [Wed, 28 Jul 2010 19:05:05 +0000 (12:05 -0700)]
Preparatory work before implementing proper cloud writing.
Michael Vrable [Mon, 26 Jul 2010 03:19:13 +0000 (20:19 -0700)]
Some initial work on gathering data into cloud log segments.
This is still in progress, and needs to be better hooked in to cache
management as well as actually writing out a proper sequence of logs
instead of overwriting the same location each time. But it should have the
basics of gathering up data for dirty inodes into a segment and writing it.
Michael Vrable [Thu, 22 Jul 2010 21:51:37 +0000 (14:51 -0700)]
Initial work on cloud log-structured storage.
Right now this is just the first work towards tracking what objects are
stored where (a log in the cloud, in local memory, on local disk, etc.).
Michael Vrable [Tue, 20 Jul 2010 18:54:06 +0000 (11:54 -0700)]
Code cleanup.
Michael Vrable [Mon, 19 Jul 2010 22:31:40 +0000 (15:31 -0700)]
Add checksumming to filesystem journal.
This will be used to check for consistency during log recovery.
Michael Vrable [Mon, 19 Jul 2010 19:46:30 +0000 (12:46 -0700)]
Allow batched log writes when writing dirty inodes.
Michael Vrable [Sun, 18 Jul 2010 04:35:54 +0000 (21:35 -0700)]
Basic filesystem journaling.
Infrastructure for writing log entries synchronously (though the log format
is not yet finished and isn't yet useful), and a partial hook into the
BlueSky filesystem.
Michael Vrable [Thu, 15 Jul 2010 22:54:52 +0000 (15:54 -0700)]
Add synchronous inode logging in the NFS server.
This is in preparation for adding inode logging for data durability.
Michael Vrable [Thu, 15 Jul 2010 22:54:05 +0000 (15:54 -0700)]
Barriers did not handle requests that finished too quickly.
Michael Vrable [Wed, 14 Jul 2010 23:32:08 +0000 (16:32 -0700)]
Commit a few log benchmark results.
Michael Vrable [Wed, 14 Jul 2010 22:25:51 +0000 (15:25 -0700)]
Bugfix.
Michael Vrable [Wed, 14 Jul 2010 22:23:17 +0000 (15:23 -0700)]
Update commit log benchmarks.
Michael Vrable [Wed, 14 Jul 2010 05:25:39 +0000 (22:25 -0700)]
Make the log benchmark configurable and make a parameter sweep script.
Michael Vrable [Wed, 14 Jul 2010 00:30:50 +0000 (17:30 -0700)]
A new microbenchmark tool to figure out what format to use for logs.
We want to log filesystem operations to disk so they are persistent across
proxy crashes, but should do so in a manner that is relatively high
performance... Try to figure out what that should be.