aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorbnewbold <bnewbold@robocracy.org>2018-05-06 20:07:00 -0700
committerGitHub <noreply@github.com>2018-05-06 20:07:00 -0700
commit23fa355409a9d13da30e33886ad3de9ff2f093ef (patch)
tree6f73d6ecd320adfe16d107abe72c79031721483f
parent8e733153d1091d1228ab95654937a73b815c1371 (diff)
parentaede603f4cf2ef68cc5e0ade7c819f518d27f420 (diff)
downloaddat-deps-23fa355409a9d13da30e33886ad3de9ff2f093ef.tar.gz
dat-deps-23fa355409a9d13da30e33886ad3de9ff2f093ef.zip
Merge pull request #3 from bnewbold/dep-hyperdb
Draft: HyperDB
-rw-r--r--proposals/0003-hyperdb.md760
1 files changed, 760 insertions, 0 deletions
diff --git a/proposals/0003-hyperdb.md b/proposals/0003-hyperdb.md
new file mode 100644
index 0000000..1158e4e
--- /dev/null
+++ b/proposals/0003-hyperdb.md
@@ -0,0 +1,760 @@
+
+Title: **DEP-0003: Hyperdb**
+
+Short Name: `0003-hyperdb`
+
+Type: Standard
+
+Status: Draft (as of 2018-05-06)
+
+Github PR: [Draft](https://github.com/datprotocol/DEPs/pull/3)
+
+Authors:
+[Bryan Newbold](https://github.com/bnewbold),
+[Stephen Whitmore](https://github.com/noffle),
+[Mathias Buus](https://github.com/mafintosh)
+
+
+# Summary
+[summary]: #summary
+
+Hyperdb is an abstraction layer providing a general purpose distributed
+key/value store over the hypercore protocol. It is an iteration on the
+hyperdrive directory tree implementation, building on top of the hypercore
+append-only log abstraction layer. Keys are path-like strings (e.g.,
+`/food/fruit/kiwi`), and values are arbitrary binary blobs (generally under a
+megabyte).
+
+Hyperdrive (used by the Dat application) is expected to be re-implemented on
+top of hyperdb for improved performance with many files (e.g., millions). The
+hyperdrive API should be largely unchanged, but the `metadata` format will be
+backwards-incompatible.
+
+
+# Motivation
+[motivation]: #motivation
+
+Hyperdb is expected to drastically improve performance of dat clients when
+working with archives containing tens of thousands of files in single
+directories. This is a real-world bottleneck for several current users, with
+basic local actions such as adding a directory taking an unacceptably long time
+to complete.
+
+A secondary benefit is to refactor the [trie][trie]-structured key/value API
+out of hyperdrive, allowing third party code to build applications directly on
+this abstraction layer.
+
+[trie]: https://en.wikipedia.org/wiki/Trie
+
+
+# Usage Documentation
+[usage-documentation]: #usage-documentation
+
+*This section describes Hyperdb's interface and behavior in the abstract for
+application programmers. It is not intended to be exact documentation of any
+particular implementation (including the reference Javascript module).*
+
+Hyperdb is structured to be used much like a traditional hierarchical
+filesystem. A value can be written and read at locations like `/foo/bar/baz`,
+and the API supports querying or tracking values at subpaths, like how watching
+for changes on `/foo/bar` will report both changes to `/foo/bar/baz` and also
+`/foo/bar/19`.
+
+Lower-level details of the hypercore append-only log, disk serialization, and
+networked synchronization features that Hyperdb builds on top of are not
+described in detail here; see the [DEP repository][deps]. Multi-writer
+hypercore semantics are also not discussed in this DEP.
+
+[deps]: https://github.com/datprotocol/DEPs
+
+A Hyperdb database instance can be represented by a single hypercore feed (or
+several feeds in a multi-writer context), and is named, referenced, and
+discovered using the public and discovery keys of the hypercore feed (or the
+original feed if there are several). In a single-writer configuration, only a
+single node (holding the secret key) can mutate the database (e.g., via `put` or
+`delete` actions).
+
+**Keys** can be any UTF-8 string. Path segments are separated by the forward
+slash character (`/`). Repeated slashes (`//`) are disallowed. Leading and
+trailing `/` are optional in application code: `/hello` and `hello` are
+equivalent. A key can be both a "path segment" and key at the same time; e.g.,
+`/a/b/c` and `/a/b` can both be keys at the same time.
+
+**Values** can be any binary blob, including empty (of zero length). For
+example, values could be UTF-8 encoded strings, JSON encoded objects, protobuf
+messages, or a raw `uint64` integer (of either endian-ness). Length is the only
+form of type or metadata stored about the value; deserialization and validation
+are left to library and application developers.
+
+
+## Core API Semantics
+[core_api]: #core_api
+
+A `db` is instantiated by opening an existing hypercore with hyperdb content
+(read-only, or optionally read-write if the secret key is available), or
+creating a new one. A handle could represent any specific revision in history,
+or the "latest" revision.
+
+`db.put(key, value)`: inserts `value` (arbitrary bytes) under the path `key`.
+Requires read-write access. Returns an error (e.g., via callback) if there was a
+problem.
+
+`db.get(key)`: Reading a non-existent `key` is an error. Read-only.
+
+`db.delete(key)`: Removes the key from the database. Deleting a non-existent
+key is an error. Requires read-write access.
+
+`db.list(prefix)`: returns a flat (not nested) list of all keys currently in
+the database under the given prefix. Prefixes operate on a path-segment basis:
+`/ab` is not a valid prefix for key `/abcd`, but is valid for `/ab/cd`. If the
+prefix does not exist, returns an empty list. The order of returned keys is
+implementation (or configuration) specific. Default listing is recursive
+(implementations may have a flag to control this behavior). Read-only.
+
+If the hypercore underlying a hyperdb is only partially replicated, behavior is
+implementation-specific. For example, a `get()` call could block until the
+relevant value is replicated, or the implementation could return an error.
+
+An example pseudo-code session working with a database might be:
+
+ db.put('/life/animal/mammal/kitten', '{"cuteness": 500.3}')
+ db.put('/life/plant/bush/banana', '{"delicious": 103.4}')
+ db.delete('/life/plant/bush/banana')
+ db.put('/life/plant/tree/banana', '{"delicious": 103.4}')
+ db.get('/life/animal/mammal/kitten')
+ => {"cuteness": 500.3}
+ db.list('/life/')
+ => ['/life/animal/mammal/kitten', '/life/plant/tree/banana']
+
+
+# Reference Documentation
+[reference-documentation]: #reference-documentation
+
+A hyperdb hypercore feed typically consists of a sequence of protobuf-encoded
+messages of "Entry" type. Higher-level protocols may make exception to this,
+for example by prepending an application-specific metadata message as the first
+entry in the feed. There is sometimes a second "content" feed associated with
+the primary hyperdb key/value feed, to store data that does not fit in the
+(limited) `value` size constraint.
+
+The sequence of entries includes an incremental index: the most recent entry in
+the feed contains metadata pointers that can be followed to efficiently look up
+any key in the database without needing to linear scan the entire history or
+generate an independent index data structure. Implementations are, of course,
+free to maintain their own index if they prefer.
+
+The Entry protobuf message schema is:
+
+ message Entry {
+ message Feed {
+ required bytes key = 1;
+ }
+
+ required string key = 1;
+ optional bytes value = 2;
+ required bytes trie = 3;
+ repeated uint64 clock = 4;
+ optional uint64 inflate = 5;
+ repeated Feed feeds = 6;
+ optional bytes contentFeed = 7;
+ }
+
+Some fields are are specific to the multi-writer features described in their
+own DEP and mentioned only partially here. The fields common to both message
+types are:
+
+- `key`: UTF-8 key that this node describes. Leading and trailing forward
+ slashes (`/`) are always stripped before storing in protobuf.
+- `value`: arbitrary byte array. A non-existent `value` entry indicates that
+ this Entry indicates a deletion of the key; this is distinct from specifying
+ an empty (zero-length) value.
+- `trie`: a structured array of pointers to other Entry entries in the feed,
+ used for navigating the tree of keys.
+- `clock`: reserved for use in the forthcoming `multi-writer` standard. An
+ empty list is the safe (and expected) value for `clock` in single-writer use
+ cases.
+- `inflate`: a "pointer" (reference to a feed index number) to the most recent
+ entry in the feed that has the optional `feeds` and `contentFeed` fields set.
+ Not defined if `feeds` and `contentFeed` are defined in the same message.
+- `feeds`: reserved for use with `multi-writer`. The safe single-writer value is
+ to use the current feed's hypercore public key.
+- `contentFeed`: for applications which require a parallel "content" hypercore
+ feed for larger data, this field can be used to store the 32-byte public key
+ for that feed. This field should have a single value for the entire history
+ of the feed (aka, it is not mutable).
+
+For the case of a single-writer feed, not using multi-writer features, it is
+sufficient to write a single Entry message as the first entry in the hypercore
+feed, with `feeds` containing a single entry (a pointer to the current feed
+itself), and `contentFeed` optionally set to a pointer to a paired content
+feed.
+
+If *either* `feeds` *or* `contentFeed` are defined in an entry, *both* fields
+must be defined, so the new message can be referred to via `inflated`. In this
+case the entry is refereed to as an "inflated entry".
+
+
+## Path Hashing
+[path_hashing]: #path_hashing
+
+Every key path has component-wise fixed-size hash representation that is used
+by the trie. The concatenation of all path hashes yields a "path hash array"
+for the entire key. Note that analogously to a hash map data structure, there
+can be more than one key (string) with the same key hash in the same database
+with no problems: the hash points to a linked-list "bucket" of Entries, which
+can be iterated over linearly to find the correct value.
+
+The path hash is represented by an array of bytes. Elements are 2-bit encoded
+(values 0, 1, 2, 3), except for an optional terminating element which has value
+4. Each path element consists of 32 values, representing a 64-bit hash of that
+path element. For example, the key `/tree/willow` has two path segments (`tree`
+and `willow`), and will be represented by a 65 element path hash array (two 32
+element hashes plus a terminator).
+
+The hash algorithm used is `SipHash-2-4`, with an 8-byte output and
+16-byte key; the input is the UTF-8 encoded path string segment, without any
+`/` separators or terminating null bytes. An implementation of this hash
+algorithm is included in the libsodium library in the form of the
+`crypto_shorthash()` function. A 16-byte "secret" key is required; for this use
+case we use all zeros.
+
+When converting the 8-bytehash to an array of 2-bit bytes, the ordering is
+proceed byte-by-byte, and for each take the two lowest-value bits (aka, `hash &
+0x3`) as byte index `0`, the next two bits (aka, `hash & 0xC`) as byte index
+`1`, etc. When concatenating path hashes into a longer array, the first
+("left-most") path element hash will correspond to byte indexes 0 through 31;
+the terminator (`4`) will have the highest byte index.
+
+For example, consider the key `/tree/willow`. `tree` has a hash `[0xAC, 0xDC,
+0x05, 0x6C, 0x63, 0x9D, 0x87, 0xCA]`, which converts into the array:
+
+```
+[ 0, 3, 2, 2, 0, 3, 1, 3, 1, 1, 0, 0, 0, 3, 2, 1, 3, 0, 2, 1, 1, 3, 1, 2, 3, 1, 0, 2, 2, 2, 0, 3 ]
+```
+
+
+`willow` has a 64-bit hash `[0x72, 0x30, 0x34, 0x39, 0x35, 0xA8, 0x21, 0x44]`,
+which converts into the array:
+
+```
+[ 2, 0, 3, 1, 0, 0, 3, 0, 0, 1, 3, 0, 1, 2, 3, 0, 1, 1, 3, 0, 0, 2, 2, 2, 1, 0, 2, 0, 0, 1, 0, 1 ]
+```
+
+These combine into the unified byte array with 65 elements:
+
+```
+[ 0, 3, 2, 2, 0, 3, 1, 3, 1, 1, 0, 0, 0, 3, 2, 1, 3, 0, 2, 1, 1, 3, 1, 2, 3, 1, 0, 2, 2, 2, 0, 3,
+ 2, 0, 3, 1, 0, 0, 3, 0, 0, 1, 3, 0, 1, 2, 3, 0, 1, 1, 3, 0, 0, 2, 2, 2, 1, 0, 2, 0, 0, 1, 0, 1,
+ 4 ]
+```
+
+As another example, the key `/a/b/c` converts into the 97-byte hash array:
+
+```
+[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
+ 0, 1, 2, 3, 2, 2, 2, 0, 3, 1, 1, 3, 0, 3, 1, 3, 0, 1, 0, 1, 3, 2, 0, 2, 2, 3, 2, 2, 3, 3, 2, 3,
+ 0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0,
+ 4 ]
+```
+
+Note that "hash collisions" are rare with this hashing scheme, but are likely
+to occur with large databases (millions of keys), and collision have been
+generated as a proof of concept. Implementations should take care to properly
+handle collisions by verifying keys and following bucket pointers (see the next
+section).
+
+An example hash collision (useful for testing; thanks to Github user
+`dcposch`):
+
+ /mpomeiehc
+ /idgcmnmna
+
+<!--
+
+Generation code (javascript) for the above:
+
+ var sodium = require('sodium-universal')
+ var toBuffer = require('to-buffer')
+
+ var KEY = Buffer.alloc(16)
+ var OUT = Buffer.alloc(8)
+
+ sodium.crypto_shorthash(OUT, toBuffer('tree'), KEY)
+ console.log("tree: ", OUT)
+ console.log(hash('tree', true))
+
+ sodium.crypto_shorthash(OUT, toBuffer('willow'), KEY)
+ console.log("willow: ", OUT)
+ console.log(hash('willow', true))
+
+ sodium.crypto_shorthash(OUT, toBuffer('a'), KEY)
+ console.log("a: ", OUT)
+ console.log(hash('a', true))
+
+Then, to manually "expand" arrays in python3:
+
+ hash_array = [0x72, 0x30, 0x34, 0x39, 0x35, 0xA8, 0x21, 0x44]
+ path = []
+ tmp = [(x & 0x3, (x >> 2) & 0x3, (x >> 4) & 0x3, (x >> 6) & 0x3) for x in hash_array]
+ [path.extend(e) for e in tmp]
+ path
+
+-->
+
+
+## Incremental Index Trie
+[trie_index]: #trie_index
+
+Each node stores a *prefix [trie](https://en.wikipedia.org/wiki/Trie)* that
+can be used to look up other keys, or to list all keys with a given prefix.
+This is stored under the `trie` field of the Entry protobuf message.
+
+The trie effectively mirrors the path hash array. Each element in the `trie`
+is called a "bucket". Each non-empty bucket points to the newest Entries which
+have an identical path up to that specific prefix location; because the trie
+has 4 "values" at each node, there can be pointers to up to 3 other values at a
+given element in the trie array. Buckets can be empty if there are no nodes
+with path hashes that differ for the first time the given bucket (aka, there
+are no "branches" at this node in the trie). Only non-null elements will be
+transmitted or stored on disk.
+
+The data structure of the trie is a sparse array of pointers to other Entry
+entries. Each pointer indicates a feed index and an entry index pointer, and is
+associated with a 2-bit value; for the non-multi-writer case, the feed index is
+always 0, so we consider only the entry index.
+
+For a `trie` with `N` buckets, each may have zero or more pointers. Typically
+there are a maximum of 3 pointers per bucket, corresponding to the 4 possible
+values minus the current Entry's value, but in the event of hash collisions (in
+the path array space), there may be multiple pointers in the same bucket
+corresponding to the same value.
+
+To lookup a key in the database, the recipe is to:
+
+1. Calculate the path hash array for the key you are looking for.
+2. Select the most-recent ("latest") Entry for the feed.
+3. Compare path hash arrays. If the paths match exactly, compare keys; they
+ match, you have found the you were looking for! Check whether the `value` is
+ defined or not; if not, this Entry represents that the key was deleted from
+ the database.
+4. If the paths match, but not the keys, look for a pointer in the last `trie`
+ array index, and iterate from step #3 with the new Entry.
+5. If the paths don't entirely match, find the first index at which the two
+ arrays differ, and look up the corresponding element in this Entry's `trie`
+ array. If the element is empty, or doesn't have a pointer corresponding to
+ your 2-bit value, then your key does not exist in this hyperdb.
+6. If the trie element is not empty, then follow that pointer to select the
+ next Entry. Recursively repeat this process from step #3; you will be
+ descending the `trie` in a search, and will either terminate in the Entry you
+ are looking for, or find that the key is not defined in this hyperdb.
+
+Similarly, to write a key to the database:
+
+1. Calculate the path hash array for the key, and start with an empty `trie` of
+ the same length; you will write to the `trie` array from the current index,
+ which starts at 0.
+2. Select the most-recent ("latest") Entry for the feed.
+3. Compare path hash arrays. If the paths match exactly, and the keys match, then
+ you are overwriting the current Entry, and can copy the "remainder" of it's
+ `trie` up to your current `trie` index.
+4. If the paths match, but not the keys, you are adding a new key to an
+ existing hash bucket. Copy the `trie` and extend it to the full length. Add
+ a pointer to the Entry with the same hash at the final array index.
+5. If the paths don't entirely match, find the first index at which the two
+ arrays differ. Copy all `trie` elements (empty or not) into the new `trie`
+ for indices between the "current index" and the "differing index".
+6. Next look up the corresponding element in this Entry's `trie` array at the
+ differing index. If this element is empty, then you have found the most
+ similar Entry. Write a pointer to this node to the `trie` at
+ the differing index, and you are done (all remaining `trie` elements are
+ empty, and can be omitted).
+7. If the differing tree element has a pointer (is not empty), then follow that
+ pointer to select the next `Entry`. Recursively repeat this process from step
+ #3.
+
+To delete a value, follow the same procedure as adding a key, but write the
+`Entry` without a `value` (in protobuf, this is distinct from having a `value`
+field with zero bytes). Deletion nodes may persist in the database forever.
+
+
+## Binary Trie Encoding
+[trie_encoding]: #trie_encoding
+
+The following scheme is used to encode trie data structures (sparse, indexed
+arrays of pointers to entries) into a variable-length bytestring as the `trie`
+field of an Entry protobuf message.
+
+Consider a trie array with `N` buckets and `M` non-empty buckets (`0 <= M <=
+N`). In the encoded field, there will be `M` concatenated bytestrings of the
+form:
+
+- trie index (varint), followed by...
+- bucket bitfield (packed in a varint), encoding which of the 5 values (4
+ values if the index is not modulo 32) at this node of the trie have pointers,
+ followed by one or more...
+- pointer sets, each referencing an entry at (feed index, entry index):
+ - feed index (varint, with a extra "more pointers at this value" low bit,
+ encoded as `feed_index << 1 | more_bit`)
+ - entry index (varint)
+
+In the common case for a small/sparse hyperdb, there will a small number of
+non-empty buckets (small `M`), a usually a single `(feed index, entry index)`
+pointer for those non-empty buckets. For a very large/dense hyperdb (millions
+of key/value pairs), there will be many non-empty buckets (`M` approaching
+`N`), and buckets may have up to the full 4 pointer sets. Even with millions of
+entries, hash collisions will be very rare; in those cases there will be
+multiple pointers in the same pointer set.
+
+Consider an entry with path hash:
+
+```
+[ 1, 1, 0, 0, 3, 1, 2, 3, 3, 1, 1, 1, 2, 2, 1, 1, 1, 0, 2, 3, 3, 0, 1, 2, 1, 1, 2, 3, 0, 0, 2, 1,
+ 0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 3, 1, 0, 0, 2, 3, 0, 1, 3, 2, 0, 3, 2, 0, 1, 0, 3, 2, 0, 2, 1, 1,
+ 4 ]
+```
+
+and trie:
+
+```
+[ , { val: 1, feed: 0, index: 1 } ]
+```
+
+In this case `N` is 64 (or you could count as 2 if you ignore trailing empty
+entries) and `M` is 1. There will be a single bytestring chunk:
+
+- index varint is `1` (second element in the trie array)
+- bitfield is `0b0010` (varint 2): there is only a pointer set for value 1 (the second value)
+- there is a single pointer in the pointer set, which is (`feed=0 << 1 | 0`,
+ `index=1`), or (varint 2, varint 1)
+
+Combined, the `trie` bytestring will be:
+
+```
+[0x01, 0x02, 0x02, 0x02]
+```
+
+For a more complex example, consider the same path hash, but the trie:
+
+```
+[ , { val: 1, feed: 0, index: 1; val: 2, feed: 5, index: 3; val: 3, feed: 6, index: 98 }, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
+ , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
+ { val: 4, feed: 0, index, 23; val: 4, feed: 1, index: 17 ]
+```
+
+Now `M` is 2. The first bytestring chunk will be:
+
+- index varint is `1` (second element in the trie array)
+- bitfield is `0b01110` (varint 9): there are three pointer sets
+- first pointer set is (`feed=0 << 1 | 0`, `index=1`) or (varint 2, varint 1)
+- second pointer set is (`feed=5 << 1 | 0`, `index=3`) or (varint 10, varint 3)
+- third pointer set is (`feed=6 << 1 | 0`, `index=98`) or (varint 12, varint 98)
+
+the second bytestring chunk would be:
+
+- index varint is `64` (65th element in the trie array; the terminating value)
+- bitfield is `0b10000` (varint 1); there is a single pointer set... but it
+ contains a hash collision, so there are two pointers
+- first pointer is (`feed=0 << 1 | 1`, `index=23`) or (varint 1, varint=23);
+ note the `more` bit is set high!
+- second pointer is (`feed=1 << 1 | 0`, `index=17`) or (varint 2, varint 17);
+ note the `more` bit is low, as usual. In the extremely unlikely case of
+ multiple collisions there could have been more pointers with `more` high
+ preceding this one.
+
+The overall bytestring would be:
+
+```
+[0x01, 0x09, 0x02, 0x01, 0x0A, 0x03, 0x0C, 0x62, 0x40, 0x10, 0x01, 0x17, 0x02, 0x11]
+```
+
+
+# Examples
+[examples]: #examples
+
+
+## Simple Put and Get
+
+Starting with an empty hyperdb `db`, if we `db.put('/a/b', '24')` we expect to
+see a single `Entry` and index 0:
+
+```
+{ key: 'a/b',
+ value: '24',
+ trie:
+ [ ] }
+```
+
+For reference, the path hash array for this key (index 0) is:
+
+```
+[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
+ 0, 1, 2, 3, 2, 2, 2, 0, 3, 1, 1, 3, 0, 3, 1, 3, 0, 1, 0, 1, 3, 2, 0, 2, 2, 3, 2, 2, 3, 3, 2, 3,
+ 4 ]
+```
+
+Note that the first 64 bytes in path match those of the `/a/b/c` example from
+the [path hashing][path_hash] section, because the first two path components
+are the same. Since this is the first entry, the entry index is 0.
+
+Now we `db.put('/a/c', 'hello')` and expect a second Entry:
+
+```
+{ key: 'a/c',
+ value: 'hello',
+ trie:
+ [ , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
+ , , { element: 2, feed: 0, index: 0 } ] }
+```
+
+The path hash array for this key (index 1) is:
+
+```
+[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
+ 0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0,
+ 4 ]
+```
+
+The first 32 characters of path are common with the first Entry (they share a
+common prefix, `/a`).
+
+`trie` is defined, but mostly sparse. The first 32 elements of common prefix
+match the first Entry, and then two additional hash elements (`[0, 1]`) happen
+to match as well; there is not a differing entry until index 34 (zero-indexed).
+At this entry there is a reference pointing to the first Entry. An additional 29
+trailing null entries have been trimmed in reduce metadata overhead.
+
+Next we insert a third node with `db.put('/x/y', 'other')`, and get a third Entry:
+
+```
+{ key: 'x/y',
+ value: 'other',
+ trie:
+ [ , { val: 1, feed: 0, index: 1 } ],
+```
+
+The path hash array for this key (index 2) is:
+
+```
+[ 1, 1, 0, 0, 3, 1, 2, 3, 3, 1, 1, 1, 2, 2, 1, 1, 1, 0, 2, 3, 3, 0, 1, 2, 1, 1, 2, 3, 0, 0, 2, 1,
+ 0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 3, 1, 0, 0, 2, 3, 0, 1, 3, 2, 0, 3, 2, 0, 1, 0, 3, 2, 0, 2, 1, 1,
+ 4 ]
+```
+
+Consider the lookup-up process for `db.get('/a/b')` (which we expect to
+successfully return `'24'`, as written in the first Entry). First we calculate
+the path for the key `a/b`, which will be the same as the first Entry. Then we
+take the "latest" Entry, with entry index 2. We compare the path hash arrays,
+starting at the first element, and find the first difference at index 1 (`1 ==
+1`, then `1 != 2`). We look at index 1 in the current Entry's `trie` and find a
+pointer to entry index 1, so we fetch that Entry and recurse. Comparing path
+hash arrays, we now get all the way to index 34 before there is a difference.
+We again look in the `trie`, find a pointer to entry index 0, and fetch the
+first Entry and recurse. Now the path elements match exactly; we have found the
+Entry we are looking for, and it has an existent `value`, so we return the
+`value`.
+
+Consider a lookup for `db.get('/a/z')`; this key does not exist, so we expect
+to return with "key not found". We calculate the path hash array for this key:
+
+```
+[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
+ 1, 2, 3, 0, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 0, 1, 0, 3, 3, 2, 0, 3, 3, 1, 1, 0, 2, 1, 0, 1, 1, 2,
+ 4 ]
+```
+
+Similar to the first lookup, we start with entry index 2 and follow the pointer to
+entry index 1. This time, when we compare path hash arrays, the first differing
+entry is at array index `32`. There is no `trie` entry at this index, which
+tells us that the key does not exist in the database.
+
+## Listing a Prefix
+
+Continuing with the state of the database above, we call `db.list('/a')` to
+list all keys with the prefix `/a`.
+
+We generate a path hash array for the key `/a`, without the terminating symbol
+(`4`):
+
+```
+[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2 ]
+```
+
+Using the same process as a `get()` lookup, we find the first Entry that
+entirely matches this prefix, which will be entry index 1. If we had failed to
+find any Entry with a complete prefix match, then we would return an empty list
+of matching keys.
+
+Starting with the first prefix-matching node, we save that key as a match
+(unless the Entry is a deletion), then select all `trie` pointers with an index
+higher than the prefix length, and recursively inspect all pointed-to Entries.
+
+## Deleting a Key
+
+Continuing with the state of the database above, we call `db.delete('/a/c')` to
+remove that key from the database.
+
+The process is almost entirely the same as inserting a new Entry at that key,
+except that the `value` field is undefined. The new Entry (at entry index 3)
+is:
+
+```
+{ key: 'a/c',
+ value: ,
+ trie: [ , { val: 1, feed: 0, index: 2 }, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
+ , , { val: 1, feed: 0, index: 0 } ] }
+```
+
+The path hash array for this Entry (key) is:
+
+```
+[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
+ 0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0,
+ 4 ]
+```
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+A backwards-incompatible change will have negative effects on the broader dat
+ecosystem: clients will need to support both versions protocol for some time
+(increasing maintenance burden), future clients may not inter-operate with old
+archives, etc. These downsides can partially be avoided by careful roll-out.
+
+For the specific use case of Dat archives, hyperdb will trivially increase
+metadata size (and thus disk and network consumption) for archives with few
+files.
+
+
+# Overhead and Scaling
+[overhead]: #overhead
+
+The metadata overhead for a single database entry varies based on the size of
+the database. In a "heavy" case, considering a two-path-segment key with an
+entirely saturated `trie` and `uint32`-sized feed and entry index pointers, and
+ignoring multi-writer fields:
+
+- `trie`: 4 * 2 * 64 bytes = 512 bytes
+- total: 512 bytes
+
+In a "light" case, with few `trie` entries and single-byte varint feed and
+entry index pointers:
+
+- `trie`: 2 * 2 * 4 bytes = 16 bytes
+- total: 16
+
+For a database with most keys having N path segments, the cost of a `get()`
+scales with the number of entries M as `O(log(M))` with best case 1 lookup and
+worst case `4 * 32 * N = 128 * N` lookups (for a saturated `trie`).
+
+The cost of a `put()` or `delete()` is proportional to the cost of a `get()`.
+
+The cost of a `list()` is linear (`O(M)`) in the number of matching entries,
+plus the cost of a single `get()`.
+
+The total metadata overhead for a database with M entries scales with `O(M
+* log(M))`.
+
+
+# Privacy and Security Considerations
+[privacy]: #privacy
+
+The basic key/value semantics of hyperdb (as discussed in this DEP, not
+considering multi-writer changes) are not known to introduce new privacy issues
+when compared with, e.g., storing binary values at key-like paths using the
+current ("legacy") hyperdrive system.
+
+A malicious writer could cause trouble for readers, even readers who do not
+trust the application-level contents of a feed. Implementations which may be
+exposed to arbitrary feeds from unknown sources on the internet should take
+care to the following scenarios: A malicious writer may be able to produce very
+frequent key path hash collisions, which could degrade to linear performance. A
+malicious writer could send broken trie structures that contain pointer loops,
+duplicate pointers, or other invalid contents. A malicious writer could write
+arbitrary data in value fields in an attempt to exploit de-serialization bugs.
+A malicious writer could leverage non-printing unicode characters to create
+database entries with user-indistinguishable names (keys).
+
+
+# Rationale and alternatives
+[alternatives]: #alternatives
+
+A major motivator for hyperdb is to improve scaling performance with tens of
+thousands through millions of files per directory in the existing hyperdrive
+implementation. The current implementation requires the most recent node in a
+directory to point to all other nodes in the directory. Even with pointer
+compression, this requires on the order of `O(N^2)` bytes; the hyperdb
+implementation scales with `O(N log(N))`.
+
+The hyperdb specification (this document) is complicated by the inclusion of
+new protobuf fields to support "multi-writer" features which are not described
+here. The motivation to include these fields now to make only a single
+backwards-incompatible schema change, and to make a second software-only change
+in the future to enable support for these features. Schema and data format
+changes are considered significantly more "expensive" for the community and
+software ecosystem compared to software-only changes. Attempts have been made
+in this specification to indicate the safe "single-writer-only" values to use
+for these fields.
+
+
+# Dat migration logistics
+[migration]: #migration
+
+Hyperdb is not backwards compatible with the existing hyperdrive metadata,
+meaning dat clients may need to support both versions during a transition
+period. This applies both to archives saved to disk (e,g., in SLEEP) and to
+archives received and published to peers over the network.
+
+No changes to the Dat network wire protocol itself are necessary, only changes
+to content passed over the protocol. The Dat `content` feed, containing raw
+file data, is not impacted by hyperdb, only the contents of the `metadata`
+feed.
+
+Upgrading a Dat (hyperdrive) archive to hyperdb will necessitate creating a new
+feed from scratch, meaning new public/private key pairs, and that public key
+URL links will need to change.
+
+Further logistical details are left to the forthcoming Multi-Writer DEP.
+
+
+# Unresolved questions
+[unresolved]: #unresolved-questions
+
+Need to think through deletion process with respect to listing a path prefix;
+will previously deleted nodes be occluded, or potentially show up in list
+results? Should be reviewed (by a non-author of this document) before accepted
+as a Draft.
+
+Can the deletion process (currently leaving "tombstone" entries in the `trie`
+forever) be improved, such that these entries don't need to be iterated over?
+mafintosh mentioned this might be in the works. Does this DEP need to "leave
+room" for those changes, or should we call out the potential for future change?
+Probably not, should only describe existing solutions. This can be resolved
+after Draft.
+
+There are implied "reasonable" limits on the size (in bytes) of both keys and
+values, but they are not formally specified. Protobuf messages have a hard
+specified limit of 2 GByte (due to 32-bit signed arithmetic), and most
+implementations impose a (configurable) 64 MByte limit. Should this DEP impose
+specific limits on key and value sizes? Would be good to decide before Draft
+status.
+
+Apart from leaving fields in the protobuf message specification, multi-writer
+concerns are explicitly out of scope for this DEP.
+
+
+# Changelog
+[changelog]: #changelog
+
+As of March 2018, Mathias Buus (@mafintosh) is leading development of a hyperdb
+nodejs module on [github](https://github.com/mafintosh/hyperdb), which is the
+basis for this DEP.
+
+- 2017-12-06: Stephen Whitmore (@noffle) publishes `ARCHITECTURE.md` overview
+ in the [hyperdb github repo][arch_md]
+- 2018-03-04: First draft for review
+- 2018-03-15: Hyperdb v3.0.0 is released
+- 2018-04-18: This DEP submitted for Draft review.
+- 2018-05-06: Merged as Draft after WG approval.
+
+[arch_md]: https://github.com/mafintosh/hyperdb/blob/master/ARCHITECTURE.md