From 1e5933b2f940f66962de576ef4f06ed9f4f30507 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Sun, 6 May 2018 20:12:05 -0700
Subject: fix numbering race condition

... quick, before anybody notices! :)

DEP-0003 (HTTP Pinning Service API) was merged weeks ago, but I hadn't
merged to my local repo.
---
 proposals/0003-hyperdb.md | 760 ----------------------------------------
 proposals/0004-hyperdb.md | 760 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 760 insertions(+), 760 deletions(-)
 delete mode 100644 proposals/0003-hyperdb.md
 create mode 100644 proposals/0004-hyperdb.md

diff --git a/proposals/0003-hyperdb.md b/proposals/0003-hyperdb.md
deleted file mode 100644
index 1158e4e..0000000
--- a/proposals/0003-hyperdb.md
+++ /dev/null
diff --git a/proposals/0004-hyperdb.md b/proposals/0004-hyperdb.md
new file mode 100644
index 0000000..d551b64
--- /dev/null
+++ b/proposals/0004-hyperdb.md

Title: **DEP-0004: Hyperdb**

Short Name: `0004-hyperdb`

Type: Standard

Status: Draft (as of 2018-05-06)

Github PR: [Draft](https://github.com/datprotocol/DEPs/pull/3)

Authors:
[Bryan Newbold](https://github.com/bnewbold),
[Stephen Whitmore](https://github.com/noffle),
[Mathias Buus](https://github.com/mafintosh)


# Summary
[summary]: #summary

Hyperdb is an abstraction layer providing a general-purpose distributed
key/value store over the hypercore protocol. It is an iteration on the
hyperdrive directory tree implementation, building on top of the hypercore
append-only log abstraction layer. Keys are path-like strings (e.g.,
`/food/fruit/kiwi`), and values are arbitrary binary blobs (generally under a
megabyte).

Hyperdrive (used by the Dat application) is expected to be re-implemented on
top of hyperdb for improved performance with many files (e.g., millions). The
hyperdrive API should be largely unchanged, but the `metadata` format will be
backwards-incompatible.


# Motivation
[motivation]: #motivation

Hyperdb is expected to drastically improve the performance of dat clients
working with archives containing tens of thousands of files in single
directories. This is a real-world bottleneck for several current users, with
basic local actions such as adding a directory taking an unacceptably long
time to complete.

A secondary benefit is to refactor the [trie][trie]-structured key/value API
out of hyperdrive, allowing third-party code to build applications directly on
this abstraction layer.
[trie]: https://en.wikipedia.org/wiki/Trie


# Usage Documentation
[usage-documentation]: #usage-documentation

*This section describes Hyperdb's interface and behavior in the abstract for
application programmers. It is not intended to be exact documentation of any
particular implementation (including the reference Javascript module).*

Hyperdb is structured to be used much like a traditional hierarchical
filesystem. A value can be written and read at locations like `/foo/bar/baz`,
and the API supports querying or tracking values at subpaths; for example,
watching for changes on `/foo/bar` will report changes to both `/foo/bar/baz`
and `/foo/bar/19`.

Lower-level details of the hypercore append-only log, disk serialization, and
networked synchronization features that Hyperdb builds on top of are not
described in detail here; see the [DEP repository][deps]. Multi-writer
hypercore semantics are also not discussed in this DEP.

[deps]: https://github.com/datprotocol/DEPs

A Hyperdb database instance can be represented by a single hypercore feed (or
several feeds in a multi-writer context), and is named, referenced, and
discovered using the public and discovery keys of the hypercore feed (or the
original feed if there are several). In a single-writer configuration, only a
single node (holding the secret key) can mutate the database (e.g., via `put`
or `delete` actions).

**Keys** can be any UTF-8 string. Path segments are separated by the forward
slash character (`/`). Repeated slashes (`//`) are disallowed. Leading and
trailing `/` are optional in application code: `/hello` and `hello` are
equivalent. A path can serve as both a prefix and a key at the same time;
e.g., `/a/b/c` and `/a/b` can both be keys simultaneously.

**Values** can be any binary blob, including the empty blob (of zero length).
For example, values could be UTF-8 encoded strings, JSON-encoded objects,
protobuf messages, or a raw `uint64` integer (of either endianness). Length is
the only form of type or metadata stored about the value; deserialization and
validation are left to library and application developers.


## Core API Semantics
[core_api]: #core_api

A `db` is instantiated by opening an existing hypercore with hyperdb content
(read-only, or optionally read-write if the secret key is available), or by
creating a new one. A handle could represent any specific revision in history,
or the "latest" revision.

`db.put(key, value)`: inserts `value` (arbitrary bytes) under the path `key`.
Requires read-write access. Returns an error (e.g., via callback) if there was
a problem.

`db.get(key)`: returns the most recent value stored under the path `key`.
Reading a non-existent `key` is an error. Read-only.

`db.delete(key)`: removes the key from the database. Deleting a non-existent
key is an error. Requires read-write access.

`db.list(prefix)`: returns a flat (not nested) list of all keys currently in
the database under the given prefix. Prefixes operate on a path-segment basis:
`/ab` is not a valid prefix for the key `/abcd`, but is valid for `/ab/cd`. If
the prefix does not exist, returns an empty list. The order of returned keys
is implementation (or configuration) specific. Default listing is recursive
(implementations may have a flag to control this behavior). Read-only.

If the hypercore underlying a hyperdb is only partially replicated, behavior
is implementation-specific. For example, a `get()` call could block until the
relevant value is replicated, or the implementation could return an error.
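To make the path-segment prefix rule for `db.list()` concrete, here is a
minimal sketch in Javascript. The `isPathPrefix` helper is hypothetical and
illustrative only; it is not part of any implementation's public API.

```
// Compare keys segment-by-segment, ignoring leading/trailing slashes.
function isPathPrefix (prefix, key) {
  const split = (k) => k.split('/').filter((s) => s.length > 0)
  const p = split(prefix)
  const k = split(key)
  return p.length <= k.length && p.every((seg, i) => seg === k[i])
}

isPathPrefix('/ab', '/abcd')  // => false: 'ab' does not match segment 'abcd'
isPathPrefix('/ab', '/ab/cd') // => true
```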
An example pseudo-code session working with a database might be:

    db.put('/life/animal/mammal/kitten', '{"cuteness": 500.3}')
    db.put('/life/plant/bush/banana', '{"delicious": 103.4}')
    db.delete('/life/plant/bush/banana')
    db.put('/life/plant/tree/banana', '{"delicious": 103.4}')
    db.get('/life/animal/mammal/kitten')
    => {"cuteness": 500.3}
    db.list('/life/')
    => ['/life/animal/mammal/kitten', '/life/plant/tree/banana']


# Reference Documentation
[reference-documentation]: #reference-documentation

A hyperdb hypercore feed typically consists of a sequence of protobuf-encoded
messages of "Entry" type. Higher-level protocols may make exceptions to this,
for example by prepending an application-specific metadata message as the
first entry in the feed. There is sometimes a second "content" feed associated
with the primary hyperdb key/value feed, to store data that does not fit in
the (limited) `value` size constraint.

The sequence of entries includes an incremental index: the most recent entry
in the feed contains metadata pointers that can be followed to efficiently
look up any key in the database without needing to linearly scan the entire
history or generate an independent index data structure. Implementations are,
of course, free to maintain their own index if they prefer.

The Entry protobuf message schema is:

    message Entry {
      message Feed {
        required bytes key = 1;
      }

      required string key = 1;
      optional bytes value = 2;
      required bytes trie = 3;
      repeated uint64 clock = 4;
      optional uint64 inflate = 5;
      repeated Feed feeds = 6;
      optional bytes contentFeed = 7;
    }

Some fields are specific to the multi-writer features described in their own
DEP and are mentioned only in passing here. The fields are:

- `key`: UTF-8 key that this node describes. Leading and trailing forward
  slashes (`/`) are always stripped before storing in protobuf.
- `value`: arbitrary byte array. An unset `value` indicates that this Entry
  records a deletion of the key; this is distinct from specifying an empty
  (zero-length) value.
- `trie`: a structured array of pointers to other Entry entries in the feed,
  used for navigating the tree of keys.
- `clock`: reserved for use in the forthcoming `multi-writer` standard. An
  empty list is the safe (and expected) value for `clock` in single-writer use
  cases.
- `inflate`: a "pointer" (reference to a feed index number) to the most recent
  entry in the feed that has the optional `feeds` and `contentFeed` fields
  set. Not set if `feeds` and `contentFeed` are defined in the same message.
- `feeds`: reserved for use with `multi-writer`. The safe single-writer value
  is to use the current feed's hypercore public key.
- `contentFeed`: for applications which require a parallel "content" hypercore
  feed for larger data, this field can be used to store the 32-byte public key
  for that feed. This field should have a single value for the entire history
  of the feed (aka, it is not mutable).

For the case of a single-writer feed, not using multi-writer features, it is
sufficient to write a single Entry message as the first entry in the hypercore
feed, with `feeds` containing a single entry (a pointer to the current feed
itself), and `contentFeed` optionally set to a pointer to a paired content
feed.

If *either* `feeds` *or* `contentFeed` is defined in an entry, *both* fields
must be defined, so the new message can be referred to via `inflate`. In this
case the entry is referred to as an "inflated entry".
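As an illustration, here is a minimal Javascript sketch of such a first entry,
as a plain object mirroring the Entry schema. The `feedPublicKey` and
`contentFeedKey` parameters are assumed inputs (the feed's own 32-byte
hypercore public key, and an optional paired content feed key); serializing
the object to protobuf bytes is left to a protobuf library.

```
// Safe single-writer values for the first (inflated) entry in a fresh feed.
function firstSingleWriterEntry (key, value, feedPublicKey, contentFeedKey) {
  return {
    key: key.replace(/^\/+|\/+$/g, ''), // leading/trailing slashes stripped
    value: value,                       // leave unset to record a deletion
    trie: Buffer.alloc(0),              // empty trie: nothing to point at yet
    clock: [],                          // safe single-writer value
    feeds: [{ key: feedPublicKey }],    // a pointer to this feed itself
    contentFeed: contentFeedKey         // optional paired content feed key
  }
}
```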
In this +case the entry is refereed to as an "inflated entry". + + +## Path Hashing +[path_hashing]: #path_hashing + +Every key path has component-wise fixed-size hash representation that is used +by the trie. The concatenation of all path hashes yields a "path hash array" +for the entire key. Note that analogously to a hash map data structure, there +can be more than one key (string) with the same key hash in the same database +with no problems: the hash points to a linked-list "bucket" of Entries, which +can be iterated over linearly to find the correct value. + +The path hash is represented by an array of bytes. Elements are 2-bit encoded +(values 0, 1, 2, 3), except for an optional terminating element which has value +4. Each path element consists of 32 values, representing a 64-bit hash of that +path element. For example, the key `/tree/willow` has two path segments (`tree` +and `willow`), and will be represented by a 65 element path hash array (two 32 +element hashes plus a terminator). + +The hash algorithm used is `SipHash-2-4`, with an 8-byte output and +16-byte key; the input is the UTF-8 encoded path string segment, without any +`/` separators or terminating null bytes. An implementation of this hash +algorithm is included in the libsodium library in the form of the +`crypto_shorthash()` function. A 16-byte "secret" key is required; for this use +case we use all zeros. + +When converting the 8-bytehash to an array of 2-bit bytes, the ordering is +proceed byte-by-byte, and for each take the two lowest-value bits (aka, `hash & +0x3`) as byte index `0`, the next two bits (aka, `hash & 0xC`) as byte index +`1`, etc. When concatenating path hashes into a longer array, the first +("left-most") path element hash will correspond to byte indexes 0 through 31; +the terminator (`4`) will have the highest byte index. + +For example, consider the key `/tree/willow`. `tree` has a hash `[0xAC, 0xDC, +0x05, 0x6C, 0x63, 0x9D, 0x87, 0xCA]`, which converts into the array: + +``` +[ 0, 3, 2, 2, 0, 3, 1, 3, 1, 1, 0, 0, 0, 3, 2, 1, 3, 0, 2, 1, 1, 3, 1, 2, 3, 1, 0, 2, 2, 2, 0, 3 ] +``` + + +`willow` has a 64-bit hash `[0x72, 0x30, 0x34, 0x39, 0x35, 0xA8, 0x21, 0x44]`, +which converts into the array: + +``` +[ 2, 0, 3, 1, 0, 0, 3, 0, 0, 1, 3, 0, 1, 2, 3, 0, 1, 1, 3, 0, 0, 2, 2, 2, 1, 0, 2, 0, 0, 1, 0, 1 ] +``` + +These combine into the unified byte array with 65 elements: + +``` +[ 0, 3, 2, 2, 0, 3, 1, 3, 1, 1, 0, 0, 0, 3, 2, 1, 3, 0, 2, 1, 1, 3, 1, 2, 3, 1, 0, 2, 2, 2, 0, 3, + 2, 0, 3, 1, 0, 0, 3, 0, 0, 1, 3, 0, 1, 2, 3, 0, 1, 1, 3, 0, 0, 2, 2, 2, 1, 0, 2, 0, 0, 1, 0, 1, + 4 ] +``` + +As another example, the key `/a/b/c` converts into the 97-byte hash array: + +``` +[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2, + 0, 1, 2, 3, 2, 2, 2, 0, 3, 1, 1, 3, 0, 3, 1, 3, 0, 1, 0, 1, 3, 2, 0, 2, 2, 3, 2, 2, 3, 3, 2, 3, + 0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0, + 4 ] +``` + +Note that "hash collisions" are rare with this hashing scheme, but are likely +to occur with large databases (millions of keys), and collision have been +generated as a proof of concept. Implementations should take care to properly +handle collisions by verifying keys and following bucket pointers (see the next +section). 
An example hash collision (useful for testing; thanks to Github user
`dcposch`):

    /mpomeiehc
    /idgcmnmna


## Incremental Index Trie
[trie_index]: #trie_index

Each node stores a *prefix [trie](https://en.wikipedia.org/wiki/Trie)* that
can be used to look up other keys, or to list all keys with a given prefix.
This is stored under the `trie` field of the Entry protobuf message.

The trie effectively mirrors the path hash array. Each element in the `trie`
is called a "bucket". Each non-empty bucket points to the newest Entries which
have an identical path up to that specific prefix location; because the trie
has 4 "values" at each node, there can be pointers to up to 3 other values at
a given element in the trie array. Buckets can be empty if there are no nodes
with path hashes that differ for the first time at the given bucket (aka,
there are no "branches" at this node in the trie). Only non-null elements will
be transmitted or stored on disk.

The data structure of the trie is a sparse array of pointers to other Entry
entries. Each pointer indicates a feed index and an entry index, and is
associated with a 2-bit value; for the non-multi-writer case, the feed index
is always 0, so we consider only the entry index.

For a `trie` with `N` buckets, each may have zero or more pointers. Typically
there are a maximum of 3 pointers per bucket, corresponding to the 4 possible
values minus the current Entry's value, but in the event of hash collisions
(in the path array space), there may be multiple pointers in the same bucket
corresponding to the same value.

To look up a key in the database, the recipe is to (a code sketch follows at
the end of this section):

1. Calculate the path hash array for the key you are looking for.
2. Select the most-recent ("latest") Entry for the feed.
3. Compare path hash arrays. If the paths match exactly, compare keys; if they
   match, you have found the Entry you were looking for! Check whether the
   `value` is defined or not; if not, this Entry represents that the key was
   deleted from the database.
4. If the paths match, but not the keys, look for a pointer in the last `trie`
   array index, and iterate from step #3 with the new Entry.
5. If the paths don't entirely match, find the first index at which the two
   arrays differ, and look up the corresponding element in this Entry's `trie`
   array. If the element is empty, or doesn't have a pointer corresponding to
   your 2-bit value, then your key does not exist in this hyperdb.
6. If the trie element is not empty, then follow that pointer to select the
   next Entry. Recursively repeat this process from step #3; you will be
   descending the `trie` in a search, and will either terminate in the Entry
   you are looking for, or find that the key is not defined in this hyperdb.

Similarly, to write a key to the database:

1. Calculate the path hash array for the key, and start with an empty `trie`
   of the same length; you will write to the `trie` array from the current
   index, which starts at 0.
2. Select the most-recent ("latest") Entry for the feed.
3. Compare path hash arrays. If the paths match exactly, and the keys match,
   then you are overwriting the current Entry, and can copy the "remainder" of
   its `trie` up to your current `trie` index.
4. If the paths match, but not the keys, you are adding a new key to an
   existing hash bucket. Copy the `trie` and extend it to the full length. Add
   a pointer to the Entry with the same hash at the final array index.
5. If the paths don't entirely match, find the first index at which the two
   arrays differ. Copy all `trie` elements (empty or not) into the new `trie`
   for indices between the "current index" and the "differing index".
6. Next, look up the corresponding element in this Entry's `trie` array at the
   differing index. If this element is empty, then you have found the most
   similar Entry. Write a pointer to that Entry into your new `trie` at the
   differing index, and you are done (all remaining `trie` elements are empty,
   and can be omitted).
7. If the differing trie element has a pointer (is not empty), then follow
   that pointer to select the next `Entry`. Recursively repeat this process
   from step #3.

To delete a value, follow the same procedure as adding a key, but write the
`Entry` without a `value` (in protobuf, this is distinct from having a `value`
field with zero bytes). Deletion nodes may persist in the database forever.
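Here is a minimal sketch in Javascript of the single-writer lookup recipe
above, operating on an in-memory array of entries. The representation is
illustrative, not any implementation's actual storage format: each entry is
`{ key, value, trie }`, where `trie` is a sparse array of buckets and each
bucket maps a 2-bit value to a list of entry-index pointers (the feed index is
always 0 here). `pathHashArray` is the helper from the path-hashing sketch
above.

```
function get (entries, key) {
  if (entries.length === 0) return undefined
  const target = pathHashArray(key)
  const wanted = key.replace(/^\/+|\/+$/g, '') // stored keys are stripped
  let entry = entries[entries.length - 1]      // start at the latest Entry
  for (;;) {
    const path = pathHashArray(entry.key)
    // Find the first index at which the two path hash arrays differ.
    let i = 0
    while (i < target.length && path[i] === target[i]) i++
    if (i === target.length) {
      // Paths match exactly; verify the key in case of a hash collision.
      if (entry.key === wanted) return entry.value // undefined means deleted
      i = target.length - 1 // collision: follow the last trie array index
    }
    const bucket = entry.trie[i]
    const pointers = bucket && bucket[target[i]]
    if (!pointers || pointers.length === 0) return undefined // not present
    entry = entries[pointers[0]] // descend and repeat from step #3
  }
}
```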
## Binary Trie Encoding
[trie_encoding]: #trie_encoding

The following scheme is used to encode trie data structures (sparse, indexed
arrays of pointers to entries) into a variable-length bytestring as the `trie`
field of an Entry protobuf message.

Consider a trie array with `N` buckets and `M` non-empty buckets (`0 <= M <=
N`). In the encoded field, there will be `M` concatenated bytestrings of the
form:

- trie index (varint), followed by...
- bucket bitfield (packed in a varint), encoding which of the 5 values (4
  values if the index is not a multiple of 32) at this node of the trie have
  pointers, followed by one or more...
- pointer sets, each referencing an entry at (feed index, entry index):
    - feed index (varint, with an extra "more pointers at this value" low bit,
      encoded as `feed_index << 1 | more_bit`)
    - entry index (varint)

In the common case for a small/sparse hyperdb, there will be a small number of
non-empty buckets (small `M`), and usually a single `(feed index, entry
index)` pointer for those non-empty buckets. For a very large/dense hyperdb
(millions of key/value pairs), there will be many non-empty buckets (`M`
approaching `N`), and buckets may have up to the full 4 pointer sets. Even
with millions of entries, hash collisions will be very rare; in those cases
there will be multiple pointers in the same pointer set.

Consider an entry with path hash:

```
[ 1, 1, 0, 0, 3, 1, 2, 3, 3, 1, 1, 1, 2, 2, 1, 1, 1, 0, 2, 3, 3, 0, 1, 2, 1, 1, 2, 3, 0, 0, 2, 1,
  0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 3, 1, 0, 0, 2, 3, 0, 1, 3, 2, 0, 3, 2, 0, 1, 0, 3, 2, 0, 2, 1, 1,
  4 ]
```

and trie:

```
[ , { val: 1, feed: 0, index: 1 } ]
```

In this case `N` is 64 (or you could count it as 2 if you ignore trailing
empty entries) and `M` is 1. There will be a single bytestring chunk:

- index varint is `1` (second element in the trie array)
- bitfield is `0b0010` (varint 2): there is only a pointer set for value 1
  (the second value)
- there is a single pointer in the pointer set, which is (`feed=0 << 1 | 0`,
  `index=1`), or (varint 2, varint 1)

Combined, the `trie` bytestring will be:

```
[0x01, 0x02, 0x02, 0x02]
```

For a more complex example, consider the same path hash, but the trie:

```
[ , { val: 1, feed: 0, index: 1; val: 2, feed: 5, index: 3; val: 3, feed: 6, index: 98 }, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
  , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
  { val: 4, feed: 0, index: 23; val: 4, feed: 1, index: 17 } ]
```

Now `M` is 2. The first bytestring chunk will be:

- index varint is `1` (second element in the trie array)
- bitfield is `0b01110` (varint 9): there are three pointer sets
- first pointer set is (`feed=0 << 1 | 0`, `index=1`) or (varint 2, varint 1)
- second pointer set is (`feed=5 << 1 | 0`, `index=3`) or (varint 10, varint 3)
- third pointer set is (`feed=6 << 1 | 0`, `index=98`) or (varint 12, varint
  98)

The second bytestring chunk would be:

- index varint is `64` (65th element in the trie array; the terminating value)
- bitfield is `0b10000` (varint 1); there is a single pointer set... but it
  contains a hash collision, so there are two pointers
- first pointer is (`feed=0 << 1 | 1`, `index=23`) or (varint 1, varint 23);
  note the `more` bit is set high!
- second pointer is (`feed=1 << 1 | 0`, `index=17`) or (varint 2, varint 17);
  note the `more` bit is low, as usual. In the extremely unlikely case of
  multiple collisions, there could have been more pointers with `more` set
  high preceding this one.

The overall bytestring would be:

```
[0x01, 0x09, 0x02, 0x01, 0x0A, 0x03, 0x0C, 0x62, 0x40, 0x10, 0x01, 0x17, 0x02, 0x11]
```
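The varints above are standard protobuf-style little-endian base-128
("LEB128") varints. A minimal sketch of encoding and decoding in Javascript:

```
// Encode a non-negative integer as a varint: 7 bits per byte, low bits
// first, with the high bit of each byte flagging "more bytes follow".
function encodeVarint (n) {
  const bytes = []
  while (n >= 0x80) {
    bytes.push((n & 0x7f) | 0x80)
    n = Math.floor(n / 128)
  }
  bytes.push(n)
  return Buffer.from(bytes)
}

// Decode a varint from `buf` starting at `offset`; returns the value and
// the offset just past the varint.
function decodeVarint (buf, offset = 0) {
  let value = 0
  let shift = 0
  while (true) {
    const byte = buf[offset++]
    value += (byte & 0x7f) * 2 ** shift
    if ((byte & 0x80) === 0) return { value, offset }
    shift += 7
  }
}

// encodeVarint(98)  => <Buffer 62> (the entry-index pointer above)
// encodeVarint(300) => <Buffer ac 02>
```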
# Examples
[examples]: #examples


## Simple Put and Get

Starting with an empty hyperdb `db`, if we `db.put('/a/b', '24')` we expect to
see a single `Entry` at index 0:

```
{ key: 'a/b',
  value: '24',
  trie:
   [ ] }
```

For reference, the path hash array for this key (index 0) is:

```
[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
  0, 1, 2, 3, 2, 2, 2, 0, 3, 1, 1, 3, 0, 3, 1, 3, 0, 1, 0, 1, 3, 2, 0, 2, 2, 3, 2, 2, 3, 3, 2, 3,
  4 ]
```

Note that the first 64 elements match those of the `/a/b/c` example from the
[path hashing][path_hashing] section, because the first two path components
are the same. Since this is the first entry, the entry index is 0.

Now we `db.put('/a/c', 'hello')` and expect a second Entry:

```
{ key: 'a/c',
  value: 'hello',
  trie:
   [ , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
     , , { val: 2, feed: 0, index: 0 } ] }
```

The path hash array for this key (index 1) is:

```
[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
  0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0,
  4 ]
```

The first 32 elements of the path hash array are common with the first Entry
(the keys share the path prefix `/a`).

`trie` is defined, but mostly sparse. The first 32 elements of common prefix
match the first Entry, and then two additional hash elements (`[0, 1]`) happen
to match as well; there is not a differing entry until index 34
(zero-indexed). At this index there is a reference pointing to the first
Entry. The additional 29 trailing null entries have been trimmed to reduce
metadata overhead.

Next we insert a third node with `db.put('/x/y', 'other')`, and get a third
Entry:

```
{ key: 'x/y',
  value: 'other',
  trie:
   [ , { val: 1, feed: 0, index: 1 } ] }
```

The path hash array for this key (index 2) is:

```
[ 1, 1, 0, 0, 3, 1, 2, 3, 3, 1, 1, 1, 2, 2, 1, 1, 1, 0, 2, 3, 3, 0, 1, 2, 1, 1, 2, 3, 0, 0, 2, 1,
  0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 3, 1, 0, 0, 2, 3, 0, 1, 3, 2, 0, 3, 2, 0, 1, 0, 3, 2, 0, 2, 1, 1,
  4 ]
```
Consider the lookup process for `db.get('/a/b')` (which we expect to
successfully return `'24'`, as written in the first Entry). First we calculate
the path hash array for the key `a/b`, which will be the same as for the first
Entry. Then we take the "latest" Entry, with entry index 2. We compare the
path hash arrays, starting at the first element, and find the first difference
at index 1 (`1 == 1`, then `1 != 2`). We look at index 1 in the current
Entry's `trie` and find a pointer to entry index 1, so we fetch that Entry and
recurse. Comparing path hash arrays, we now get all the way to index 34 before
there is a difference. We again look in the `trie`, find a pointer to entry
index 0, and fetch the first Entry and recurse. Now the path elements match
exactly; we have found the Entry we are looking for, and it has an existent
`value`, so we return the `value`.

Consider a lookup for `db.get('/a/z')`; this key does not exist, so we expect
to return with "key not found". We calculate the path hash array for this key:

```
[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
  1, 2, 3, 0, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 0, 1, 0, 3, 3, 2, 0, 3, 3, 1, 1, 0, 2, 1, 0, 1, 1, 2,
  4 ]
```

Similar to the first lookup, we start with entry index 2 and follow the
pointer to entry index 1. This time, when we compare path hash arrays, the
first differing entry is at array index `32`. There is no `trie` entry at this
index, which tells us that the key does not exist in the database.

## Listing a Prefix

Continuing with the state of the database above, we call `db.list('/a')` to
list all keys with the prefix `/a`.

We generate a path hash array for the key `/a`, without the terminating symbol
(`4`):

```
[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2 ]
```

Using the same process as a `get()` lookup, we find the first Entry that
entirely matches this prefix, which will be entry index 1. If we had failed to
find any Entry with a complete prefix match, then we would return an empty
list of matching keys.

Starting with the first prefix-matching node, we save that key as a match
(unless the Entry is a deletion), then select all `trie` pointers with an
index higher than the prefix length, and recursively inspect all pointed-to
Entries.

## Deleting a Key

Continuing with the state of the database above, we call `db.delete('/a/c')`
to remove that key from the database.

The process is almost entirely the same as inserting a new Entry at that key,
except that the `value` field is left undefined. The new Entry (at entry index
3) is:

```
{ key: 'a/c',
  value: ,
  trie:
   [ , { val: 1, feed: 0, index: 2 }, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
     , , { val: 2, feed: 0, index: 0 } ] }
```

The path hash array for this Entry (key) is:

```
[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
  0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0,
  4 ]
```


# Drawbacks
[drawbacks]: #drawbacks

A backwards-incompatible change will have negative effects on the broader dat
ecosystem: clients will need to support both versions of the protocol for some
time (increasing maintenance burden), future clients may not inter-operate
with old archives, etc. These downsides can partially be avoided by a careful
roll-out.

For the specific use case of Dat archives, hyperdb will trivially increase
metadata size (and thus disk and network consumption) for archives with few
files.
+
+Consider a lookup for `db.get('/a/z')`; this key does not exist, so we expect
+to return with "key not found". We calculate the path hash array for this key:
+
+```
+[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
+  1, 2, 3, 0, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 0, 1, 0, 3, 3, 2, 0, 3, 3, 1, 1, 0, 2, 1, 0, 1, 1, 2,
+  4 ]
+```
+
+Similar to the first lookup, we start with entry index 2 and follow the
+pointer to entry index 1. This time, when we compare path hash arrays, the
+first differing element is at array index `32`. There is no `trie` pointer at
+this index, which tells us that the key does not exist in the database.
+
+## Listing a Prefix
+
+Continuing with the state of the database above, we call `db.list('/a')` to
+list all keys with the prefix `/a`.
+
+We generate a path hash array for the key `/a`, without the terminating
+symbol (`4`):
+
+```
+[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2 ]
+```
+
+Using the same process as a `get()` lookup, we find the first Entry that
+entirely matches this prefix, which will be entry index 1. If we had failed
+to find any Entry with a complete prefix match, we would return an empty list
+of matching keys.
+
+Starting with the first prefix-matching node, we save that key as a match
+(unless the Entry is a deletion), then select all `trie` pointers with an
+index higher than the prefix length, and recursively inspect all pointed-to
+Entries.
+
+## Deleting a Key
+
+Continuing with the state of the database above, we call `db.delete('/a/c')`
+to remove that key from the database.
+
+The process is almost entirely the same as inserting a new Entry at that key,
+except that the `value` field is undefined. The new Entry (at entry index 3)
+is:
+
+```
+{ key: 'a/c',
+  value: ,
+  trie: [ , { element: 1, feed: 0, index: 2 }, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
+   , , { element: 1, feed: 0, index: 0 } ] }
+```
+
+The path hash array for this Entry (key) is:
+
+```
+[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
+  0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0,
+  4 ]
+```
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+A backwards-incompatible change will have negative effects on the broader dat
+ecosystem: clients will need to support both protocol versions for some time
+(increasing maintenance burden), future clients may not inter-operate with
+old archives, and so forth. These downsides can be partially mitigated by a
+careful roll-out.
+
+For the specific use case of Dat archives, hyperdb will slightly increase
+metadata size (and thus disk and network consumption) for archives with few
+files.
+
+
+# Overhead and Scaling
+[overhead]: #overhead
+
+The metadata overhead for a single database entry varies based on the size of
+the database. In a "heavy" case, considering a two-path-segment key with an
+entirely saturated `trie` and `uint32`-sized feed and entry index pointers,
+and ignoring multi-writer fields:
+
+- `trie`: 4 * 2 * 64 bytes = 512 bytes
+- total: 512 bytes
+
+In a "light" case, with few `trie` entries and single-byte varint feed and
+entry index pointers:
+
+- `trie`: 2 * 2 * 4 bytes = 16 bytes
+- total: 16 bytes
+
+For a database with most keys having N path segments, the cost of a `get()`
+scales with the number of entries M as `O(log(M))`, with a best case of 1
+lookup and a worst case of `4 * 32 * N = 128 * N` lookups (for a saturated
+`trie`).
+
+The cost of a `put()` or `delete()` is proportional to the cost of a `get()`.
+
+The cost of a `list()` is linear (`O(M)`) in the number of matching entries,
+plus the cost of a single `get()`.
+
+The total metadata overhead for a database with M entries scales with
+`O(M * log(M))`.
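+
+As a rough sanity check, these bounds can be written out directly. This
+sketch simply restates the assumptions above in code; the constants are the
+per-entry estimates from this section, not measured values.
+
+```
+// Per-entry `trie` overhead under the assumptions stated above
+const heavyTrieBytes = 4 * 2 * 64 // 512: saturated trie, uint32-sized pointers
+const lightTrieBytes = 2 * 2 * 4  // 16: sparse trie, single-byte varints
+
+// Worst-case get() lookups for keys with N path segments (saturated trie)
+function worstCaseLookups (N) {
+  return 4 * 32 * N               // = 128 * N
+}
+```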
+
+
+# Privacy and Security Considerations
+[privacy]: #privacy
+
+The basic key/value semantics of hyperdb (as discussed in this DEP, not
+considering multi-writer changes) are not known to introduce new privacy
+issues when compared with, e.g., storing binary values at key-like paths
+using the current ("legacy") hyperdrive system.
+
+A malicious writer could cause trouble for readers, even readers who do not
+trust the application-level contents of a feed. Implementations which may be
+exposed to arbitrary feeds from unknown sources on the internet should guard
+against the following scenarios:
+
+- A malicious writer may be able to produce very frequent key path hash
+  collisions, which could degrade lookups to linear performance.
+- A malicious writer could send broken trie structures that contain pointer
+  loops, duplicate pointers, or other invalid contents.
+- A malicious writer could write arbitrary data in value fields in an attempt
+  to exploit de-serialization bugs.
+- A malicious writer could leverage non-printing unicode characters to create
+  database entries with user-indistinguishable names (keys).
+
+
+# Rationale and alternatives
+[alternatives]: #alternatives
+
+A major motivator for hyperdb is to improve scaling performance with tens of
+thousands to millions of files per directory, compared to the existing
+hyperdrive implementation. The current implementation requires the most
+recent node in a directory to point to all other nodes in the directory;
+even with pointer compression, this requires `O(N^2)` bytes, while the
+hyperdb implementation scales with `O(N log(N))`.
+
+The hyperdb specification (this document) is complicated by the inclusion of
+new protobuf fields to support "multi-writer" features which are not
+described here. The motivation for including these fields now is to make only
+a single backwards-incompatible schema change, with a second, software-only
+change in the future to enable support for these features. Schema and data
+format changes are considered significantly more "expensive" for the
+community and software ecosystem than software-only changes. Attempts have
+been made in this specification to indicate the safe "single-writer-only"
+values to use for these fields.
+
+
+# Dat migration logistics
+[migration]: #migration
+
+Hyperdb is not backwards compatible with the existing hyperdrive metadata,
+meaning dat clients may need to support both versions during a transition
+period. This applies both to archives saved to disk (e.g., in SLEEP) and to
+archives received from and published to peers over the network.
+
+No changes to the Dat network wire protocol itself are necessary, only
+changes to content passed over the protocol. The Dat `content` feed,
+containing raw file data, is not impacted by hyperdb, only the contents of
+the `metadata` feed.
+
+Upgrading a Dat (hyperdrive) archive to hyperdb will necessitate creating a
+new feed from scratch, meaning new public/private key pairs will be
+generated, and public key URL links will need to change.
+
+Further logistical details are left to the forthcoming Multi-Writer DEP.
+
+
+# Unresolved questions
+[unresolved]: #unresolved-questions
+
+The deletion process needs more thought with respect to listing a path
+prefix: will previously deleted nodes be occluded, or could they show up in
+list results? This should be reviewed (by a non-author of this document)
+before this DEP is accepted as a Draft.
+
+Can the deletion process (currently leaving "tombstone" entries in the `trie`
+forever) be improved, such that these entries don't need to be iterated over?
+mafintosh mentioned this might be in the works. Does this DEP need to "leave
+room" for those changes, or should we call out the potential for future
+change? Probably not; this DEP should only describe existing solutions. This
+can be resolved after Draft status.
+
+There are implied "reasonable" limits on the size (in bytes) of both keys and
+values, but they are not formally specified. Protobuf messages have a hard
+specified limit of 2 GByte (due to 32-bit signed arithmetic), and most
+implementations impose a (configurable) 64 MByte limit. Should this DEP
+impose specific limits on key and value sizes? It would be good to decide
+this before Draft status.
+
+Apart from leaving fields in the protobuf message specification, multi-writer
+concerns are explicitly out of scope for this DEP.
+
+
+# Changelog
+[changelog]: #changelog
+
+As of March 2018, Mathias Buus (@mafintosh) is leading development of a
+hyperdb nodejs module on [github](https://github.com/mafintosh/hyperdb),
+which is the basis for this DEP.
+
+- 2017-12-06: Stephen Whitmore (@noffle) publishes `ARCHITECTURE.md` overview
+  in the [hyperdb github repo][arch_md].
+- 2018-03-04: First draft for review.
+- 2018-03-15: Hyperdb v3.0.0 is released.
+- 2018-04-18: This DEP submitted for Draft review.
+- 2018-05-06: Merged as Draft after WG approval.
+
+[arch_md]: https://github.com/mafintosh/hyperdb/blob/master/ARCHITECTURE.md