From 7dc1759de6afc61acfe8b47e34ab18160e6c803e Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Sun, 4 Mar 2018 14:50:29 -0800 Subject: significantly more progress on hyperdb DEP --- proposals/0000-hyperdb.md | 418 ++++++++++++++++++++++++++++++++++++---------- 1 file changed, 331 insertions(+), 87 deletions(-) (limited to 'proposals') diff --git a/proposals/0000-hyperdb.md b/proposals/0000-hyperdb.md index 2a0656e..9a144b2 100644 --- a/proposals/0000-hyperdb.md +++ b/proposals/0000-hyperdb.md @@ -82,7 +82,9 @@ messages, or a raw `uint64` integer (of either endian-ness). Length is the only form of type or metadata stored about the value; deserialization and validation are left to library and application developers. + ## Core API Semantics +[core_api]: #core_api A `db` is instantiated by opening an existing hypercore with hyperdb content (read-only, or optionally read-write if the secret key is available), or @@ -119,83 +121,213 @@ An example pseudo-code session working with a database might be: db.list('/life/') => ['/life/animal/mammal/kitten', '/life/plant/tree/bananna'] + # Reference Documentation [reference-documentation]: #reference-documentation A HyperDB hypercore feed typically consists of a sequence of entries, all of -which are protobuf-encoded `Node` messages. Higher-level protocols may make +which are protobuf-encoded Node messages. Higher-level protocols may make exception to this: for example, hyperdrive reserves the first entry of the (`metadata`) feed for a special entry that refers to the second (`content`) feed. -The `Node` protobuf message schema is: +The sequence of entries includes an incremental index: the most recent entry in +the feed contains metadata pointers that can be followed to efficiently look up +any key in the database without needing to linear scan the entire history or +generate an independent index data structure. Of course implementations are +free to maintain such an index if they prefer. + +The Node protobuf message schema is: ``` message Node { optional string key = 1; optional bytes value = 2; repeated uint64 clock = 3; - optional bytes trie = 4; - optional uint64 feedSeq = 6; + optional bytes trie = 4; // TODO: actually a {feed, seq} pair + optional bytes path = 5; // TODO: actual type? + optional uint64 seq = 6; + optional bytes feed = 7; // TODO: actual type? } ``` TODO(mafintosh): where is this schema actually defined for the `next` branch of the hyperdb repo? -`key` and `value` are as expected. A non-existant `value` indicates that this -message is a deletion (see below). The `clock` element is reserved for use in -the forthcoming `multi-writer` standard, not described here. An empty list is -the safe (and expected) value for `clock` in single-writer use cases. +- `key`: UTF-8 key that this node describes. Leading and trailing forward + slashes (`/`) are always striped before storing in protobuf. +- `value`: arbitrary byte array. A non-existant `value` entry indicates that + this Node indicates a deletion of the key; this is distinct from specifying + an empty (zero-length) value. +- `clock`: reserved for use in the forthcoming `multi-writer` standard, not + described here. An empty list is the safe (and expected) value for `clock` in + single-writer use cases. +- `trie`: a structured array of pointers to other Node entries in the feed, + used for navigating the tree of keys. +- `path`: a 2-bit hash sequence of `key`'s components. +- `seq`: the zero-indexed sequence number of this Node (hyperdb) entry in + the feed. Note that this may be offset from the hypercore entry index if + there are any leading non-hyperdb entries. +- `feed`: reserved for use with `multi-writer`. The safe single-writer value is + to use the current feed's hypercore public key. + +TODO(mafintosh): should `feed` actually be the current hypercore publickey? + + +## Path Hashing +[path_hashing]: #path_hashing + +Every key path has component-wise fixed-size hash representation that is used +by the trie. This is written to the `path` field each Node protobuf +message. + +The path hash is represented by an array of bytes. Elements are 2-bit encoded +(values 0, 1, 2, 3), except for an optional terminating element which has value +4. Each path element consists of 32 values, representing a 64-bit hash of that +path element. For example, the key `/tree/willow` has two path segments (`tree` +and `willow`), and will be represented by a 65 element path hash array (two 32 +element hashes plus a terminator). + +The hash algorithm used is `SipHash-2-4`, with an 8-byte output and +16-byte key; the input is the UTF-8 encoded path string segment, without any +`/` separators or terminating null bytes. An implementation of this hash +algorithm is included in the libsodium library, under the function +`crypto_shorthash()`. A 16-byte "secret" key is required; for this use case we +use all zeros. + +When converting the 8-bytehash to an array of 2-bit bytes, the ordering is +proceed byte-by-byte, and for each take the two lowest-value bits (aka, `hash & +0x3`) as byte index `0`, the next two bits (aka, `hash & 0xC`) as byte index +`1`, etc. When concatanating path hashes into a longer array, the first +("left-most") path element hash will correspond to byte indexes 0 through 31; +the terminator (`4`) will have the highest byte index. + +For example, consider the key `/tree/willow`. `tree` has a hash `[0xAC, 0xDC, +0x05, 0x6C, 0x63, 0x9D, 0x87, 0xCA]`, which converts into the array: + + [ 0, 3, 2, 2, 0, 3, 1, 3, 1, 1, 0, 0, 0, 3, 2, 1, 3, 0, 2, 1, 1, 3, 1, 2, 3, 1, 0, 2, 2, 2, 0, 3 ] + + +`willow` has a 64-bit hash `[0x72, 0x30, 0x34, 0x39, 0x35, 0xA8, 0x21, 0x44]`, +which converts into the array: + + [ 2, 0, 3, 1, 0, 0, 3, 0, 0, 1, 3, 0, 1, 2, 3, 0, 1, 1, 3, 0, 0, 2, 2, 2, 1, 0, 2, 0, 0, 1, 0, 1 ] + +These combine into the unified byte array with 65 elements: -## Incremental Index + [ 0, 3, 2, 2, 0, 3, 1, 3, 1, 1, 0, 0, 0, 3, 2, 1, 3, 0, 2, 1, 1, 3, 1, 2, 3, 1, 0, 2, 2, 2, 0, 3, + 2, 0, 3, 1, 0, 0, 3, 0, 0, 1, 3, 0, 1, 2, 3, 0, 1, 1, 3, 0, 0, 2, 2, 2, 1, 0, 2, 0, 0, 1, 0, 1, + 4 ] -HyperDB builds an *incremental index* with every new key/value pairs ("nodes") -written. This means a separate data structure doesn't need to be maintained -elsewhere for fast writes and lookups: each node written has enough information -to look up any other key quickly and otherwise navigate the database. +As another example, the key `/a/b/c` converts into the 97-byte hash array: -Each node stores the following basic information: + [ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2, + 0, 1, 2, 3, 2, 2, 2, 0, 3, 1, 1, 3, 0, 3, 1, 3, 0, 1, 0, 1, 3, 2, 0, 2, 2, 3, 2, 2, 3, 3, 2, 3, + 0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0, + 4 ] -- `key`: the key that is being created or modified. e.g. `/home/sww/dev.md` -- `value`: the value stored at that key. -- `feedSeq`: the sequence number of this entry in the hypercore. 0 is the first, 1 - the second, and so forth. -- `path`: a 2-bit hash sequence of the key's components -- `trie`: a navigation structure used with `path` to find a desired key + + + +## Incremental Index Trie +[trie_index]: #trie_index Each node stores a *prefix [trie](https://en.wikipedia.org/wiki/Trie)* that -assists with finding the shortest path to the desired key. - -When a node is written, its *prefix hash* is computed. This done by first -splitting the key into its components (`a`, `b`, and `c` for `/a/b/c`), and then -hashing each component into a 32-character hash, where one character is a 2-bit -value (0, 1, 2, or 3). The `prefix` hash for `/a/b/c` is - -```js -node.path = [ -1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2, -0, 1, 2, 3, 2, 2, 2, 0, 3, 1, 1, 3, 0, 3, 1, 3, 0, 1, 0, 1, 3, 2, 0, 2, 2, 3, 2, 2, 3, 3, 2, 3, -0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0, -4 ] -``` +can be used to look up other keys, or to list all keys with a given prefix. +This is stored under the `trie` field of the Node protobuf message. + +The trie effectively mirrors the `path` hash array. Each element in the `trie` +points to the newest Node which has an identical path up to that specific +prefix location. Elements can be null; any trailing null elements can be left +blank. + +The data structure of the trie is a sparse array of pointers to other Node +entries. Each pointer indicates a feed index and a sequence number; for the +non-multi-writer case, the feed index is always 0, so we consider only the +sequence number (aka, entry index). + +To lookup a key in the database, the recipe is to: + +1. Calculate the `path` array for the key you are looking for. +2. Select the most-recent ("latest") Node for the feed. +3. Compare path arrays. If the paths match exactly, you have found the Node you + were looking for! +4. If not, find the first index at which the two arrays differ, and look up the + corresponding element in this Node's `trie` array. If the element is empty, + then your key does not exist in this hyperdb. +5. If the trie element is not empty, then follow that pointer to select the + next `Node`. Recursively repeat this process from step #3; you will be + descending the `trie` in a search, and will either terminate in the Node you + are looking for, or find that the key is not defined in this hyperdb. + +Similarly, to write a key to the database: + +1. Calculate the `path` array for the key, and start with an empty `trie` of + the same length; you will write to the `trie` array from the current index, + which starts at 0. +2. Select the most-recent ("latest") Node for the feed. +3. Compare path arrays. If the paths match exactly, then you are overwriting + the current Node, and can copy the "remainder" of it's `trie` up to your + current `trie` index. +4. If not, find the first index at which the two arrays differ. Copy all `trie` + elements (empty or not) into the new `trie` for indicies between the + "current index" and the "differing index". +5. Next look up the corresponding element in this Node's `trie` array at the + differing index. If this element is empty, then you have found the most + similar Node. Write a pointer to this node to the `trie` at + the differing index, and you are done (all remaining `trie` elements are + empty, and can be omitted). +6. If the differing tree element has a pointer (is not empty), then follow that + pointer to select the next `Node`. Recursively repeat this process from step + #3. + +To delete a value, follow the same procedure as adding a key, but write the +`Node` without a `value` (in protobuf, this is distinct from having a `value` +field with zero bytes).. -Each component is divided by a newline. `4` is a special value indicating the -end of the prefix. +# Examples +[examples]: #examples -#### Example +## Simple Put and Get -Consider a fresh HyperDB. We write `/a/b = 24` and get back this node: +Starting with an empty HyperDB `db`, if we `db.put('/a/b', '24')` we expect to +see a single `Node`: -```js -{ key: '/a/b', +``` +{ key: 'a/b', value: '24', - trie: [], + trie: + [ ], seq: 0, path: [ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2, @@ -203,16 +335,18 @@ Consider a fresh HyperDB. We write `/a/b = 24` and get back this node: 4 ] } ``` -If you compare this path to the one for `/a/b/c` above, you'll see that the -first 64 2-bit characters match. This is because `/a/b` is a prefix of -`/a/b/c`. Since this is the first entry, `seq` is 0. +Note that the first 64 bytes in `path` match those of the `/a/b/c` example from +the [path hashing][path_hash] section, because the first two path components +are the same. Since this is the first entry, `seq` is 0. -Now we write `/a/c = hello` and get this node: +Now we `db.put('/a/c', 'hello')` and expect a second Node: -```js -{ key: '/a/c', +``` +{ key: 'a/c', value: 'hello', - trie: [ , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , [ , , [ { feed: 0, seq: 0 } ] ] ], + trie: + [ , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , + , , { feed: 0, seq: 0 } ], seq: 1, path: [ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2, @@ -220,47 +354,91 @@ Now we write `/a/c = hello` and get this node: 4 ] } ``` -Its `seq` is 1, since the last was 0. Also, this and the previous node have -the first 32 characters of their `path` in common (the prefix `/a`). +The `seq` is incremented to 1. The first 32 characters of `path` are common +with the first Node (they share a common prefix `/a`). -Notice though that `trie` is set. It's a long but sparse array. It has 35 -entries, with the last one referencing the first node inserted (`a/b/`). Why? +`trie` is defined, but mostly sparse. The first 32 elements of common prefix +match the first Node, and then two additional hash elements (`[0, 1]`) happen +to match as well; there is not a differing entry until index 34 (zero-indexed). +At this entry there is a reference pointing to the first Node. An additional 29 +trailing null entries have been trimmed in reduce metadata overhead. -(If it wasn't stored as a sparse array, you'd actually see 64 entries (the -length of the `path`). But since the other 29 entries are also empty, hyperdb -doesn't bother allocating them.) +Next we insert a third node with `db.put('/x/y', 'other')`, and get a third Node: -If you visually compare this node's `path` with the previous node's `path`, how -many entries do they have in common? At which entry do the 2-bit numbers -diverge? +``` +{ key: 'x/y', + value: 'other', + trie: + [ , { feed: 0, seq: 1 } ], + seq: 2, + path: + [ 1, 1, 0, 0, 3, 1, 2, 3, 3, 1, 1, 1, 2, 2, 1, 1, 1, 0, 2, 3, 3, 0, 1, 2, 1, 1, 2, 3, 0, 0, 2, 1, + 0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 3, 1, 0, 0, 2, 3, 0, 1, 3, 2, 0, 3, 2, 0, 1, 0, 3, 2, 0, 2, 1, 1, + 4 ] } +``` -At the 35th entry. +Consider the lookup-up process for `db.get('/a/b')` (which we expect to +successfully return `'24'`, as written in the first Node). First we calculate +the `path` for the key `a/b`, which will be the same as the first Node. Then we +take the "latest" Node, with `seq=2`. We compare the `path` arrays, starting at +the first element, and find the first difference at index 1 (`1 == 1`, then `1 +!= 2`). We look at index 1 in the current Node's `trie` and find a pointer to +`seq = 1`, so we fetch that Node and recurse. Comparing `path` arrays, we now +get all the way to index 34 before there is a difference. We again look in the +`trie`, find a pointer to `seq = 0`, and fetch the first Node and recurse. Now +the path elements match exactly; we have found the Node we are looking for, and +it has an existant `value`, so we return the `value`. -What this is saying is "if the hash of the key you're looking for differs from -mine on the 35th entry, you want to travel to `{ feed: 0, seq: 0 }` to find the -node you're looking for. +Consider a lookup for `db.get('/a/z')`; this key does not exist, so we expect +to return with "key not found". We calculate the `path` for this key: -This is how finding a node works, starting at any other node: + [ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2, + 1, 2, 3, 0, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 0, 1, 0, 3, 3, 2, 0, 3, 3, 1, 1, 0, 2, 1, 0, 1, 1, 2, + 4 ] -1. Compute the 2-bit hash sequence of the key you're after (e.g. `a/b`) -2. Lookup the newest entry in the feed. -3. Compare its `path` against the hash you just computed. -4. If you discover that the `path` and your hash match, then this is the node - you're looking for! -5. Otherwise, once a 2-bit character from `path` and your hash disagree, note - the index # where they differ and look up that value in the node's `trie`. - Fetch that node at the given feed and sequence number, and go back to step 3. - Repeat until you reach step 4 (match) or there is no entry in the node's trie - for the key you're after (no match). +Similar to the first lookup, we start with `seq = 2` and follow the pointer to +`seq = 1`. This time, when we compare `path` arrays, the first differing entry +is at index `32`. There is no `trie` entry at this index, which tells us that +the key does not exist in the database. -# Examples -[examples]: #examples +## Listing a Prefix + +Continuing with the state of the database above, we call `db.list('/a')` to +list all keys with the prefix `/a`. + +We generate a `path` array for the key `/a`, without the terminating symbol +(`4`): + + [ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2 ] -TODO: lookup a "deep" key +Using the same process as a `get()` lookup, we find the first Node that +entirely matches this prefix, which will be Node `seq = 1`. If we had failed to +find any Node with a complete prefix match, then we would return an empty list +of matching keys. -TODO: deleting a key +Starting with the first prefix-matching node, we save that key as a match +(unless the Node is a deletion), then select all `trie` pointers with an index +higher than the prefix length, and recursively inspect all pointed-to Nodes. -TODO: listing a prefix +## Deleting a Key + +Continuing with the state of the database above, we call `db.delete('/a/c')` to +remove that key from the database. + +The process is almost entirely the same as inserting a new Node at that key, +except that the `value` field is undefined. The new Node (`seq = 3`) is: + +``` +{ key: 'a/c', + value: , + trie: [ , { feed: 0, seq: 2 }, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , + , , { feed: 0, seq: 0 } ], + seq: 3, + path: + [ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2, + 0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0, + 4 ] } +``` # Drawbacks [drawbacks]: #drawbacks @@ -275,12 +453,61 @@ metadata size (and thus disk and network consumption) for archives with few files. +# Overhead and Scaling +[overhead]: #overhead + +The metadata overhead for a single database entry varies based on the size of +the database. In a "heavy" case, considering a two-path-segment key with an +entirely saturated `trie` and `uint32` size feed and sequence pointers, and +ignoring multi-writer fields: + +- `trie`: 4 * 2 * 64 bytes = 512 bytes +- `seq`: 4 bytes +- `path`: 65 bytes +- total: 581 bytes + +In a "light" case, with few `trie` entries and single-byte varint sequence +numbers: + +- `trie`: 2 * 2 * 4 bytes = 16 bytes +- `seqq: 1 byte +- `path`: 65 bytes +- total: 82 + +For a database with most keys having N path segments, the cost of a `get()` +scales with the number of entries M as `O(log(M))` with best case 1 lookup and +worst case `4 * 32 * N = 128 * N` lookups (for a saturated `trie`). + +TODO: prove or verify the above `O(log(M))` intuition + +The cost of a `put()` or `delete()` is proportional to the cost of a `get()`. + +The cost of a `list()` is linear (`O(M)`) in the number of matching entries, +plus the cost of a single `get()`. + +The total metadata overhead for a database with M entries scales with `O(M +* log(M))`. + + # Rationale and alternatives [alternatives]: #alternatives -TODO: scaling issues with existing system; handling many entries +A major motivator for HyperDB is to improve scaling performance with tens of +thousands through millions of files per directory in the existing hyperdrive +implementation. The current implementation requires the most recent node in a +directory to point to all other nodes in the directory. Even with pointer +compression, this requires on the order of `O(N^2)` bytes; the HyperDB +implementation scales with `O(N log(N))`. -TODO: why multi-writer fields included here +The HyperDB specification (this document) is complicated by the inclusion of +new protobuf fields to support "multi-writer" features which are not described +here. The motivation to include these fields now to make only a single +backwards-incompatible schema change, and to make a second software-only change +in the future to enable support for these features. Schema and data format +changes are considered significantly more "expensive" for the community and +software ecosystem compared to software-only changes. Attempts have been made +in this specification to indicate the safe "single-writer-only" values to use +for these fields. # Dat migration logistics @@ -300,14 +527,31 @@ Upgrading a Dat (hyperdrive) archive to HyperDB will necessitate creating a new feed from scratch, meaning new public/private key pairs, and that public key URL links will need to change. +Further logistical details are left to the forthcoming Multi-Writer DEP. + # Unresolved questions [unresolved]: #unresolved-questions -Details of keys to work out (before Draft status): - -- must a prefix end in a trailing slash, or optional? -- can there be, eg, keys `/a/b` and `/a/b/c` at the same time? +How are hash collisions handled? Eg, two keys which have the same `path` hash +array. How unlikely is this situation? Eg, how many keys in a database before +the birthday paradox results in a realistic chance of a collision? How should +implementations behave if they detect a collision (matching path but not +matching key)? + +Need to think through deletion process with respect to listing a path prefix; +will previously deleted nodes be occulded, or potentially show up in list +results? + +Are the "deletion" semantics here correct, in that deletion Nodes persist as +"tombstones", or should deleted keys be omitted from future `trie` fields to +remove their presence? + +Should all-zeros really be used for path hashing, or should we use the +hypercore feed public key? It certainly makes implementation and documentation +simpler to use all-zeros, and potentially makes it easier to copy or migrate +HyperDB content between hypercore feeds. Referencing the feed public key breaks +abstraction layers and the separation of concerns. There are implied "reasonable" limits on the size (in bytes) of both keys and values, but they are not formally specified. Protobuf messages have a hard -- cgit v1.2.3