diff options
-rw-r--r-- | README.md | 2 | ||||
-rw-r--r-- | proposals/0004-hyperdb.md | 760 | ||||
-rw-r--r-- | proposals/0005-dns.md | 169 |
3 files changed, 930 insertions, 1 deletions
@@ -16,7 +16,7 @@ at [docs.datproject.org][docs] is specifically written with end-users and application developers in mind. [github-deps]: https://github.com/datprotocol/DEPs/ -[docs]: https://docs.datproject.com +[docs]: https://docs.datproject.org ## The Process diff --git a/proposals/0004-hyperdb.md b/proposals/0004-hyperdb.md new file mode 100644 index 0000000..d551b64 --- /dev/null +++ b/proposals/0004-hyperdb.md @@ -0,0 +1,760 @@ + +Title: **DEP-0004: Hyperdb** + +Short Name: `0004-hyperdb` + +Type: Standard + +Status: Draft (as of 2018-05-06) + +Github PR: [Draft](https://github.com/datprotocol/DEPs/pull/3) + +Authors: +[Bryan Newbold](https://github.com/bnewbold), +[Stephen Whitmore](https://github.com/noffle), +[Mathias Buus](https://github.com/mafintosh) + + +# Summary +[summary]: #summary + +Hyperdb is an abstraction layer providing a general purpose distributed +key/value store over the hypercore protocol. It is an iteration on the +hyperdrive directory tree implementation, building on top of the hypercore +append-only log abstraction layer. Keys are path-like strings (e.g., +`/food/fruit/kiwi`), and values are arbitrary binary blobs (generally under a +megabyte). + +Hyperdrive (used by the Dat application) is expected to be re-implemented on +top of hyperdb for improved performance with many files (e.g., millions). The +hyperdrive API should be largely unchanged, but the `metadata` format will be +backwards-incompatible. + + +# Motivation +[motivation]: #motivation + +Hyperdb is expected to drastically improve performance of dat clients when +working with archives containing tens of thousands of files in single +directories. This is a real-world bottleneck for several current users, with +basic local actions such as adding a directory taking an unacceptably long time +to complete. + +A secondary benefit is to refactor the [trie][trie]-structured key/value API +out of hyperdrive, allowing third party code to build applications directly on +this abstraction layer. + +[trie]: https://en.wikipedia.org/wiki/Trie + + +# Usage Documentation +[usage-documentation]: #usage-documentation + +*This section describes Hyperdb's interface and behavior in the abstract for +application programmers. It is not intended to be exact documentation of any +particular implementation (including the reference Javascript module).* + +Hyperdb is structured to be used much like a traditional hierarchical +filesystem. A value can be written and read at locations like `/foo/bar/baz`, +and the API supports querying or tracking values at subpaths, like how watching +for changes on `/foo/bar` will report both changes to `/foo/bar/baz` and also +`/foo/bar/19`. + +Lower-level details of the hypercore append-only log, disk serialization, and +networked synchronization features that Hyperdb builds on top of are not +described in detail here; see the [DEP repository][deps]. Multi-writer +hypercore semantics are also not discussed in this DEP. + +[deps]: https://github.com/datprotocol/DEPs + +A Hyperdb database instance can be represented by a single hypercore feed (or +several feeds in a multi-writer context), and is named, referenced, and +discovered using the public and discovery keys of the hypercore feed (or the +original feed if there are several). In a single-writer configuration, only a +single node (holding the secret key) can mutate the database (e.g., via `put` or +`delete` actions). + +**Keys** can be any UTF-8 string. Path segments are separated by the forward +slash character (`/`). Repeated slashes (`//`) are disallowed. Leading and +trailing `/` are optional in application code: `/hello` and `hello` are +equivalent. A key can be both a "path segment" and key at the same time; e.g., +`/a/b/c` and `/a/b` can both be keys at the same time. + +**Values** can be any binary blob, including empty (of zero length). For +example, values could be UTF-8 encoded strings, JSON encoded objects, protobuf +messages, or a raw `uint64` integer (of either endian-ness). Length is the only +form of type or metadata stored about the value; deserialization and validation +are left to library and application developers. + + +## Core API Semantics +[core_api]: #core_api + +A `db` is instantiated by opening an existing hypercore with hyperdb content +(read-only, or optionally read-write if the secret key is available), or +creating a new one. A handle could represent any specific revision in history, +or the "latest" revision. + +`db.put(key, value)`: inserts `value` (arbitrary bytes) under the path `key`. +Requires read-write access. Returns an error (e.g., via callback) if there was a +problem. + +`db.get(key)`: Reading a non-existent `key` is an error. Read-only. + +`db.delete(key)`: Removes the key from the database. Deleting a non-existent +key is an error. Requires read-write access. + +`db.list(prefix)`: returns a flat (not nested) list of all keys currently in +the database under the given prefix. Prefixes operate on a path-segment basis: +`/ab` is not a valid prefix for key `/abcd`, but is valid for `/ab/cd`. If the +prefix does not exist, returns an empty list. The order of returned keys is +implementation (or configuration) specific. Default listing is recursive +(implementations may have a flag to control this behavior). Read-only. + +If the hypercore underlying a hyperdb is only partially replicated, behavior is +implementation-specific. For example, a `get()` call could block until the +relevant value is replicated, or the implementation could return an error. + +An example pseudo-code session working with a database might be: + + db.put('/life/animal/mammal/kitten', '{"cuteness": 500.3}') + db.put('/life/plant/bush/banana', '{"delicious": 103.4}') + db.delete('/life/plant/bush/banana') + db.put('/life/plant/tree/banana', '{"delicious": 103.4}') + db.get('/life/animal/mammal/kitten') + => {"cuteness": 500.3} + db.list('/life/') + => ['/life/animal/mammal/kitten', '/life/plant/tree/banana'] + + +# Reference Documentation +[reference-documentation]: #reference-documentation + +A hyperdb hypercore feed typically consists of a sequence of protobuf-encoded +messages of "Entry" type. Higher-level protocols may make exception to this, +for example by prepending an application-specific metadata message as the first +entry in the feed. There is sometimes a second "content" feed associated with +the primary hyperdb key/value feed, to store data that does not fit in the +(limited) `value` size constraint. + +The sequence of entries includes an incremental index: the most recent entry in +the feed contains metadata pointers that can be followed to efficiently look up +any key in the database without needing to linear scan the entire history or +generate an independent index data structure. Implementations are, of course, +free to maintain their own index if they prefer. + +The Entry protobuf message schema is: + + message Entry { + message Feed { + required bytes key = 1; + } + + required string key = 1; + optional bytes value = 2; + required bytes trie = 3; + repeated uint64 clock = 4; + optional uint64 inflate = 5; + repeated Feed feeds = 6; + optional bytes contentFeed = 7; + } + +Some fields are are specific to the multi-writer features described in their +own DEP and mentioned only partially here. The fields common to both message +types are: + +- `key`: UTF-8 key that this node describes. Leading and trailing forward + slashes (`/`) are always stripped before storing in protobuf. +- `value`: arbitrary byte array. A non-existent `value` entry indicates that + this Entry indicates a deletion of the key; this is distinct from specifying + an empty (zero-length) value. +- `trie`: a structured array of pointers to other Entry entries in the feed, + used for navigating the tree of keys. +- `clock`: reserved for use in the forthcoming `multi-writer` standard. An + empty list is the safe (and expected) value for `clock` in single-writer use + cases. +- `inflate`: a "pointer" (reference to a feed index number) to the most recent + entry in the feed that has the optional `feeds` and `contentFeed` fields set. + Not defined if `feeds` and `contentFeed` are defined in the same message. +- `feeds`: reserved for use with `multi-writer`. The safe single-writer value is + to use the current feed's hypercore public key. +- `contentFeed`: for applications which require a parallel "content" hypercore + feed for larger data, this field can be used to store the 32-byte public key + for that feed. This field should have a single value for the entire history + of the feed (aka, it is not mutable). + +For the case of a single-writer feed, not using multi-writer features, it is +sufficient to write a single Entry message as the first entry in the hypercore +feed, with `feeds` containing a single entry (a pointer to the current feed +itself), and `contentFeed` optionally set to a pointer to a paired content +feed. + +If *either* `feeds` *or* `contentFeed` are defined in an entry, *both* fields +must be defined, so the new message can be referred to via `inflated`. In this +case the entry is refereed to as an "inflated entry". + + +## Path Hashing +[path_hashing]: #path_hashing + +Every key path has component-wise fixed-size hash representation that is used +by the trie. The concatenation of all path hashes yields a "path hash array" +for the entire key. Note that analogously to a hash map data structure, there +can be more than one key (string) with the same key hash in the same database +with no problems: the hash points to a linked-list "bucket" of Entries, which +can be iterated over linearly to find the correct value. + +The path hash is represented by an array of bytes. Elements are 2-bit encoded +(values 0, 1, 2, 3), except for an optional terminating element which has value +4. Each path element consists of 32 values, representing a 64-bit hash of that +path element. For example, the key `/tree/willow` has two path segments (`tree` +and `willow`), and will be represented by a 65 element path hash array (two 32 +element hashes plus a terminator). + +The hash algorithm used is `SipHash-2-4`, with an 8-byte output and +16-byte key; the input is the UTF-8 encoded path string segment, without any +`/` separators or terminating null bytes. An implementation of this hash +algorithm is included in the libsodium library in the form of the +`crypto_shorthash()` function. A 16-byte "secret" key is required; for this use +case we use all zeros. + +When converting the 8-bytehash to an array of 2-bit bytes, the ordering is +proceed byte-by-byte, and for each take the two lowest-value bits (aka, `hash & +0x3`) as byte index `0`, the next two bits (aka, `hash & 0xC`) as byte index +`1`, etc. When concatenating path hashes into a longer array, the first +("left-most") path element hash will correspond to byte indexes 0 through 31; +the terminator (`4`) will have the highest byte index. + +For example, consider the key `/tree/willow`. `tree` has a hash `[0xAC, 0xDC, +0x05, 0x6C, 0x63, 0x9D, 0x87, 0xCA]`, which converts into the array: + +``` +[ 0, 3, 2, 2, 0, 3, 1, 3, 1, 1, 0, 0, 0, 3, 2, 1, 3, 0, 2, 1, 1, 3, 1, 2, 3, 1, 0, 2, 2, 2, 0, 3 ] +``` + + +`willow` has a 64-bit hash `[0x72, 0x30, 0x34, 0x39, 0x35, 0xA8, 0x21, 0x44]`, +which converts into the array: + +``` +[ 2, 0, 3, 1, 0, 0, 3, 0, 0, 1, 3, 0, 1, 2, 3, 0, 1, 1, 3, 0, 0, 2, 2, 2, 1, 0, 2, 0, 0, 1, 0, 1 ] +``` + +These combine into the unified byte array with 65 elements: + +``` +[ 0, 3, 2, 2, 0, 3, 1, 3, 1, 1, 0, 0, 0, 3, 2, 1, 3, 0, 2, 1, 1, 3, 1, 2, 3, 1, 0, 2, 2, 2, 0, 3, + 2, 0, 3, 1, 0, 0, 3, 0, 0, 1, 3, 0, 1, 2, 3, 0, 1, 1, 3, 0, 0, 2, 2, 2, 1, 0, 2, 0, 0, 1, 0, 1, + 4 ] +``` + +As another example, the key `/a/b/c` converts into the 97-byte hash array: + +``` +[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2, + 0, 1, 2, 3, 2, 2, 2, 0, 3, 1, 1, 3, 0, 3, 1, 3, 0, 1, 0, 1, 3, 2, 0, 2, 2, 3, 2, 2, 3, 3, 2, 3, + 0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0, + 4 ] +``` + +Note that "hash collisions" are rare with this hashing scheme, but are likely +to occur with large databases (millions of keys), and collision have been +generated as a proof of concept. Implementations should take care to properly +handle collisions by verifying keys and following bucket pointers (see the next +section). + +An example hash collision (useful for testing; thanks to Github user +`dcposch`): + + /mpomeiehc + /idgcmnmna + +<!-- + +Generation code (javascript) for the above: + + var sodium = require('sodium-universal') + var toBuffer = require('to-buffer') + + var KEY = Buffer.alloc(16) + var OUT = Buffer.alloc(8) + + sodium.crypto_shorthash(OUT, toBuffer('tree'), KEY) + console.log("tree: ", OUT) + console.log(hash('tree', true)) + + sodium.crypto_shorthash(OUT, toBuffer('willow'), KEY) + console.log("willow: ", OUT) + console.log(hash('willow', true)) + + sodium.crypto_shorthash(OUT, toBuffer('a'), KEY) + console.log("a: ", OUT) + console.log(hash('a', true)) + +Then, to manually "expand" arrays in python3: + + hash_array = [0x72, 0x30, 0x34, 0x39, 0x35, 0xA8, 0x21, 0x44] + path = [] + tmp = [(x & 0x3, (x >> 2) & 0x3, (x >> 4) & 0x3, (x >> 6) & 0x3) for x in hash_array] + [path.extend(e) for e in tmp] + path + +--> + + +## Incremental Index Trie +[trie_index]: #trie_index + +Each node stores a *prefix [trie](https://en.wikipedia.org/wiki/Trie)* that +can be used to look up other keys, or to list all keys with a given prefix. +This is stored under the `trie` field of the Entry protobuf message. + +The trie effectively mirrors the path hash array. Each element in the `trie` +is called a "bucket". Each non-empty bucket points to the newest Entries which +have an identical path up to that specific prefix location; because the trie +has 4 "values" at each node, there can be pointers to up to 3 other values at a +given element in the trie array. Buckets can be empty if there are no nodes +with path hashes that differ for the first time the given bucket (aka, there +are no "branches" at this node in the trie). Only non-null elements will be +transmitted or stored on disk. + +The data structure of the trie is a sparse array of pointers to other Entry +entries. Each pointer indicates a feed index and an entry index pointer, and is +associated with a 2-bit value; for the non-multi-writer case, the feed index is +always 0, so we consider only the entry index. + +For a `trie` with `N` buckets, each may have zero or more pointers. Typically +there are a maximum of 3 pointers per bucket, corresponding to the 4 possible +values minus the current Entry's value, but in the event of hash collisions (in +the path array space), there may be multiple pointers in the same bucket +corresponding to the same value. + +To lookup a key in the database, the recipe is to: + +1. Calculate the path hash array for the key you are looking for. +2. Select the most-recent ("latest") Entry for the feed. +3. Compare path hash arrays. If the paths match exactly, compare keys; they + match, you have found the you were looking for! Check whether the `value` is + defined or not; if not, this Entry represents that the key was deleted from + the database. +4. If the paths match, but not the keys, look for a pointer in the last `trie` + array index, and iterate from step #3 with the new Entry. +5. If the paths don't entirely match, find the first index at which the two + arrays differ, and look up the corresponding element in this Entry's `trie` + array. If the element is empty, or doesn't have a pointer corresponding to + your 2-bit value, then your key does not exist in this hyperdb. +6. If the trie element is not empty, then follow that pointer to select the + next Entry. Recursively repeat this process from step #3; you will be + descending the `trie` in a search, and will either terminate in the Entry you + are looking for, or find that the key is not defined in this hyperdb. + +Similarly, to write a key to the database: + +1. Calculate the path hash array for the key, and start with an empty `trie` of + the same length; you will write to the `trie` array from the current index, + which starts at 0. +2. Select the most-recent ("latest") Entry for the feed. +3. Compare path hash arrays. If the paths match exactly, and the keys match, then + you are overwriting the current Entry, and can copy the "remainder" of it's + `trie` up to your current `trie` index. +4. If the paths match, but not the keys, you are adding a new key to an + existing hash bucket. Copy the `trie` and extend it to the full length. Add + a pointer to the Entry with the same hash at the final array index. +5. If the paths don't entirely match, find the first index at which the two + arrays differ. Copy all `trie` elements (empty or not) into the new `trie` + for indices between the "current index" and the "differing index". +6. Next look up the corresponding element in this Entry's `trie` array at the + differing index. If this element is empty, then you have found the most + similar Entry. Write a pointer to this node to the `trie` at + the differing index, and you are done (all remaining `trie` elements are + empty, and can be omitted). +7. If the differing tree element has a pointer (is not empty), then follow that + pointer to select the next `Entry`. Recursively repeat this process from step + #3. + +To delete a value, follow the same procedure as adding a key, but write the +`Entry` without a `value` (in protobuf, this is distinct from having a `value` +field with zero bytes). Deletion nodes may persist in the database forever. + + +## Binary Trie Encoding +[trie_encoding]: #trie_encoding + +The following scheme is used to encode trie data structures (sparse, indexed +arrays of pointers to entries) into a variable-length bytestring as the `trie` +field of an Entry protobuf message. + +Consider a trie array with `N` buckets and `M` non-empty buckets (`0 <= M <= +N`). In the encoded field, there will be `M` concatenated bytestrings of the +form: + +- trie index (varint), followed by... +- bucket bitfield (packed in a varint), encoding which of the 5 values (4 + values if the index is not modulo 32) at this node of the trie have pointers, + followed by one or more... +- pointer sets, each referencing an entry at (feed index, entry index): + - feed index (varint, with a extra "more pointers at this value" low bit, + encoded as `feed_index << 1 | more_bit`) + - entry index (varint) + +In the common case for a small/sparse hyperdb, there will a small number of +non-empty buckets (small `M`), a usually a single `(feed index, entry index)` +pointer for those non-empty buckets. For a very large/dense hyperdb (millions +of key/value pairs), there will be many non-empty buckets (`M` approaching +`N`), and buckets may have up to the full 4 pointer sets. Even with millions of +entries, hash collisions will be very rare; in those cases there will be +multiple pointers in the same pointer set. + +Consider an entry with path hash: + +``` +[ 1, 1, 0, 0, 3, 1, 2, 3, 3, 1, 1, 1, 2, 2, 1, 1, 1, 0, 2, 3, 3, 0, 1, 2, 1, 1, 2, 3, 0, 0, 2, 1, + 0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 3, 1, 0, 0, 2, 3, 0, 1, 3, 2, 0, 3, 2, 0, 1, 0, 3, 2, 0, 2, 1, 1, + 4 ] +``` + +and trie: + +``` +[ , { val: 1, feed: 0, index: 1 } ] +``` + +In this case `N` is 64 (or you could count as 2 if you ignore trailing empty +entries) and `M` is 1. There will be a single bytestring chunk: + +- index varint is `1` (second element in the trie array) +- bitfield is `0b0010` (varint 2): there is only a pointer set for value 1 (the second value) +- there is a single pointer in the pointer set, which is (`feed=0 << 1 | 0`, + `index=1`), or (varint 2, varint 1) + +Combined, the `trie` bytestring will be: + +``` +[0x01, 0x02, 0x02, 0x02] +``` + +For a more complex example, consider the same path hash, but the trie: + +``` +[ , { val: 1, feed: 0, index: 1; val: 2, feed: 5, index: 3; val: 3, feed: 6, index: 98 }, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , + , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , + { val: 4, feed: 0, index, 23; val: 4, feed: 1, index: 17 ] +``` + +Now `M` is 2. The first bytestring chunk will be: + +- index varint is `1` (second element in the trie array) +- bitfield is `0b01110` (varint 9): there are three pointer sets +- first pointer set is (`feed=0 << 1 | 0`, `index=1`) or (varint 2, varint 1) +- second pointer set is (`feed=5 << 1 | 0`, `index=3`) or (varint 10, varint 3) +- third pointer set is (`feed=6 << 1 | 0`, `index=98`) or (varint 12, varint 98) + +the second bytestring chunk would be: + +- index varint is `64` (65th element in the trie array; the terminating value) +- bitfield is `0b10000` (varint 1); there is a single pointer set... but it + contains a hash collision, so there are two pointers +- first pointer is (`feed=0 << 1 | 1`, `index=23`) or (varint 1, varint=23); + note the `more` bit is set high! +- second pointer is (`feed=1 << 1 | 0`, `index=17`) or (varint 2, varint 17); + note the `more` bit is low, as usual. In the extremely unlikely case of + multiple collisions there could have been more pointers with `more` high + preceding this one. + +The overall bytestring would be: + +``` +[0x01, 0x09, 0x02, 0x01, 0x0A, 0x03, 0x0C, 0x62, 0x40, 0x10, 0x01, 0x17, 0x02, 0x11] +``` + + +# Examples +[examples]: #examples + + +## Simple Put and Get + +Starting with an empty hyperdb `db`, if we `db.put('/a/b', '24')` we expect to +see a single `Entry` and index 0: + +``` +{ key: 'a/b', + value: '24', + trie: + [ ] } +``` + +For reference, the path hash array for this key (index 0) is: + +``` +[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2, + 0, 1, 2, 3, 2, 2, 2, 0, 3, 1, 1, 3, 0, 3, 1, 3, 0, 1, 0, 1, 3, 2, 0, 2, 2, 3, 2, 2, 3, 3, 2, 3, + 4 ] +``` + +Note that the first 64 bytes in path match those of the `/a/b/c` example from +the [path hashing][path_hash] section, because the first two path components +are the same. Since this is the first entry, the entry index is 0. + +Now we `db.put('/a/c', 'hello')` and expect a second Entry: + +``` +{ key: 'a/c', + value: 'hello', + trie: + [ , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , + , , { element: 2, feed: 0, index: 0 } ] } +``` + +The path hash array for this key (index 1) is: + +``` +[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2, + 0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0, + 4 ] +``` + +The first 32 characters of path are common with the first Entry (they share a +common prefix, `/a`). + +`trie` is defined, but mostly sparse. The first 32 elements of common prefix +match the first Entry, and then two additional hash elements (`[0, 1]`) happen +to match as well; there is not a differing entry until index 34 (zero-indexed). +At this entry there is a reference pointing to the first Entry. An additional 29 +trailing null entries have been trimmed in reduce metadata overhead. + +Next we insert a third node with `db.put('/x/y', 'other')`, and get a third Entry: + +``` +{ key: 'x/y', + value: 'other', + trie: + [ , { val: 1, feed: 0, index: 1 } ], +``` + +The path hash array for this key (index 2) is: + +``` +[ 1, 1, 0, 0, 3, 1, 2, 3, 3, 1, 1, 1, 2, 2, 1, 1, 1, 0, 2, 3, 3, 0, 1, 2, 1, 1, 2, 3, 0, 0, 2, 1, + 0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 3, 1, 0, 0, 2, 3, 0, 1, 3, 2, 0, 3, 2, 0, 1, 0, 3, 2, 0, 2, 1, 1, + 4 ] +``` + +Consider the lookup-up process for `db.get('/a/b')` (which we expect to +successfully return `'24'`, as written in the first Entry). First we calculate +the path for the key `a/b`, which will be the same as the first Entry. Then we +take the "latest" Entry, with entry index 2. We compare the path hash arrays, +starting at the first element, and find the first difference at index 1 (`1 == +1`, then `1 != 2`). We look at index 1 in the current Entry's `trie` and find a +pointer to entry index 1, so we fetch that Entry and recurse. Comparing path +hash arrays, we now get all the way to index 34 before there is a difference. +We again look in the `trie`, find a pointer to entry index 0, and fetch the +first Entry and recurse. Now the path elements match exactly; we have found the +Entry we are looking for, and it has an existent `value`, so we return the +`value`. + +Consider a lookup for `db.get('/a/z')`; this key does not exist, so we expect +to return with "key not found". We calculate the path hash array for this key: + +``` +[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2, + 1, 2, 3, 0, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 0, 1, 0, 3, 3, 2, 0, 3, 3, 1, 1, 0, 2, 1, 0, 1, 1, 2, + 4 ] +``` + +Similar to the first lookup, we start with entry index 2 and follow the pointer to +entry index 1. This time, when we compare path hash arrays, the first differing +entry is at array index `32`. There is no `trie` entry at this index, which +tells us that the key does not exist in the database. + +## Listing a Prefix + +Continuing with the state of the database above, we call `db.list('/a')` to +list all keys with the prefix `/a`. + +We generate a path hash array for the key `/a`, without the terminating symbol +(`4`): + +``` +[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2 ] +``` + +Using the same process as a `get()` lookup, we find the first Entry that +entirely matches this prefix, which will be entry index 1. If we had failed to +find any Entry with a complete prefix match, then we would return an empty list +of matching keys. + +Starting with the first prefix-matching node, we save that key as a match +(unless the Entry is a deletion), then select all `trie` pointers with an index +higher than the prefix length, and recursively inspect all pointed-to Entries. + +## Deleting a Key + +Continuing with the state of the database above, we call `db.delete('/a/c')` to +remove that key from the database. + +The process is almost entirely the same as inserting a new Entry at that key, +except that the `value` field is undefined. The new Entry (at entry index 3) +is: + +``` +{ key: 'a/c', + value: , + trie: [ , { val: 1, feed: 0, index: 2 }, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , + , , { val: 1, feed: 0, index: 0 } ] } +``` + +The path hash array for this Entry (key) is: + +``` +[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2, + 0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0, + 4 ] +``` + + +# Drawbacks +[drawbacks]: #drawbacks + +A backwards-incompatible change will have negative effects on the broader dat +ecosystem: clients will need to support both versions protocol for some time +(increasing maintenance burden), future clients may not inter-operate with old +archives, etc. These downsides can partially be avoided by careful roll-out. + +For the specific use case of Dat archives, hyperdb will trivially increase +metadata size (and thus disk and network consumption) for archives with few +files. + + +# Overhead and Scaling +[overhead]: #overhead + +The metadata overhead for a single database entry varies based on the size of +the database. In a "heavy" case, considering a two-path-segment key with an +entirely saturated `trie` and `uint32`-sized feed and entry index pointers, and +ignoring multi-writer fields: + +- `trie`: 4 * 2 * 64 bytes = 512 bytes +- total: 512 bytes + +In a "light" case, with few `trie` entries and single-byte varint feed and +entry index pointers: + +- `trie`: 2 * 2 * 4 bytes = 16 bytes +- total: 16 + +For a database with most keys having N path segments, the cost of a `get()` +scales with the number of entries M as `O(log(M))` with best case 1 lookup and +worst case `4 * 32 * N = 128 * N` lookups (for a saturated `trie`). + +The cost of a `put()` or `delete()` is proportional to the cost of a `get()`. + +The cost of a `list()` is linear (`O(M)`) in the number of matching entries, +plus the cost of a single `get()`. + +The total metadata overhead for a database with M entries scales with `O(M +* log(M))`. + + +# Privacy and Security Considerations +[privacy]: #privacy + +The basic key/value semantics of hyperdb (as discussed in this DEP, not +considering multi-writer changes) are not known to introduce new privacy issues +when compared with, e.g., storing binary values at key-like paths using the +current ("legacy") hyperdrive system. + +A malicious writer could cause trouble for readers, even readers who do not +trust the application-level contents of a feed. Implementations which may be +exposed to arbitrary feeds from unknown sources on the internet should take +care to the following scenarios: A malicious writer may be able to produce very +frequent key path hash collisions, which could degrade to linear performance. A +malicious writer could send broken trie structures that contain pointer loops, +duplicate pointers, or other invalid contents. A malicious writer could write +arbitrary data in value fields in an attempt to exploit de-serialization bugs. +A malicious writer could leverage non-printing unicode characters to create +database entries with user-indistinguishable names (keys). + + +# Rationale and alternatives +[alternatives]: #alternatives + +A major motivator for hyperdb is to improve scaling performance with tens of +thousands through millions of files per directory in the existing hyperdrive +implementation. The current implementation requires the most recent node in a +directory to point to all other nodes in the directory. Even with pointer +compression, this requires on the order of `O(N^2)` bytes; the hyperdb +implementation scales with `O(N log(N))`. + +The hyperdb specification (this document) is complicated by the inclusion of +new protobuf fields to support "multi-writer" features which are not described +here. The motivation to include these fields now to make only a single +backwards-incompatible schema change, and to make a second software-only change +in the future to enable support for these features. Schema and data format +changes are considered significantly more "expensive" for the community and +software ecosystem compared to software-only changes. Attempts have been made +in this specification to indicate the safe "single-writer-only" values to use +for these fields. + + +# Dat migration logistics +[migration]: #migration + +Hyperdb is not backwards compatible with the existing hyperdrive metadata, +meaning dat clients may need to support both versions during a transition +period. This applies both to archives saved to disk (e,g., in SLEEP) and to +archives received and published to peers over the network. + +No changes to the Dat network wire protocol itself are necessary, only changes +to content passed over the protocol. The Dat `content` feed, containing raw +file data, is not impacted by hyperdb, only the contents of the `metadata` +feed. + +Upgrading a Dat (hyperdrive) archive to hyperdb will necessitate creating a new +feed from scratch, meaning new public/private key pairs, and that public key +URL links will need to change. + +Further logistical details are left to the forthcoming Multi-Writer DEP. + + +# Unresolved questions +[unresolved]: #unresolved-questions + +Need to think through deletion process with respect to listing a path prefix; +will previously deleted nodes be occluded, or potentially show up in list +results? Should be reviewed (by a non-author of this document) before accepted +as a Draft. + +Can the deletion process (currently leaving "tombstone" entries in the `trie` +forever) be improved, such that these entries don't need to be iterated over? +mafintosh mentioned this might be in the works. Does this DEP need to "leave +room" for those changes, or should we call out the potential for future change? +Probably not, should only describe existing solutions. This can be resolved +after Draft. + +There are implied "reasonable" limits on the size (in bytes) of both keys and +values, but they are not formally specified. Protobuf messages have a hard +specified limit of 2 GByte (due to 32-bit signed arithmetic), and most +implementations impose a (configurable) 64 MByte limit. Should this DEP impose +specific limits on key and value sizes? Would be good to decide before Draft +status. + +Apart from leaving fields in the protobuf message specification, multi-writer +concerns are explicitly out of scope for this DEP. + + +# Changelog +[changelog]: #changelog + +As of March 2018, Mathias Buus (@mafintosh) is leading development of a hyperdb +nodejs module on [github](https://github.com/mafintosh/hyperdb), which is the +basis for this DEP. + +- 2017-12-06: Stephen Whitmore (@noffle) publishes `ARCHITECTURE.md` overview + in the [hyperdb github repo][arch_md] +- 2018-03-04: First draft for review +- 2018-03-15: Hyperdb v3.0.0 is released +- 2018-04-18: This DEP submitted for Draft review. +- 2018-05-06: Merged as Draft after WG approval. + +[arch_md]: https://github.com/mafintosh/hyperdb/blob/master/ARCHITECTURE.md diff --git a/proposals/0005-dns.md b/proposals/0005-dns.md new file mode 100644 index 0000000..eb147c5 --- /dev/null +++ b/proposals/0005-dns.md @@ -0,0 +1,169 @@ + +Title: **DEP-0005: DNS** + +Short Name: `0005-dns` + +Type: Informative + +Status: Draft (as of 2018-04-27) + +Github PR: [PR#25](https://github.com/datprotocol/DEPs/pull/25) + +Authors: Paul Frazee + + +# Summary +[summary]: #summary + +Dat's data structures (HyperCore, HyperDB, and HyperDrive) are addressed using +cryptographic keys. In the context of Web browsers, a URL scheme is used which +is structured as `'dat://' {key} '/' {path...}`. + +This DEP describes an additional protocol for addressing Dats using DNS short names. + + +# Motivation +[motivation]: #motivation + +The cryptographic keys which address Dats are secure and global, but not human-readable. This creates a usability problem. + +The goal of this DEP is to leverage DNS to provide human-readable shortnames which map to Dat's cryptographic addresses. The solution... + + - Must provide a single canonical Dat URL for the domain. It should not be possible for a name to have multiple valid mappings. (No conflict.) + - Must not be controllable by non-owners of the domain. It should not be possible for third parties to modify the mapping. (Secure.) + - Should be accessible to as many users as possible. (Convenient.) + +An initial proposal (the [".well-known solution"](https://web.archive.org/web/20171013151452/https://github.com/beakerbrowser/beaker/wiki/Authenticated-Dat-URLs-and-HTTPS-to-Dat-Discovery)) has had prior usage in the Beaker Browser and Dat CLI. A followup proposal (the ["DNS TXT solution"](https://web.archive.org/web/20180427202745/https://github.com/beakerbrowser/beaker/wiki/Dat-DNS-TXT-records-with-optional-DNS-over-HTTPS)) was made recently, but has not yet been deployed. This DEP intends to unify these proposals formally. + + +# Usage Documentation +[usage-documentation]: #usage-documentation + +Users may provide a Dat URL to clients with the following structure: `'dat://' {name} '/'`. + +If the `name` is a 64-character hex string, it should be considered a "key" and no DNS-resolution should occur. If `name` matches the following RegExp, it is considered a "key": + +``` +^[0-9a-f]{64}$ +``` + +If the `name` does not match this RegExp, it is a "domain name" and requires resolution. A Dat client should follow the [resolution process](#resolution-process) to do this. + +``` +domain: dat://beakerbrowser.com/ +key: dat://87ed2e3b160f261a032af03921a3bd09227d0a4cde73466c17114816cae43336/ +``` + +Users have multiple options for creating a domain-name mapping. + +## DNS TXT record +[usage-dns-txt-record]: #usage-dns-txt-record + +The first option is to set a DNS TXT record at the domain which maps to a "key"-addressed Dat URL. The client will lookup this TXT record first and load the resulting Dat. That record should fit the following schema: + +``` +datkey={key} +``` + +## .well-known/dat +[usage-wellknown-dat]: #usage-wellknown-dat + +The second option is to run an HTTPS server at the domain name which includes a `/.well-known/dat` resource. That resource should provide a text file with the following schema: + +``` +dat://{key} +TTL={time in seconds} +``` + +`TTL` is optional and will default to `3600` (one hour). If set to `0`, the entry is not cached. + + +# Resolution process +[resolution-process]: #resolution-process + +Resolution of a Dat at `dat://{name}` should follow this process: + + - Client checks its names cache. If a non-expired entry is found, return with the entry. + - Client issues a DNS TXT request for `name`. This request should be issued via a secure transport (see ["DNS-over-HTTPS"](#dns-over-https)). + - Client iterates all TXT records given (skip if none). If a record's value matches the TXT record schema (see below): + - If the record includes a non-zero TTL, store the record value in the names cache. + - Return the record value. + - Client issues an HTTPS GET request to `https://{name}/.well-known/dat`. + - If the server responds with a `404 Not Found` status, client stores a `null` entry in the names cache with a TTL of 3600 and returns a failed lookup. + - If the server responds with anything other than a `200 OK` status, return a failed lookup. + - If the server responds with a malformed file (see below), return a failed lookup. + - If the server responds with a well-formed file, store the record value in the names cache (default TTL to `3600` if not provided) and return the record value. + +The DNS TXT record must match this schema: + +``` +'datkey=' [0-9a-f]{64} +``` + +The `/.well-known/dat` file must match this schema: + +``` +'dat://' [0-9a-f]{64} '/'? +( 'TTL=' [0-9]* )? +``` + +Note that DNS-record responses may not follow a pre-defined order. Therefore the results of a lookup may be undefined if multiple TXT records exist. + + +# Security and Privacy +[security-and-privacy]: #security-and-privacy + +Two issues to consider: + + - **Security**: Can we trust the lookup results for a name? + - **Privacy**: Who sees the DNS lookups? + +Traditional DNS provides neither security or privacy. All looks occur over plaintext UDP. To provide security, a separate system must authenticate the record. (In the case of HTTPS records, the SSL Certificate provides authentication.) + +Dat does not currently have a DNS authentication record (no equivalent to the SSL certificate). Therefore a lookup using UDP can not be secured. + +To solve this, this DEP recommends using [DNS-over-HTTPS](#dns-over-https). + + +## DNS-over-HTTPS +[dns-over-https]: #dns-over-https + +Until PKI can authenticate the DNS lookups (ie via SSL certificates or equivalent) there is a risk that the DNS lookup will be intercepted by an adversary. To protect against this, the client should use DNS-over-HTTPS to lookup the DNS TXT records. + +Current providers: + + - [Google](https://developers.google.com/speed/public-dns/docs/dns-over-https) + - [Cloudflare](https://developers.cloudflare.com/1.1.1.1/dns-over-https/json-format/) + +This solution improves on both the security and privacy of DNS lookup: + + - **Security**. Requests to the DNS provider are authenticated using the provider's SSL certificate. + - **Privacy**. DNS lookups are encrypted on the wire and only made visible to the DNS provider. + +DNS-over-HTTPS still requires trust in the provider to give correct responses, but this is an improvement to UDP DNS lookups, which can be trivially MITMed by malicious actors on the network. + +Whereas traditional DNS leaks name lookups to everyone on the network, DNS-over-HTTPS only reveals them to the DNS provider. This still provides some opportunity for tracking, but the opportunity is reduced to the provider alone. + + +# Drawbacks +[drawbacks]: #drawbacks + + - The use of the `.well-known/dat` resource over HTTPS creates a dependency on a service. + - DNS-over-HTTPS exposes all lookups to the provider and relies on the provider to be honest. However, because this method offsets the risk of MITM attacks, this is a worthwhile trade. Future DEPs should find alternative ways to authenticate domain-name records. + + +# Rationale and alternatives +[alternatives]: #alternatives + +- User-defined names registries could be used instead of DNS, but they would likely suffer from name conflicts without a top-down control. +- A blockchain (such as Namecoin) could be used instead of DNS, but blockchains currently have poor throughput and require users to sync large amounts of data. +- DNSSEC could be used instead of DNS-over-HTTPS, but it does not have the same level of support among gTLDs that DNS-over-HTTPS has. + + +# Changelog +[changelog]: #changelog + +- 2018-04-27: First complete draft submitted for review +- 2018-05-07: Add "Security and Privacy" section and rewrite DNS TXT record schema. +- 2018-05-16: Merged as Draft after WG approval. + |