From fc894b547e6eaeea37420bb902e62695b341cb1c Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Sun, 4 Mar 2018 17:22:45 -0800 Subject: WIP on multiwriter DEP --- proposals/0000-multiwriter.md | 255 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 255 insertions(+) create mode 100644 proposals/0000-multiwriter.md diff --git a/proposals/0000-multiwriter.md b/proposals/0000-multiwriter.md new file mode 100644 index 0000000..ec5d2d3 --- /dev/null +++ b/proposals/0000-multiwriter.md @@ -0,0 +1,255 @@ + +Title: **DEP-0000: Multi-Writer** + +Short Name: `0000-multiwriter` + +Type: Standard + +Status: Undefined (as of 2018-03-XX) + +Github PR: (add HTTPS link here after PR is opened) + +Authors: +[Bryan Newbold](https://github.com/bnewbold), +[Stephen Whitmore](https://github.com/noffle), +[Mathias Buus](https://github.com/mafintosh) + + +# Summary +[summary]: #summary + +Multi-Writer is a set of schema, API, and feature extentions to multiple agents +(users, devices, or software) to write to the same HyperDB feed. By building on +top of this abstraction layer, future versions of hyperdrive and Dat will gain +these features. + +Mechanisms for distributed consistency and granting trust are specified here; +the need for merge conflict algorithms and secure key distribution are +mentioned but specific solutions are not specified. + + +# Motivation +[motivation]: #motivation + +The current hypercore/Dat ecosystem currently lacks solutions two fundamental +use cases: + +- individual users should be able to modify distributed archives under their + control from multiple devices, at a minimum to prevent loss of control of + content if a single device (containing secret keys) is lost +- contributions from and collaboration between multiple users on a single + archive or database should be possible, with appropriate trust and access + control semantics + +Access to a single secret key is currently required to make any change to a +hypercore feed, and it is broadly considered best practice not to distribute +secret keys between multiple users or multiple devices. In fact, the current +hypercore implementation has no mechanism to resolve disputes or recover if +multiple agents used the same secret key to append to the same feed. + +Solutions to these two use cases are seen as essential for many current and +future Dat ecosystem applications. + + +# Semantics and Usage +[usage-documentation]: #usage-documentation + +TODO: semantics, terminology + +TODO: things left to application developers (secure key distribution, non-trivial merge resolution) + +TODO: brief note on scaling properties (eg, reasonable numbers of feeds per database) + + +# Implementation +[reference-documentation]: #reference-documentation + +The protobuf schema fields of interest for multi-writer (from the "Node" +message type specified in the HyperDB DEP) are: + +- `seq`: the sequence number of this entry in the owner's hypercore. 0 is the + first, 1 the second, and so forth. +- `feed`: the ID of the hypercore writer that wrote this +- `clock`: vector clock to determine node insertion causality + +## Directed acyclic graph + +The combination of all operations performed on a HyperDB by all of its members +forms a DAG (*directed acyclic graph*). Each write to the database (setting a +key to a value) includes information to point backward at all of the known +"heads" in the graph. + +To illustrate what this means, let's say Alice starts a new HyperDB and writes 2 +values to it: + +``` +// Feed + +0 (/foo/bar = 'baz') +1 (/foo/2 = '{ "some": "json" }') + + +// Graph + +Alice: 0 <--- 1 +``` + +Where sequence number 1 (the second entry) refers to sequence number 0 on the +same feed (Alice's). + +Now Alice *authorizes* Bob to write to the HyperDB. Internally, this means Alice +writes a special message to her feed saying that Bob's feed (identified by his +public key) should be read and replicated in by other participants. Her feed +becomes + +``` +// Feed + +0 (/foo/bar = 'baz') +1 (/foo/2 = '{ "some": "json" }') +2 ('' = '') + + +// Graph + +Alice: 0 <--- 1 <--- 2 +``` + +Authorization is formatted internally in a special way so that it isn't +interpreted as a key/value pair. + +Now Bob writes a value to his feed, and then Alice and Bob sync. The result is: + +``` +// Feed + +//// Alice +0 (/foo/bar = 'baz') +1 (/foo/2 = '{ "some": "json" }') +2 ('' = '') + +//// Bob +0 (/a/b = '12') + + +// Graph + +Alice: 0 <--- 1 <--- 2 +Bob : 0 +``` + +Notice that none of Alice's entries refer to Bob's, and vice versa. This is +because neither has written any entries to their feeds since the two became +aware of each other (authorized & replicated each other's feeds). + +Right now there are two "heads" of the graph: Alice's feed at seq 2, and Bob's +feed at seq 0. + +Next, Alice writes a new value, and her latest entry will refer to Bob's: + +``` +// Feed + +//// Alice +0 (/foo/bar = 'baz') +1 (/foo/2 = '{ "some": "json" }') +2 ('' = '') +3 (/foo/hup = 'beep') + +//// Bob +0 (/a/b = '12') + + +// Graph + +Alice: 0 <--- 1 <--- 2 <--/ 3 +Bob : 0 <-------------------/ +``` + +Because Alice's latest feed entry refers to Bob's latest feed entry, there is +now only one "head" in the database. That means there is enough information in +Alice's seq=3 entry to find any other key in the database. In the last example, +there were two heads (Alice's seq=2 and Bob's seq=0); both of which would need +to be read internally in order to locate any key in the database. + +Now there is only one "head": Alice's feed at seq 3. + + +## Authorization + +The set of hypercores are *authorized* in that the original author of the first +hypercore in a hyperdb must explicitly denote in their append-only log that the +public key of a new hypercore is permitted to edit the database. Any authorized +member may authorize more members. There is no revocation or other author +management elements currently. + + +## Vector clock + +Each node stores a [vector clock](https://en.wikipedia.org/wiki/Vector_clock) of +the last known sequence number from each feed it knows about. This is what forms +the DAG structure. + +A vector clock on a node of, say, `[0, 2, 5]` means: + +- when this node was written, the largest seq # in my local fed is 0 +- when this node was written, the largest seq # in the second feed I have is 2 +- when this node was written, the largest seq # in the third feed I have is 5 + +For example, Bob's vector clock for Alice's seq=3 entry above would be `[0, 3]` +since he knows of her latest entry (seq=3) and his own (seq=0). + +The vector clock is used for correctly traversing history. This is necessary for +the `db#heads` API as well as `db#createHistoryStream`. + + +# Examples +[examples]: #examples + +TODO: + + +# Security and Privacy Concerns +[privacy]: #privacy + +TODO: + + +# Drawbacks +[drawbacks]: #drawbacks + +TODO: Why should we *not* do this? + + +# Rationale and alternatives +[alternatives]: #alternatives + +TODO: + +- Why is this design the best in the space of possible designs? +- What other designs have been considered and what is the rationale for not choosing them? +- What is the impact of not doing this? + + +# Unresolved questions +[unresolved]: #unresolved-questions + +TODO: + +- What parts of the design do you expect to resolve through the DEP consensus process before this gets merged? +- What parts of the design do you expect to resolve through implementation and code review, or are left to independent library or application developers? +- What related issues do you consider out of scope for this DEP that could be addressed in the future independently of the solution that comes out of this DEP? + + +# Changelog +[changelog]: #changelog + +As of March 2018, Mathias Buus (@mafintosh) is leading development of a hyperdb +nodejs module on [github](https://github.com/mafintosh/hyperdb), which includes +multi-writer features and is the basis for this DEP. + +- 2017-12-06: @noffle publishes `ARCHITECTURE.md` overview in the + [hyperdb github repo][arch_md] +- 2018-03-XX: First partial draft submitted for review + +[arch_md]: https://github.com/mafintosh/hyperdb/blob/master/ARCHITECTURE.md -- cgit v1.2.3 From 27d1cdc77676d5db996db6c3504f8ff3d6de159c Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Sun, 18 Mar 2018 23:57:56 -0700 Subject: work in progress on multi-writer DEP --- proposals/0000-multiwriter.md | 158 +++++++++++++++++++++++++++++++++++------- 1 file changed, 133 insertions(+), 25 deletions(-) diff --git a/proposals/0000-multiwriter.md b/proposals/0000-multiwriter.md index ec5d2d3..a9b8542 100644 --- a/proposals/0000-multiwriter.md +++ b/proposals/0000-multiwriter.md @@ -7,7 +7,7 @@ Type: Standard Status: Undefined (as of 2018-03-XX) -Github PR: (add HTTPS link here after PR is opened) +Github PR: [Draft](https://github.com/datprotocol/DEPs/pull/10) Authors: [Bryan Newbold](https://github.com/bnewbold), @@ -18,21 +18,24 @@ Authors: # Summary [summary]: #summary -Multi-Writer is a set of schema, API, and feature extentions to multiple agents -(users, devices, or software) to write to the same HyperDB feed. By building on -top of this abstraction layer, future versions of hyperdrive and Dat will gain -these features. +Multi-Writer is a set of schema, API, and feature extentions to allow multiple +agents (users, devices, or software) to write to the same hyperdb database. By +building on top of this abstraction layer, future versions of hyperdrive and +Dat will gain these features. Mechanisms for distributed consistency and granting trust are specified here; the need for merge conflict algorithms and secure key distribution are mentioned but specific solutions are not specified. +This DEP forms the second half of the hyperdb specification; the first half +covered only the key/value database aspects of hyperdb. + # Motivation [motivation]: #motivation -The current hypercore/Dat ecosystem currently lacks solutions two fundamental -use cases: +The current hypercore/Dat ecosystem currently lacks solutions for two +fundamental use cases: - individual users should be able to modify distributed archives under their control from multiple devices, at a minimum to prevent loss of control of @@ -51,35 +54,133 @@ Solutions to these two use cases are seen as essential for many current and future Dat ecosystem applications. -# Semantics and Usage +# Concepts, Behavior, and Usage [usage-documentation]: #usage-documentation -TODO: semantics, terminology +The multi-writer features of hyperdb are implemented by storing and replicating +the contributions of each writer in a separate hypercore feed. This +specification concerns itself with the details of how changes from multiple +feeds (which may be written and replicated concurrently or asynchronously) are +securely combined to present a unified key/value interface. + +The following related concerns are explicitly left to application developers to +design and implement: + +- secure key distribution and authentication (eg, if a friend should be given + write access to a hyperdb database, how is that friend's feed key found and + verified?) +- merge conflict resolution, potentially using application-layer semantics + +Before we go any further, a few definitions: + +*Feed*: A hypercore feed: an append-only log of *Entries*, which can be +arbitrary data blobs. Hyperdb is built on top of several Feeds. + +*Writer*: a user (or user controlled device or software agent) that has a +distinct feed with a public/private key pair, and thus the ability to append +hypercore entries and "write" changes to their version of the database. + +*Original Writer*: the writer who created a given hyperdb database in the form +of the *Original Feed*. The public key of the original feed is the one used to +reference the database as a collection of feeds (eg, for the purpose of +discovery). + +At a high level, multi-writer hyperdb works by having existing authorized +writers (starting with the original writer) include authorization of new +writers by appending metadata to their own feed which points to the new feeds +(by public key). Each entry in each writer's feed contains "clock" metadata +that records the known state of the entire database (all writers) seen from the +perspective of that writer at the time they created the entry, in the form of +"clock" version pointers. This metadata (a "[vector clock][vc]") can be used by +other writers to resolve (or at least identify) conflicting content in the +database. The technical term for this type of system is a "Conflict-free +replicated data type" ([CRDT][crdt]). + +[vc]: https://en.wikipedia.org/wiki/Vector_clock +[crdt]: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type + + +## API +[api]: #api + +The `db.get(key)` method described in the [hyperdb DEP][dep-hyperdb] actually +returns an array of values. If it is unambiguous what the consistent value is, +the array will have only that one value. If there is a conflict, multiple +values will be returned. If multiple values are returned, the writer should +attempt to merge the values (or chose one over the other) and write the result +to the database with `db.put(key, value)`. + +`db.authorize(key)` will write metadata to the local feed authorizing the new +feed (corresponding to `key`) to be included in the database. -TODO: things left to application developers (secure key distribution, non-trivial merge resolution) + +[dep-hyperdb]: https://github.com/bnewbold/dat-deps/blob/dep-hyperdb/proposals/0000-hyperdb.md -TODO: brief note on scaling properties (eg, reasonable numbers of feeds per database) + +## Scaling +[scaling]: #scaling + +TODO: brief note on scaling properties (eg, reasonable numbers of writers per +database) # Implementation [reference-documentation]: #reference-documentation -The protobuf schema fields of interest for multi-writer (from the "Node" -message type specified in the HyperDB DEP) are: +The complete protobuf schemas for the hyperdb "Entry" and "InflatedEntry" +message types (as specified in the hyperdb DEP) are: + + +``` +message Entry { + required string key = 1; + optional bytes value = 2; + required bytes trie = 3; + repeated uint64 clock = 4; + optional uint64 inflate = 5; +} + +message InflatedEntry { + message Feed { + required bytes key = 1; + } + + required string key = 1; + optional bytes value = 2; + required bytes trie = 3; + repeated uint64 clock = 4; + optional uint64 inflate = 5; + repeated Feed feeds = 6; + optional bytes contentFeed = 7; +} +``` + +The fields of interest for multi-writer are: + +- `clock`: a "vector clock" to record observed state at the time of writing the + entry. Included in every Entry and InflatedEntry. +- `inflate`: a back-pointer to the entry index of the most recent InflatedEntry + (containing a feed metadata change). Included in every Entry and + InflatedEntry. +- `feeds`: list of public keys for each writer's feed. Only included in + InflatedEntry, and only when feeds have changed. + +When serialized on disk in a SLEEP directory, the original feed is written +under `./source/`. If the "local" feed is different from the original feed, it +is written to `./local/`. All other feeds are written under directories +prefixed `./peers//`. -- `seq`: the sequence number of this entry in the owner's hypercore. 0 is the - first, 1 the second, and so forth. -- `feed`: the ID of the hypercore writer that wrote this -- `clock`: vector clock to determine node insertion causality +TODO: the above disk format is incorrect? ## Directed acyclic graph +[dag]: #dag -The combination of all operations performed on a HyperDB by all of its members +The combination of all operations performed on a hyperdb by all of its members forms a DAG (*directed acyclic graph*). Each write to the database (setting a key to a value) includes information to point backward at all of the known "heads" in the graph. -To illustrate what this means, let's say Alice starts a new HyperDB and writes 2 +To illustrate what this means, let's say Alice starts a new hyperdb and writes 2 values to it: ``` @@ -97,7 +198,7 @@ Alice: 0 <--- 1 Where sequence number 1 (the second entry) refers to sequence number 0 on the same feed (Alice's). -Now Alice *authorizes* Bob to write to the HyperDB. Internally, this means Alice +Now Alice *authorizes* Bob to write to the hyperdb. Internally, this means Alice writes a special message to her feed saying that Bob's feed (identified by his public key) should be read and replicated in by other participants. Her feed becomes @@ -218,7 +319,11 @@ TODO: # Drawbacks [drawbacks]: #drawbacks -TODO: Why should we *not* do this? +Significant increase in implementation, application, and user experience +complexity. + +Two hard problems (merges and secure key distribution) are left to application +developers. # Rationale and alternatives @@ -234,11 +339,14 @@ TODO: # Unresolved questions [unresolved]: #unresolved-questions -TODO: +What is the technical term for the specific CRDT we are using? +"Operation-based"? + +If there are conflicts resulting in ambiguity of whether a key has been deleted +or has a new value, does `db.get(key)` return an array of `[new_value, None]`? -- What parts of the design do you expect to resolve through the DEP consensus process before this gets merged? -- What parts of the design do you expect to resolve through implementation and code review, or are left to independent library or application developers? -- What related issues do you consider out of scope for this DEP that could be addressed in the future independently of the solution that comes out of this DEP? +What is a reasonable large number of writers to have in a single database? +Write "Scaling" section. # Changelog -- cgit v1.2.3 From 33541cbfbcfc2d45fa5b4336b969fbd9181fb8ef Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Sat, 9 Jun 2018 18:12:21 -0700 Subject: address early maf comments on PR --- proposals/0000-multiwriter.md | 28 +++++++++++++++++++++++----- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/proposals/0000-multiwriter.md b/proposals/0000-multiwriter.md index a9b8542..fb60efc 100644 --- a/proposals/0000-multiwriter.md +++ b/proposals/0000-multiwriter.md @@ -69,7 +69,8 @@ design and implement: - secure key distribution and authentication (eg, if a friend should be given write access to a hyperdb database, how is that friend's feed key found and verified?) -- merge conflict resolution, potentially using application-layer semantics +- merge conflict resolution (using the provided API), potentially using + application-layer semantics Before we go any further, a few definitions: @@ -120,8 +121,14 @@ feed (corresponding to `key`) to be included in the database. ## Scaling [scaling]: #scaling -TODO: brief note on scaling properties (eg, reasonable numbers of writers per -database) +There is some overhead associated with each "writer" added to the feed, +impacting the number of files on disk, memory use, and the computational cost +of some lookup oprations. The design should easily accomodate dozens of +writers, and should scale to 1,000 writers without too much additional +overhead. Note that a large number of writers also implies a larger number and +rate of append operations, and additional network connections, which may cause +scaling issues on their own. More real-world experience and benchmarking is +needed in this area. # Implementation @@ -298,7 +305,8 @@ A vector clock on a node of, say, `[0, 2, 5]` means: - when this node was written, the largest seq # in the third feed I have is 5 For example, Bob's vector clock for Alice's seq=3 entry above would be `[0, 3]` -since he knows of her latest entry (seq=3) and his own (seq=0). +since he knows of her latest entry (seq=3) and his own (seq=0). Note that the +order of clocks is not consistent across writers, only within the same feed. The vector clock is used for correctly traversing history. This is necessary for the `db#heads` API as well as `db#createHistoryStream`. @@ -329,6 +337,13 @@ developers. # Rationale and alternatives [alternatives]: #alternatives +Design goals for hyperdb (including the multi-writer feature) included: + +- ability to execute operations (get, put) with a sparse (partial) replication + of the database, using as few additional network requests as possible +- minimal on-disk and on-wire overhead +- implemented on top of an append-only log (to build on top of hypercore) + TODO: - Why is this design the best in the space of possible designs? @@ -344,6 +359,7 @@ What is the technical term for the specific CRDT we are using? If there are conflicts resulting in ambiguity of whether a key has been deleted or has a new value, does `db.get(key)` return an array of `[new_value, None]`? +Answer: `get` always returns nodes (not just values), so context is included. In the case of a deletion, a the value within the node will be `null`. What is a reasonable large number of writers to have in a single database? Write "Scaling" section. @@ -356,8 +372,10 @@ As of March 2018, Mathias Buus (@mafintosh) is leading development of a hyperdb nodejs module on [github](https://github.com/mafintosh/hyperdb), which includes multi-writer features and is the basis for this DEP. +Jim Pick (@jimpick) has been an active contributor working out multi-writer details. + - 2017-12-06: @noffle publishes `ARCHITECTURE.md` overview in the [hyperdb github repo][arch_md] -- 2018-03-XX: First partial draft submitted for review +- 2018-06-XX: First partial draft submitted for review [arch_md]: https://github.com/mafintosh/hyperdb/blob/master/ARCHITECTURE.md -- cgit v1.2.3 From 89c41746bc8c3a43651ede387b9ca39fc2ba2a2d Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Sat, 9 Jun 2018 22:10:14 -0700 Subject: tweaks, updates, some progress --- proposals/0000-multiwriter.md | 65 ++++++++++++++++++++++++++++--------------- 1 file changed, 43 insertions(+), 22 deletions(-) diff --git a/proposals/0000-multiwriter.md b/proposals/0000-multiwriter.md index fb60efc..b83606a 100644 --- a/proposals/0000-multiwriter.md +++ b/proposals/0000-multiwriter.md @@ -75,7 +75,10 @@ design and implement: Before we go any further, a few definitions: *Feed*: A hypercore feed: an append-only log of *Entries*, which can be -arbitrary data blobs. Hyperdb is built on top of several Feeds. +arbitrary data blobs. + +*Database*: in this context, a Hyperdb key/value database. Built from several +Feeds (two Feeds per Writer). *Writer*: a user (or user controlled device or software agent) that has a distinct feed with a public/private key pair, and thus the ability to append @@ -101,21 +104,33 @@ replicated data type" ([CRDT][crdt]). [crdt]: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type -## API +## Core API [api]: #api -The `db.get(key)` method described in the [hyperdb DEP][dep-hyperdb] actually -returns an array of values. If it is unambiguous what the consistent value is, -the array will have only that one value. If there is a conflict, multiple -values will be returned. If multiple values are returned, the writer should -attempt to merge the values (or chose one over the other) and write the result -to the database with `db.put(key, value)`. +A "node" is a data structure representing a database entry, including the +`key`, `value`, and feed that the entry is commited to. + +`db.get(key)` (as described in the [hyperdb DEP][dep-hyperdb]) +returns an array of nodes. If it is unambiguous what the consistent state of a key is, +the array will have only that one value. If there is a conflict (ambiguity), +multiple nodes will be returned. + +If multiple nodes are returned from a `get`, the writer should attempt to merge +the values (or chose one over the other) and write the result to the database +with `db.put(key, value)`. Client libraries can make this process more +ergonomic by accepting a helper function to select between multiple nodes. +Libraries can also offer an option to either directly return the value of a +single node (instead of the node itself), or raise an error; this is likely to +be more ergonomic for applications which do not intend to support multiple +writers per database. `db.authorize(key)` will write metadata to the local feed authorizing the new feed (corresponding to `key`) to be included in the database. - -[dep-hyperdb]: https://github.com/bnewbold/dat-deps/blob/dep-hyperdb/proposals/0000-hyperdb.md +`db.authorized(key)` (returning a boolean) indicates whether the given `key` is +an authorized writer to the hyperdb database. + +[dep-hyperdb]: https://github.com/datprotocol/DEPs/blob/master/proposals/0004-hyperdb.md ## Scaling @@ -131,7 +146,7 @@ scaling issues on their own. More real-world experience and benchmarking is needed in this area. -# Implementation +# Implementation Details [reference-documentation]: #reference-documentation The complete protobuf schemas for the hyperdb "Entry" and "InflatedEntry" @@ -142,9 +157,10 @@ message types (as specified in the hyperdb DEP) are: message Entry { required string key = 1; optional bytes value = 2; - required bytes trie = 3; - repeated uint64 clock = 4; - optional uint64 inflate = 5; + optional bool deleted = 3; + required bytes trie = 4; + repeated uint64 clock = 5; + optional uint64 inflate = 6; } message InflatedEntry { @@ -154,11 +170,12 @@ message InflatedEntry { required string key = 1; optional bytes value = 2; - required bytes trie = 3; - repeated uint64 clock = 4; - optional uint64 inflate = 5; - repeated Feed feeds = 6; - optional bytes contentFeed = 7; + optional bool deleted = 3; + required bytes trie = 4; + repeated uint64 clock = 5; + optional uint64 inflate = 6; + repeated Feed feeds = 7; + optional bytes contentFeed = 8; } ``` @@ -173,9 +190,9 @@ The fields of interest for multi-writer are: InflatedEntry, and only when feeds have changed. When serialized on disk in a SLEEP directory, the original feed is written -under `./source/`. If the "local" feed is different from the original feed, it -is written to `./local/`. All other feeds are written under directories -prefixed `./peers//`. +under `source/`. If the "local" feed is different from the original feed (aka, +local is not the "Original Writer"), it is written to `local/`. All other feeds +are written under directories prefixed `./peers//`. TODO: the above disk format is incorrect? @@ -364,6 +381,10 @@ Answer: `get` always returns nodes (not just values), so context is included. In What is a reasonable large number of writers to have in a single database? Write "Scaling" section. +The javascript hyperdb implementation has a new `deleted` flag and `Header` +prefix message. These fall under the scope of the `hyperdb` DEP; should they be +mentioned here? + # Changelog [changelog]: #changelog -- cgit v1.2.3 From 714563f4e49add95f13f42e8dfb4e34c9b964fd0 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Sat, 9 Jun 2018 22:56:32 -0700 Subject: update unresolved, security, usage --- proposals/0000-multiwriter.md | 90 +++++++++++++++++++++++++++---------------- 1 file changed, 57 insertions(+), 33 deletions(-) diff --git a/proposals/0000-multiwriter.md b/proposals/0000-multiwriter.md index b83606a..07116a2 100644 --- a/proposals/0000-multiwriter.md +++ b/proposals/0000-multiwriter.md @@ -113,7 +113,10 @@ A "node" is a data structure representing a database entry, including the `db.get(key)` (as described in the [hyperdb DEP][dep-hyperdb]) returns an array of nodes. If it is unambiguous what the consistent state of a key is, the array will have only that one value. If there is a conflict (ambiguity), -multiple nodes will be returned. +multiple nodes will be returned. If a key has been deleted from the database, a +node with the `deleted` flag will be returned; note that such "tombstone" nodes +can still have a `value` field, which may contain application-specific metadata +(such as a timestamp). If multiple nodes are returned from a `get`, the writer should attempt to merge the values (or chose one over the other) and write the result to the database @@ -125,11 +128,14 @@ be more ergonomic for applications which do not intend to support multiple writers per database. `db.authorize(key)` will write metadata to the local feed authorizing the new -feed (corresponding to `key`) to be included in the database. +feed (corresponding to `key`) to be included in the database. Once authorized, +a feed may further authorize additional feeds (recursively). `db.authorized(key)` (returning a boolean) indicates whether the given `key` is an authorized writer to the hyperdb database. +At the time of this DEP there is no mechanism for revoking authorization. + [dep-hyperdb]: https://github.com/datprotocol/DEPs/blob/master/proposals/0004-hyperdb.md @@ -192,9 +198,8 @@ The fields of interest for multi-writer are: When serialized on disk in a SLEEP directory, the original feed is written under `source/`. If the "local" feed is different from the original feed (aka, local is not the "Original Writer"), it is written to `local/`. All other feeds -are written under directories prefixed `./peers//`. +are written under directories prefixed `peers//`. -TODO: the above disk format is incorrect? ## Directed acyclic graph [dag]: #dag @@ -300,14 +305,6 @@ to be read internally in order to locate any key in the database. Now there is only one "head": Alice's feed at seq 3. -## Authorization - -The set of hypercores are *authorized* in that the original author of the first -hypercore in a hyperdb must explicitly denote in their append-only log that the -public key of a new hypercore is permitted to edit the database. Any authorized -member may authorize more members. There is no revocation or other author -management elements currently. - ## Vector clock @@ -338,17 +335,50 @@ TODO: # Security and Privacy Concerns [privacy]: #privacy -TODO: +As noted above, there is no existing mechanism for removing authorization for a +feed once added, and an authorized feed may recursively authorize additional +feeds. There is also no mechanism to restrict the scope of an authorized feed's +actions (eg, limit to only a specific path prefix). This leaves application +designers and users with few tools to control trust or access ("all or +nothing"). Care must be taken in particular if self-mutating software is being +distributed via hyperdb, or when action may be taken automatically based on the +most recent content of a database (eg, bots or even third-party tools may +publish publicly, or even take real-world action like controlling an electrical +relay). + +There is no mechanism to remove malicious history (or any history for that +matter); if an authorized (but hostile) writer appends a huge number of key +operations (bloating hyperdb metadata size), or posts offensive or illegal +content to a database, there is no way to permanently remove the data without +creating an new database. + +The read semantics of hyperdb are unchanged from hypercore: an actor does not +need to be "authorized" (for writing) to read the full history of a database, +they only need the public key. # Drawbacks [drawbacks]: #drawbacks -Significant increase in implementation, application, and user experience -complexity. - -Two hard problems (merges and secure key distribution) are left to application -developers. +Mutli-writer capability incurs a non-trivial increase in library, application, +and user experience complexity. For many applications, collaboration is an +essential feature, and the complexity is easily justified. To minimize +complexity for applications which do not need multi-writer features, +implementation authors should consider configuration modes which hide the +complexity of unused features. For example, by having an option to returning a +single node for a `get()` (and throw an error if there is a conflict), or a +flag to throw an error if a database unexpectedly contains more than a single +feed. + +Two tasks (conflict merges and secure key distribution) are left to application +developers. Both of these are Hard Problems. The current design mitigates the +former by reducing the number of merge conflicts that need to be handled by an +application (aka, only the non-trivial ones need to be handled), and +implementation authors are encouraged to provide an ergonomic API for writing +conflict resolvers. The second problem (secure key distribution) is out of +scope for this DEP. It is hoped that at least one pattern or library will +emerge from the Dat ecosystem such that each application author doesn't need to +invent a solution from scratch. # Rationale and alternatives @@ -361,29 +391,23 @@ Design goals for hyperdb (including the multi-writer feature) included: - minimal on-disk and on-wire overhead - implemented on top of an append-only log (to build on top of hypercore) -TODO: +If a solution for core use cases like collaboration and multi-device +synchronization is not provided at a low level (as this DEP provides), each +application will need to invent a solution at a higher level, incuring +duplicated effort and a higher risk of bugs. -- Why is this design the best in the space of possible designs? -- What other designs have been considered and what is the rationale for not choosing them? -- What is the impact of not doing this? +As an alternative to CRDTs, Operational Transformation (OT) has a reputation +for being more difficult to understand and implement. # Unresolved questions [unresolved]: #unresolved-questions What is the technical term for the specific CRDT we are using? -"Operation-based"? - -If there are conflicts resulting in ambiguity of whether a key has been deleted -or has a new value, does `db.get(key)` return an array of `[new_value, None]`? -Answer: `get` always returns nodes (not just values), so context is included. In the case of a deletion, a the value within the node will be `null`. - -What is a reasonable large number of writers to have in a single database? -Write "Scaling" section. +"Operation-based" or "State-based"? -The javascript hyperdb implementation has a new `deleted` flag and `Header` -prefix message. These fall under the scope of the `hyperdb` DEP; should they be -mentioned here? +What is the actual on-disk layout (folder structure), if not what is documented +here? # Changelog -- cgit v1.2.3 From 3b8215a805048819813f39f9181a4553442f37b4 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Sun, 10 Jun 2018 01:18:51 -0700 Subject: update impl reference and example --- proposals/0000-multiwriter.md | 244 +++++++++++++++++++++++++++--------------- 1 file changed, 158 insertions(+), 86 deletions(-) diff --git a/proposals/0000-multiwriter.md b/proposals/0000-multiwriter.md index 07116a2..ed35df6 100644 --- a/proposals/0000-multiwriter.md +++ b/proposals/0000-multiwriter.md @@ -200,136 +200,192 @@ under `source/`. If the "local" feed is different from the original feed (aka, local is not the "Original Writer"), it is written to `local/`. All other feeds are written under directories prefixed `peers//`. +## Feeds and Vector Clocks + +At any point in time, each writer has a potentially unique view of the +"current" state of the database as a whole; this is the nature of real-world +distributed systems. For example, a given write might have the most recent +appends from one peer (eg, only seconds old), but be missing much older appends +from another (eg, days or weeks out of date). By having each writer include +metadata about their percieved state of the system as a whole in operations to +their Feed, all writers are able to collectively converge on an "eventually +consistent" picture of the database as whole (this process will be described in +the next section). + +A writer's "current known state" of the database consists of the set of active +Feeds, and for each the most recent entry sequence number ("version"). This +state can be serialized as an array of integers, refered to as a [vector +clock](https://en.wikipedia.org/wiki/Vector_clock). + +Each `put()` operation on the database appends a node to the writer's `local` +feed, and contains the writer's vector clock as observed at that time. +`InflatedEntry` nodes also contain a list of all known authorized Feeds; +inflated nodes only need to be written when the Feed list changes. Every +non-inflated entry contains a pointer back to the most recent inflated entry; +inflated entries themselves contain a pointer back to the previous inflated +entry (the first inflated entry has a null pointer). Elements of a vector clock +are ordered by the Feed list from the corresponding Inflated entry. + +By convention, the order of Feed lists is to start with the writer's local +feed first, then proceed by the order in which Feeds were discovered. Note that +this ordering is not consistent across writers, only within the same feed. + +As an example, if a node (non-inflated entry) had a vector clock of `[0, 2, +5]`, that would mean: + +- when this node was written, the largest seq # in the writer's local fed was 0 +- when this node was written, the largest seq # in the second known feed was 2 +- when this node was written, the largest seq # in the third known feed was 5 + + +## Multi-Feed Aware hyperdb +[multi-aware]: #multi-aware + +The [hyperdb DEP](hyperdb-dep) specifies the procedures for lookup (`get()`) +and append (`put()`) operations to the database, as well as binary encoding +schemes for entry messages. + +Note that the trie encoding specifies pointers in a `(feed, entry)` pair +format. The `feed` integer is an index into the most recent Feed list (found in +the most recent inflated entry; see the last section). When working with a +multi-writer hyperdb database, simply look up entries in the appropriate feed, +instead of only looking in the current feed. The next section ("Consistent +History") describes which entry (or entries) to start with instead of simply +assuming the most recent entry from the local feed. + + +## Consistent History +[consist-history]: #consist-history + +The set of all appended nodes in all feeds of a hyperdb, and all the vector +clock pointers between them, forms a "directed acyclic graph" (DAG). Any node +which does not have another node pointing to it is called a "head" (this +terminology is similar to that used in git). At any point in time, an observed +copy of a database has one or more heads, each representing the top of a +tree-like graph. In the trivial case of a non-multi-writer hyperdb, there is +always only a single head: the most recent entry in the local feed. Just after +appending to the local feed, there is also always a single head, because that +node's vector clock will reference all know most recent entries from other +feeds. It is only when nodes are appended by separate writers who did not know +of the others' concurrent action (and then these changes are replicated) that +there are multiple heads. + +When operating on the database (eg, executing a `get()` operation), all heads +must be considered. The lookup proceedure documented in the [hyperdb +DEP](hyperdb-dep) must be repeated for each head, and nodes returned +representing the set of all unique values. + +The situation where a `get()` operation multiple heads returns different values +for the same key is called a "conflict" and requires a "merge" to resolve. Some +writer (either a human being or application-layer code) must decide on the +correct value for the key and write that value as a new entry (with a vector +clock that includes the previous heads). The procedure for chosing the best +value to use in a given conflict is sometimes easy to determine, but is +impossible to determine algorithmically in the general case. See the "Usage" +section for more details. -## Directed acyclic graph -[dag]: #dag -The combination of all operations performed on a hyperdb by all of its members -forms a DAG (*directed acyclic graph*). Each write to the database (setting a -key to a value) includes information to point backward at all of the known -"heads" in the graph. +# Examples +[examples]: #examples -To illustrate what this means, let's say Alice starts a new hyperdb and writes 2 -values to it: +Let's say Alice starts a new hyperdb and writes two key/value entries to it: ``` -// Feed - -0 (/foo/bar = 'baz') -1 (/foo/2 = '{ "some": "json" }') +// Operations +Alice: db.put('/foo/bar', 'baz') +Alice: db.put('/foo/2', '{"json":3}') +// Alice's Feed +0 (key='/foo/bar', value='baz', + vector_clock=[], inflated=null, feeds=['a11ce...']) (InflatedEntry) +1 (key='/foo/2', value='{"json":3}', + vector_clock=[0], inflated=0) // Graph - Alice: 0 <--- 1 ``` -Where sequence number 1 (the second entry) refers to sequence number 0 on the -same feed (Alice's). +The vector clock at `seq=1` points back to `seq=0`. -Now Alice *authorizes* Bob to write to the hyperdb. Internally, this means Alice -writes a special message to her feed saying that Bob's feed (identified by his -public key) should be read and replicated in by other participants. Her feed -becomes +Next Alice *authorizes* Bob to write to the database. Internally, this means Alice +writes an Inflated entry to her feed that contains Bob's Feed (identified by his +public key) in her feed list. ``` -// Feed - -0 (/foo/bar = 'baz') -1 (/foo/2 = '{ "some": "json" }') -2 ('' = '') +// Operations +Alice: db.authorize('b0b123...') +// Alice's Feed +0 (key='/foo/bar', value='baz', + vector_clock=[], inflated=null, feeds=['a11ce...']) (InflatedEntry) +1 (key='/foo/2', value='{"json":3}', + vector_clock=[0], inflated=0) +2 (key=null, value=null, + vector_clock=[1], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry) // Graph - Alice: 0 <--- 1 <--- 2 ``` -Authorization is formatted internally in a special way so that it isn't -interpreted as a key/value pair. - -Now Bob writes a value to his feed, and then Alice and Bob sync. The result is: +Bob writes a value to his feed, and then Alice and Bob sync. The result is: ``` -// Feed +// Operations +Bob: db.put('/a/b', '12) -//// Alice -0 (/foo/bar = 'baz') -1 (/foo/2 = '{ "some": "json" }') -2 ('' = '') - -//// Bob -0 (/a/b = '12') +// Alice's Feed +0 (key='/foo/bar', value='baz', + vector_clock=[], inflated=null, feeds=['a11ce...']) (InflatedEntry) +1 (key='/foo/2', value='{"json":3}', + vector_clock=[0], inflated=0) +2 (key=null, value=null, + vector_clock=[1], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry) +// Bob's Feed +0 (key='/a/b', value='12', + vector_clock=[], inflated=null, feeds=['b0b123...']) (InflatedEntry)) // Graph - Alice: 0 <--- 1 <--- 2 Bob : 0 ``` -Notice that none of Alice's entries refer to Bob's, and vice versa. This is -because neither has written any entries to their feeds since the two became -aware of each other (authorized & replicated each other's feeds). - +Notice that none of Alice's entries refer to Bob's, and vice versa. Neither has +written any entries to their feeds since the two became aware of each other. Right now there are two "heads" of the graph: Alice's feed at seq 2, and Bob's -feed at seq 0. +feed at seq 0. Any `get()` operations would need to descend from both heads, +though in this situation there would be no conflicts as the keys in the two +feeds are disjoint. Next, Alice writes a new value, and her latest entry will refer to Bob's: ``` -// Feed +// Operations +Alice: db.put('/foo/hup', 'beep') -//// Alice -0 (/foo/bar = 'baz') -1 (/foo/2 = '{ "some": "json" }') -2 ('' = '') -3 (/foo/hup = 'beep') +// Alice's Feed +0 (key='/foo/bar', value='baz', + vector_clock=[], inflated=null, feeds=['a11ce...']) (InflatedEntry) +1 (key='/foo/2', value='{"json":3}', + vector_clock=[0], inflated=0) +2 (key=null, value=null, + vector_clock=[1, null], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry) +3 (key='/foo/hup', value='beep', + vector_clock=[2,0], inflated=2) -//// Bob -0 (/a/b = '12') +// Bob's Feed +0 (key='/a/b', value='12', + vector_clock=[], inflated=null, feeds=['b0b123...']) (InflatedEntry)) // Graph - Alice: 0 <--- 1 <--- 2 <--/ 3 Bob : 0 <-------------------/ ``` -Because Alice's latest feed entry refers to Bob's latest feed entry, there is -now only one "head" in the database. That means there is enough information in -Alice's seq=3 entry to find any other key in the database. In the last example, -there were two heads (Alice's seq=2 and Bob's seq=0); both of which would need -to be read internally in order to locate any key in the database. - -Now there is only one "head": Alice's feed at seq 3. - - - -## Vector clock - -Each node stores a [vector clock](https://en.wikipedia.org/wiki/Vector_clock) of -the last known sequence number from each feed it knows about. This is what forms -the DAG structure. - -A vector clock on a node of, say, `[0, 2, 5]` means: - -- when this node was written, the largest seq # in my local fed is 0 -- when this node was written, the largest seq # in the second feed I have is 2 -- when this node was written, the largest seq # in the third feed I have is 5 - -For example, Bob's vector clock for Alice's seq=3 entry above would be `[0, 3]` -since he knows of her latest entry (seq=3) and his own (seq=0). Note that the -order of clocks is not consistent across writers, only within the same feed. - -The vector clock is used for correctly traversing history. This is necessary for -the `db#heads` API as well as `db#createHistoryStream`. - - -# Examples -[examples]: #examples - -TODO: +Alice's latest feed entry now points to Bob's latest feed entry, and there is +only one "head" in the database. This means that any `get()` operations only +need to run once, starting at `seq=3` in Alice's feed. # Security and Privacy Concerns @@ -356,6 +412,10 @@ The read semantics of hyperdb are unchanged from hypercore: an actor does not need to be "authorized" (for writing) to read the full history of a database, they only need the public key. +As noted in other DEPs, a malicious writer can potentially execute a denial of +service (DoS) attack by appending hyperdb entries that for a cyclic loop of +references. + # Drawbacks [drawbacks]: #drawbacks @@ -409,6 +469,18 @@ What is the technical term for the specific CRDT we are using? What is the actual on-disk layout (folder structure), if not what is documented here? +Couldn't the local feed's sequence number be skipped from vector clocks, +because it's implied by the sequence number of the hypercore entry itself? Same +with the key in the feed list (for inflated entries). + +If there are multiple heads, but they all return the same `value` for a `get()` +operation, how is it decided which `node` will be returned? AKA, values are the +same, but node metadata might not be (order of vector clock, etc). + +Suspect that some details are off in the example: shouldn't the InflatedEntry +authorizing a new feed include a vector clock reference to a specific seq in +that feed? Should new local (not yet authorized) feeds reference reference +their source in an initial InflatedEntry (at seq=0)? # Changelog [changelog]: #changelog -- cgit v1.2.3 From 0e9bf58695348e6a0409d5f6a08e0e19a9b318df Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Sun, 10 Jun 2018 01:26:36 -0700 Subject: tweak vector_clock and inflated refs in example --- proposals/0000-multiwriter.md | 24 ++++++++++++++---------- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/proposals/0000-multiwriter.md b/proposals/0000-multiwriter.md index ed35df6..d2e6318 100644 --- a/proposals/0000-multiwriter.md +++ b/proposals/0000-multiwriter.md @@ -191,9 +191,12 @@ The fields of interest for multi-writer are: entry. Included in every Entry and InflatedEntry. - `inflate`: a back-pointer to the entry index of the most recent InflatedEntry (containing a feed metadata change). Included in every Entry and - InflatedEntry. + InflatedEntry. Should not be included for the very first entry in a feed + (which is an InflatedEntry). - `feeds`: list of public keys for each writer's feed. Only included in - InflatedEntry, and only when feeds have changed. + InflatedEntry, and only when feeds have changed. Does include a + self-reference to the current (local) Feed's key, always as the first + element. When serialized on disk in a SLEEP directory, the original feed is written under `source/`. If the "local" feed is different from the original feed (aka, @@ -297,7 +300,7 @@ Alice: db.put('/foo/2', '{"json":3}') // Alice's Feed 0 (key='/foo/bar', value='baz', - vector_clock=[], inflated=null, feeds=['a11ce...']) (InflatedEntry) + vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) 1 (key='/foo/2', value='{"json":3}', vector_clock=[0], inflated=0) @@ -317,7 +320,7 @@ Alice: db.authorize('b0b123...') // Alice's Feed 0 (key='/foo/bar', value='baz', - vector_clock=[], inflated=null, feeds=['a11ce...']) (InflatedEntry) + vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) 1 (key='/foo/2', value='{"json":3}', vector_clock=[0], inflated=0) 2 (key=null, value=null, @@ -335,7 +338,7 @@ Bob: db.put('/a/b', '12) // Alice's Feed 0 (key='/foo/bar', value='baz', - vector_clock=[], inflated=null, feeds=['a11ce...']) (InflatedEntry) + vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) 1 (key='/foo/2', value='{"json":3}', vector_clock=[0], inflated=0) 2 (key=null, value=null, @@ -343,7 +346,7 @@ Bob: db.put('/a/b', '12) // Bob's Feed 0 (key='/a/b', value='12', - vector_clock=[], inflated=null, feeds=['b0b123...']) (InflatedEntry)) + vector_clock=[0], inflated=null, feeds=['b0b123...']) (InflatedEntry)) // Graph Alice: 0 <--- 1 <--- 2 @@ -365,7 +368,7 @@ Alice: db.put('/foo/hup', 'beep') // Alice's Feed 0 (key='/foo/bar', value='baz', - vector_clock=[], inflated=null, feeds=['a11ce...']) (InflatedEntry) + vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) 1 (key='/foo/2', value='{"json":3}', vector_clock=[0], inflated=0) 2 (key=null, value=null, @@ -375,7 +378,7 @@ Alice: db.put('/foo/hup', 'beep') // Bob's Feed 0 (key='/a/b', value='12', - vector_clock=[], inflated=null, feeds=['b0b123...']) (InflatedEntry)) + vector_clock=[0], inflated=null, feeds=['b0b123...']) (InflatedEntry)) // Graph @@ -479,8 +482,9 @@ same, but node metadata might not be (order of vector clock, etc). Suspect that some details are off in the example: shouldn't the InflatedEntry authorizing a new feed include a vector clock reference to a specific seq in -that feed? Should new local (not yet authorized) feeds reference reference -their source in an initial InflatedEntry (at seq=0)? +that feed? Should new local (not yet authorized) feeds reference +their source in an initial InflatedEntry (eg, Bob at seq=0)? Should the first +InflatedEntry in a feed point to itself in it's vector clock? # Changelog [changelog]: #changelog -- cgit v1.2.3 From 89504750ce92362fa23e5573753eaa5329b490c3 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Fri, 6 Jul 2018 11:28:19 -0400 Subject: multiwriter: op-based CRDT --- proposals/0000-multiwriter.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/proposals/0000-multiwriter.md b/proposals/0000-multiwriter.md index d2e6318..e8f931a 100644 --- a/proposals/0000-multiwriter.md +++ b/proposals/0000-multiwriter.md @@ -98,7 +98,8 @@ perspective of that writer at the time they created the entry, in the form of "clock" version pointers. This metadata (a "[vector clock][vc]") can be used by other writers to resolve (or at least identify) conflicting content in the database. The technical term for this type of system is a "Conflict-free -replicated data type" ([CRDT][crdt]). +replicated data type" ([CRDT][crdt]), and specifically an "Operation-based" (as +opposed to "State-based") CRDT. [vc]: https://en.wikipedia.org/wiki/Vector_clock [crdt]: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type @@ -466,9 +467,6 @@ for being more difficult to understand and implement. # Unresolved questions [unresolved]: #unresolved-questions -What is the technical term for the specific CRDT we are using? -"Operation-based" or "State-based"? - What is the actual on-disk layout (folder structure), if not what is documented here? -- cgit v1.2.3 From a81c069501ab3c99d8129fde9e4f97958f62fe5f Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Fri, 6 Jul 2018 11:30:04 -0400 Subject: multiwriter: review feedback Thanks @mafintosh, @pfrazee --- proposals/0000-multiwriter.md | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/proposals/0000-multiwriter.md b/proposals/0000-multiwriter.md index e8f931a..8264786 100644 --- a/proposals/0000-multiwriter.md +++ b/proposals/0000-multiwriter.md @@ -112,12 +112,15 @@ A "node" is a data structure representing a database entry, including the `key`, `value`, and feed that the entry is commited to. `db.get(key)` (as described in the [hyperdb DEP][dep-hyperdb]) -returns an array of nodes. If it is unambiguous what the consistent state of a key is, -the array will have only that one value. If there is a conflict (ambiguity), -multiple nodes will be returned. If a key has been deleted from the database, a -node with the `deleted` flag will be returned; note that such "tombstone" nodes -can still have a `value` field, which may contain application-specific metadata -(such as a timestamp). +returns an array of nodes. If it is unambiguous what the consistent state of a +key is, the array will have only that one value. If there is a conflict +(ambiguity), multiple nodes will be returned. If a key has unambiguously been +removed from the database, a "null" or empty datatype is returned. If one +branch of a conflict has a deletion (but at least one of the others does not), +a node with the `deleted` flag will be returned; note that such "tombstone" +nodes can still have a `value` field, which may contain application-specific +metadata (such as a self-reported timestamp), which may help resolve the +conflict. If multiple nodes are returned from a `get`, the writer should attempt to merge the values (or chose one over the other) and write the result to the database @@ -470,9 +473,10 @@ for being more difficult to understand and implement. What is the actual on-disk layout (folder structure), if not what is documented here? -Couldn't the local feed's sequence number be skipped from vector clocks, -because it's implied by the sequence number of the hypercore entry itself? Same -with the key in the feed list (for inflated entries). +The local feed's sequence number could skipped from vector clocks, because it's +implied by the sequence number of the hypercore entry itself. Same with the key +in the feed list (for inflated entries). In both cases, the redundant data is +retained for simplicity. If there are multiple heads, but they all return the same `value` for a `get()` operation, how is it decided which `node` will be returned? AKA, values are the -- cgit v1.2.3 From eb1334949faed0238d5251a0794067caa643b713 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Fri, 6 Jul 2018 11:35:38 -0400 Subject: multiwriter on-disk folder structure --- proposals/0000-multiwriter.md | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/proposals/0000-multiwriter.md b/proposals/0000-multiwriter.md index 8264786..eb25c3d 100644 --- a/proposals/0000-multiwriter.md +++ b/proposals/0000-multiwriter.md @@ -202,10 +202,15 @@ The fields of interest for multi-writer are: self-reference to the current (local) Feed's key, always as the first element. -When serialized on disk in a SLEEP directory, the original feed is written -under `source/`. If the "local" feed is different from the original feed (aka, -local is not the "Original Writer"), it is written to `local/`. All other feeds -are written under directories prefixed `peers//`. +When serialized on disk in a SLEEP directory: + +- `source/`: the original feed, as created or cloned by this writer +- `local/`: if "local" feed is different from "source", it goes here +- `peers//`: all other writers' feed go under this directory (the + discovery key is lower-case hex-encoded) +- `content//`: if a higher-level protocol is being used that + uses multiple linked hypercore feeds (eg, hyperdrive), the linked "content" + feeds all go under this directory ## Feeds and Vector Clocks -- cgit v1.2.3 From 96eb0d32363deaa085bc133db1b6fda25cff2b55 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Fri, 6 Jul 2018 13:39:56 -0400 Subject: multiwriter as DEP-0008 --- proposals/0000-multiwriter.md | 509 ----------------------------------------- proposals/0008-multiwriter.md | 512 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 512 insertions(+), 509 deletions(-) delete mode 100644 proposals/0000-multiwriter.md create mode 100644 proposals/0008-multiwriter.md diff --git a/proposals/0000-multiwriter.md b/proposals/0000-multiwriter.md deleted file mode 100644 index eb25c3d..0000000 --- a/proposals/0000-multiwriter.md +++ /dev/null @@ -1,509 +0,0 @@ - -Title: **DEP-0000: Multi-Writer** - -Short Name: `0000-multiwriter` - -Type: Standard - -Status: Undefined (as of 2018-03-XX) - -Github PR: [Draft](https://github.com/datprotocol/DEPs/pull/10) - -Authors: -[Bryan Newbold](https://github.com/bnewbold), -[Stephen Whitmore](https://github.com/noffle), -[Mathias Buus](https://github.com/mafintosh) - - -# Summary -[summary]: #summary - -Multi-Writer is a set of schema, API, and feature extentions to allow multiple -agents (users, devices, or software) to write to the same hyperdb database. By -building on top of this abstraction layer, future versions of hyperdrive and -Dat will gain these features. - -Mechanisms for distributed consistency and granting trust are specified here; -the need for merge conflict algorithms and secure key distribution are -mentioned but specific solutions are not specified. - -This DEP forms the second half of the hyperdb specification; the first half -covered only the key/value database aspects of hyperdb. - - -# Motivation -[motivation]: #motivation - -The current hypercore/Dat ecosystem currently lacks solutions for two -fundamental use cases: - -- individual users should be able to modify distributed archives under their - control from multiple devices, at a minimum to prevent loss of control of - content if a single device (containing secret keys) is lost -- contributions from and collaboration between multiple users on a single - archive or database should be possible, with appropriate trust and access - control semantics - -Access to a single secret key is currently required to make any change to a -hypercore feed, and it is broadly considered best practice not to distribute -secret keys between multiple users or multiple devices. In fact, the current -hypercore implementation has no mechanism to resolve disputes or recover if -multiple agents used the same secret key to append to the same feed. - -Solutions to these two use cases are seen as essential for many current and -future Dat ecosystem applications. - - -# Concepts, Behavior, and Usage -[usage-documentation]: #usage-documentation - -The multi-writer features of hyperdb are implemented by storing and replicating -the contributions of each writer in a separate hypercore feed. This -specification concerns itself with the details of how changes from multiple -feeds (which may be written and replicated concurrently or asynchronously) are -securely combined to present a unified key/value interface. - -The following related concerns are explicitly left to application developers to -design and implement: - -- secure key distribution and authentication (eg, if a friend should be given - write access to a hyperdb database, how is that friend's feed key found and - verified?) -- merge conflict resolution (using the provided API), potentially using - application-layer semantics - -Before we go any further, a few definitions: - -*Feed*: A hypercore feed: an append-only log of *Entries*, which can be -arbitrary data blobs. - -*Database*: in this context, a Hyperdb key/value database. Built from several -Feeds (two Feeds per Writer). - -*Writer*: a user (or user controlled device or software agent) that has a -distinct feed with a public/private key pair, and thus the ability to append -hypercore entries and "write" changes to their version of the database. - -*Original Writer*: the writer who created a given hyperdb database in the form -of the *Original Feed*. The public key of the original feed is the one used to -reference the database as a collection of feeds (eg, for the purpose of -discovery). - -At a high level, multi-writer hyperdb works by having existing authorized -writers (starting with the original writer) include authorization of new -writers by appending metadata to their own feed which points to the new feeds -(by public key). Each entry in each writer's feed contains "clock" metadata -that records the known state of the entire database (all writers) seen from the -perspective of that writer at the time they created the entry, in the form of -"clock" version pointers. This metadata (a "[vector clock][vc]") can be used by -other writers to resolve (or at least identify) conflicting content in the -database. The technical term for this type of system is a "Conflict-free -replicated data type" ([CRDT][crdt]), and specifically an "Operation-based" (as -opposed to "State-based") CRDT. - -[vc]: https://en.wikipedia.org/wiki/Vector_clock -[crdt]: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type - - -## Core API -[api]: #api - -A "node" is a data structure representing a database entry, including the -`key`, `value`, and feed that the entry is commited to. - -`db.get(key)` (as described in the [hyperdb DEP][dep-hyperdb]) -returns an array of nodes. If it is unambiguous what the consistent state of a -key is, the array will have only that one value. If there is a conflict -(ambiguity), multiple nodes will be returned. If a key has unambiguously been -removed from the database, a "null" or empty datatype is returned. If one -branch of a conflict has a deletion (but at least one of the others does not), -a node with the `deleted` flag will be returned; note that such "tombstone" -nodes can still have a `value` field, which may contain application-specific -metadata (such as a self-reported timestamp), which may help resolve the -conflict. - -If multiple nodes are returned from a `get`, the writer should attempt to merge -the values (or chose one over the other) and write the result to the database -with `db.put(key, value)`. Client libraries can make this process more -ergonomic by accepting a helper function to select between multiple nodes. -Libraries can also offer an option to either directly return the value of a -single node (instead of the node itself), or raise an error; this is likely to -be more ergonomic for applications which do not intend to support multiple -writers per database. - -`db.authorize(key)` will write metadata to the local feed authorizing the new -feed (corresponding to `key`) to be included in the database. Once authorized, -a feed may further authorize additional feeds (recursively). - -`db.authorized(key)` (returning a boolean) indicates whether the given `key` is -an authorized writer to the hyperdb database. - -At the time of this DEP there is no mechanism for revoking authorization. - -[dep-hyperdb]: https://github.com/datprotocol/DEPs/blob/master/proposals/0004-hyperdb.md - - -## Scaling -[scaling]: #scaling - -There is some overhead associated with each "writer" added to the feed, -impacting the number of files on disk, memory use, and the computational cost -of some lookup oprations. The design should easily accomodate dozens of -writers, and should scale to 1,000 writers without too much additional -overhead. Note that a large number of writers also implies a larger number and -rate of append operations, and additional network connections, which may cause -scaling issues on their own. More real-world experience and benchmarking is -needed in this area. - - -# Implementation Details -[reference-documentation]: #reference-documentation - -The complete protobuf schemas for the hyperdb "Entry" and "InflatedEntry" -message types (as specified in the hyperdb DEP) are: - - -``` -message Entry { - required string key = 1; - optional bytes value = 2; - optional bool deleted = 3; - required bytes trie = 4; - repeated uint64 clock = 5; - optional uint64 inflate = 6; -} - -message InflatedEntry { - message Feed { - required bytes key = 1; - } - - required string key = 1; - optional bytes value = 2; - optional bool deleted = 3; - required bytes trie = 4; - repeated uint64 clock = 5; - optional uint64 inflate = 6; - repeated Feed feeds = 7; - optional bytes contentFeed = 8; -} -``` - -The fields of interest for multi-writer are: - -- `clock`: a "vector clock" to record observed state at the time of writing the - entry. Included in every Entry and InflatedEntry. -- `inflate`: a back-pointer to the entry index of the most recent InflatedEntry - (containing a feed metadata change). Included in every Entry and - InflatedEntry. Should not be included for the very first entry in a feed - (which is an InflatedEntry). -- `feeds`: list of public keys for each writer's feed. Only included in - InflatedEntry, and only when feeds have changed. Does include a - self-reference to the current (local) Feed's key, always as the first - element. - -When serialized on disk in a SLEEP directory: - -- `source/`: the original feed, as created or cloned by this writer -- `local/`: if "local" feed is different from "source", it goes here -- `peers//`: all other writers' feed go under this directory (the - discovery key is lower-case hex-encoded) -- `content//`: if a higher-level protocol is being used that - uses multiple linked hypercore feeds (eg, hyperdrive), the linked "content" - feeds all go under this directory - -## Feeds and Vector Clocks - -At any point in time, each writer has a potentially unique view of the -"current" state of the database as a whole; this is the nature of real-world -distributed systems. For example, a given write might have the most recent -appends from one peer (eg, only seconds old), but be missing much older appends -from another (eg, days or weeks out of date). By having each writer include -metadata about their percieved state of the system as a whole in operations to -their Feed, all writers are able to collectively converge on an "eventually -consistent" picture of the database as whole (this process will be described in -the next section). - -A writer's "current known state" of the database consists of the set of active -Feeds, and for each the most recent entry sequence number ("version"). This -state can be serialized as an array of integers, refered to as a [vector -clock](https://en.wikipedia.org/wiki/Vector_clock). - -Each `put()` operation on the database appends a node to the writer's `local` -feed, and contains the writer's vector clock as observed at that time. -`InflatedEntry` nodes also contain a list of all known authorized Feeds; -inflated nodes only need to be written when the Feed list changes. Every -non-inflated entry contains a pointer back to the most recent inflated entry; -inflated entries themselves contain a pointer back to the previous inflated -entry (the first inflated entry has a null pointer). Elements of a vector clock -are ordered by the Feed list from the corresponding Inflated entry. - -By convention, the order of Feed lists is to start with the writer's local -feed first, then proceed by the order in which Feeds were discovered. Note that -this ordering is not consistent across writers, only within the same feed. - -As an example, if a node (non-inflated entry) had a vector clock of `[0, 2, -5]`, that would mean: - -- when this node was written, the largest seq # in the writer's local fed was 0 -- when this node was written, the largest seq # in the second known feed was 2 -- when this node was written, the largest seq # in the third known feed was 5 - - -## Multi-Feed Aware hyperdb -[multi-aware]: #multi-aware - -The [hyperdb DEP](hyperdb-dep) specifies the procedures for lookup (`get()`) -and append (`put()`) operations to the database, as well as binary encoding -schemes for entry messages. - -Note that the trie encoding specifies pointers in a `(feed, entry)` pair -format. The `feed` integer is an index into the most recent Feed list (found in -the most recent inflated entry; see the last section). When working with a -multi-writer hyperdb database, simply look up entries in the appropriate feed, -instead of only looking in the current feed. The next section ("Consistent -History") describes which entry (or entries) to start with instead of simply -assuming the most recent entry from the local feed. - - -## Consistent History -[consist-history]: #consist-history - -The set of all appended nodes in all feeds of a hyperdb, and all the vector -clock pointers between them, forms a "directed acyclic graph" (DAG). Any node -which does not have another node pointing to it is called a "head" (this -terminology is similar to that used in git). At any point in time, an observed -copy of a database has one or more heads, each representing the top of a -tree-like graph. In the trivial case of a non-multi-writer hyperdb, there is -always only a single head: the most recent entry in the local feed. Just after -appending to the local feed, there is also always a single head, because that -node's vector clock will reference all know most recent entries from other -feeds. It is only when nodes are appended by separate writers who did not know -of the others' concurrent action (and then these changes are replicated) that -there are multiple heads. - -When operating on the database (eg, executing a `get()` operation), all heads -must be considered. The lookup proceedure documented in the [hyperdb -DEP](hyperdb-dep) must be repeated for each head, and nodes returned -representing the set of all unique values. - -The situation where a `get()` operation multiple heads returns different values -for the same key is called a "conflict" and requires a "merge" to resolve. Some -writer (either a human being or application-layer code) must decide on the -correct value for the key and write that value as a new entry (with a vector -clock that includes the previous heads). The procedure for chosing the best -value to use in a given conflict is sometimes easy to determine, but is -impossible to determine algorithmically in the general case. See the "Usage" -section for more details. - - -# Examples -[examples]: #examples - -Let's say Alice starts a new hyperdb and writes two key/value entries to it: - -``` -// Operations -Alice: db.put('/foo/bar', 'baz') -Alice: db.put('/foo/2', '{"json":3}') - -// Alice's Feed -0 (key='/foo/bar', value='baz', - vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) -1 (key='/foo/2', value='{"json":3}', - vector_clock=[0], inflated=0) - -// Graph -Alice: 0 <--- 1 -``` - -The vector clock at `seq=1` points back to `seq=0`. - -Next Alice *authorizes* Bob to write to the database. Internally, this means Alice -writes an Inflated entry to her feed that contains Bob's Feed (identified by his -public key) in her feed list. - -``` -// Operations -Alice: db.authorize('b0b123...') - -// Alice's Feed -0 (key='/foo/bar', value='baz', - vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) -1 (key='/foo/2', value='{"json":3}', - vector_clock=[0], inflated=0) -2 (key=null, value=null, - vector_clock=[1], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry) - -// Graph -Alice: 0 <--- 1 <--- 2 -``` - -Bob writes a value to his feed, and then Alice and Bob sync. The result is: - -``` -// Operations -Bob: db.put('/a/b', '12) - -// Alice's Feed -0 (key='/foo/bar', value='baz', - vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) -1 (key='/foo/2', value='{"json":3}', - vector_clock=[0], inflated=0) -2 (key=null, value=null, - vector_clock=[1], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry) - -// Bob's Feed -0 (key='/a/b', value='12', - vector_clock=[0], inflated=null, feeds=['b0b123...']) (InflatedEntry)) - -// Graph -Alice: 0 <--- 1 <--- 2 -Bob : 0 -``` - -Notice that none of Alice's entries refer to Bob's, and vice versa. Neither has -written any entries to their feeds since the two became aware of each other. -Right now there are two "heads" of the graph: Alice's feed at seq 2, and Bob's -feed at seq 0. Any `get()` operations would need to descend from both heads, -though in this situation there would be no conflicts as the keys in the two -feeds are disjoint. - -Next, Alice writes a new value, and her latest entry will refer to Bob's: - -``` -// Operations -Alice: db.put('/foo/hup', 'beep') - -// Alice's Feed -0 (key='/foo/bar', value='baz', - vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) -1 (key='/foo/2', value='{"json":3}', - vector_clock=[0], inflated=0) -2 (key=null, value=null, - vector_clock=[1, null], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry) -3 (key='/foo/hup', value='beep', - vector_clock=[2,0], inflated=2) - -// Bob's Feed -0 (key='/a/b', value='12', - vector_clock=[0], inflated=null, feeds=['b0b123...']) (InflatedEntry)) - - -// Graph -Alice: 0 <--- 1 <--- 2 <--/ 3 -Bob : 0 <-------------------/ -``` - -Alice's latest feed entry now points to Bob's latest feed entry, and there is -only one "head" in the database. This means that any `get()` operations only -need to run once, starting at `seq=3` in Alice's feed. - - -# Security and Privacy Concerns -[privacy]: #privacy - -As noted above, there is no existing mechanism for removing authorization for a -feed once added, and an authorized feed may recursively authorize additional -feeds. There is also no mechanism to restrict the scope of an authorized feed's -actions (eg, limit to only a specific path prefix). This leaves application -designers and users with few tools to control trust or access ("all or -nothing"). Care must be taken in particular if self-mutating software is being -distributed via hyperdb, or when action may be taken automatically based on the -most recent content of a database (eg, bots or even third-party tools may -publish publicly, or even take real-world action like controlling an electrical -relay). - -There is no mechanism to remove malicious history (or any history for that -matter); if an authorized (but hostile) writer appends a huge number of key -operations (bloating hyperdb metadata size), or posts offensive or illegal -content to a database, there is no way to permanently remove the data without -creating an new database. - -The read semantics of hyperdb are unchanged from hypercore: an actor does not -need to be "authorized" (for writing) to read the full history of a database, -they only need the public key. - -As noted in other DEPs, a malicious writer can potentially execute a denial of -service (DoS) attack by appending hyperdb entries that for a cyclic loop of -references. - - -# Drawbacks -[drawbacks]: #drawbacks - -Mutli-writer capability incurs a non-trivial increase in library, application, -and user experience complexity. For many applications, collaboration is an -essential feature, and the complexity is easily justified. To minimize -complexity for applications which do not need multi-writer features, -implementation authors should consider configuration modes which hide the -complexity of unused features. For example, by having an option to returning a -single node for a `get()` (and throw an error if there is a conflict), or a -flag to throw an error if a database unexpectedly contains more than a single -feed. - -Two tasks (conflict merges and secure key distribution) are left to application -developers. Both of these are Hard Problems. The current design mitigates the -former by reducing the number of merge conflicts that need to be handled by an -application (aka, only the non-trivial ones need to be handled), and -implementation authors are encouraged to provide an ergonomic API for writing -conflict resolvers. The second problem (secure key distribution) is out of -scope for this DEP. It is hoped that at least one pattern or library will -emerge from the Dat ecosystem such that each application author doesn't need to -invent a solution from scratch. - - -# Rationale and alternatives -[alternatives]: #alternatives - -Design goals for hyperdb (including the multi-writer feature) included: - -- ability to execute operations (get, put) with a sparse (partial) replication - of the database, using as few additional network requests as possible -- minimal on-disk and on-wire overhead -- implemented on top of an append-only log (to build on top of hypercore) - -If a solution for core use cases like collaboration and multi-device -synchronization is not provided at a low level (as this DEP provides), each -application will need to invent a solution at a higher level, incuring -duplicated effort and a higher risk of bugs. - -As an alternative to CRDTs, Operational Transformation (OT) has a reputation -for being more difficult to understand and implement. - - -# Unresolved questions -[unresolved]: #unresolved-questions - -What is the actual on-disk layout (folder structure), if not what is documented -here? - -The local feed's sequence number could skipped from vector clocks, because it's -implied by the sequence number of the hypercore entry itself. Same with the key -in the feed list (for inflated entries). In both cases, the redundant data is -retained for simplicity. - -If there are multiple heads, but they all return the same `value` for a `get()` -operation, how is it decided which `node` will be returned? AKA, values are the -same, but node metadata might not be (order of vector clock, etc). - -Suspect that some details are off in the example: shouldn't the InflatedEntry -authorizing a new feed include a vector clock reference to a specific seq in -that feed? Should new local (not yet authorized) feeds reference -their source in an initial InflatedEntry (eg, Bob at seq=0)? Should the first -InflatedEntry in a feed point to itself in it's vector clock? - -# Changelog -[changelog]: #changelog - -As of March 2018, Mathias Buus (@mafintosh) is leading development of a hyperdb -nodejs module on [github](https://github.com/mafintosh/hyperdb), which includes -multi-writer features and is the basis for this DEP. - -Jim Pick (@jimpick) has been an active contributor working out multi-writer details. - -- 2017-12-06: @noffle publishes `ARCHITECTURE.md` overview in the - [hyperdb github repo][arch_md] -- 2018-06-XX: First partial draft submitted for review - -[arch_md]: https://github.com/mafintosh/hyperdb/blob/master/ARCHITECTURE.md diff --git a/proposals/0008-multiwriter.md b/proposals/0008-multiwriter.md new file mode 100644 index 0000000..8047ca4 --- /dev/null +++ b/proposals/0008-multiwriter.md @@ -0,0 +1,512 @@ + +Title: **DEP-0008: Multi-Writer** + +Short Name: `0008-multiwriter` + +Type: Standard + +Status: Draft (as of 2018-07-06) + +Github PR: [Draft](https://github.com/datprotocol/DEPs/pull/10) + +Authors: +[Bryan Newbold](https://github.com/bnewbold), +[Stephen Whitmore](https://github.com/noffle), +[Mathias Buus](https://github.com/mafintosh) + + +# Summary +[summary]: #summary + +Multi-Writer is a set of schema, API, and feature extentions to allow multiple +agents (users, devices, or software) to write to the same hyperdb database. By +building on top of this abstraction layer, future versions of hyperdrive and +Dat will gain these features. + +Mechanisms for distributed consistency and granting trust are specified here; +the need for merge conflict algorithms and secure key distribution are +mentioned but specific solutions are not specified. + +This DEP forms the second half of the hyperdb specification; the first half +covered only the key/value database aspects of hyperdb. + + +# Motivation +[motivation]: #motivation + +The current hypercore/Dat ecosystem currently lacks solutions for two +fundamental use cases: + +- individual users should be able to modify distributed archives under their + control from multiple devices, at a minimum to prevent loss of control of + content if a single device (containing secret keys) is lost +- contributions from and collaboration between multiple users on a single + archive or database should be possible, with appropriate trust and access + control semantics + +Access to a single secret key is currently required to make any change to a +hypercore feed, and it is broadly considered best practice not to distribute +secret keys between multiple users or multiple devices. In fact, the current +hypercore implementation has no mechanism to resolve disputes or recover if +multiple agents used the same secret key to append to the same feed. + +Solutions to these two use cases are seen as essential for many current and +future Dat ecosystem applications. + + +# Concepts, Behavior, and Usage +[usage-documentation]: #usage-documentation + +The multi-writer features of hyperdb are implemented by storing and replicating +the contributions of each writer in a separate hypercore feed. This +specification concerns itself with the details of how changes from multiple +feeds (which may be written and replicated concurrently or asynchronously) are +securely combined to present a unified key/value interface. + +The following related concerns are explicitly left to application developers to +design and implement: + +- secure key distribution and authentication (eg, if a friend should be given + write access to a hyperdb database, how is that friend's feed key found and + verified?) +- merge conflict resolution (using the provided API), potentially using + application-layer semantics + +Before we go any further, a few definitions: + +*Feed*: A hypercore feed: an append-only log of *Entries*, which can be +arbitrary data blobs. + +*Database*: in this context, a Hyperdb key/value database. Built from several +Feeds (two Feeds per Writer). + +*Writer*: a user (or user controlled device or software agent) that has a +distinct feed with a public/private key pair, and thus the ability to append +hypercore entries and "write" changes to their version of the database. + +*Original Writer*: the writer who created a given hyperdb database in the form +of the *Original Feed*. The public key of the original feed is the one used to +reference the database as a collection of feeds (eg, for the purpose of +discovery). + +At a high level, multi-writer hyperdb works by having existing authorized +writers (starting with the original writer) include authorization of new +writers by appending metadata to their own feed which points to the new feeds +(by public key). Each entry in each writer's feed contains "clock" metadata +that records the known state of the entire database (all writers) seen from the +perspective of that writer at the time they created the entry, in the form of +"clock" version pointers. This metadata (a "[vector clock][vc]") can be used by +other writers to resolve (or at least identify) conflicting content in the +database. The technical term for this type of system is a "Conflict-free +replicated data type" ([CRDT][crdt]), and specifically an "Operation-based" (as +opposed to "State-based") CRDT. + +[vc]: https://en.wikipedia.org/wiki/Vector_clock +[crdt]: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type + + +## Core API +[api]: #api + +A "node" is a data structure representing a database entry, including the +`key`, `value`, and feed that the entry is commited to. + +`db.get(key)` (as described in the [hyperdb DEP][dep-hyperdb]) +returns an array of nodes. If it is unambiguous what the consistent state of a +key is, the array will have only that one value. If there is a conflict +(ambiguity), multiple nodes will be returned. If a key has unambiguously been +removed from the database, a "null" or empty datatype is returned. If one +branch of a conflict has a deletion (but at least one of the others does not), +a node with the `deleted` flag will be returned; note that such "tombstone" +nodes can still have a `value` field, which may contain application-specific +metadata (such as a self-reported timestamp), which may help resolve the +conflict. + +If multiple nodes are returned from a `get`, the writer should attempt to merge +the values (or chose one over the other) and write the result to the database +with `db.put(key, value)`. Client libraries can make this process more +ergonomic by accepting a helper function to select between multiple nodes. +Libraries can also offer an option to either directly return the value of a +single node (instead of the node itself), or raise an error; this is likely to +be more ergonomic for applications which do not intend to support multiple +writers per database. + +`db.authorize(key)` will write metadata to the local feed authorizing the new +feed (corresponding to `key`) to be included in the database. Once authorized, +a feed may further authorize additional feeds (recursively). + +`db.authorized(key)` (returning a boolean) indicates whether the given `key` is +an authorized writer to the hyperdb database. + +At the time of this DEP there is no mechanism for revoking authorization. + +[dep-hyperdb]: https://github.com/datprotocol/DEPs/blob/master/proposals/0004-hyperdb.md + + +## Scaling +[scaling]: #scaling + +There is some overhead associated with each "writer" added to the feed, +impacting the number of files on disk, memory use, and the computational cost +of some lookup oprations. The design should easily accomodate dozens of +writers, and should scale to 1,000 writers without too much additional +overhead. Note that a large number of writers also implies a larger number and +rate of append operations, and additional network connections, which may cause +scaling issues on their own. More real-world experience and benchmarking is +needed in this area. + + +# Implementation Details +[reference-documentation]: #reference-documentation + +The complete protobuf schemas for the hyperdb "Entry" and "InflatedEntry" +message types (as specified in the hyperdb DEP) are: + + +``` +message Entry { + required string key = 1; + optional bytes value = 2; + optional bool deleted = 3; + required bytes trie = 4; + repeated uint64 clock = 5; + optional uint64 inflate = 6; +} + +message InflatedEntry { + message Feed { + required bytes key = 1; + } + + required string key = 1; + optional bytes value = 2; + optional bool deleted = 3; + required bytes trie = 4; + repeated uint64 clock = 5; + optional uint64 inflate = 6; + repeated Feed feeds = 7; + optional bytes contentFeed = 8; +} +``` + +The fields of interest for multi-writer are: + +- `clock`: a "vector clock" to record observed state at the time of writing the + entry. Included in every Entry and InflatedEntry. +- `inflate`: a back-pointer to the entry index of the most recent InflatedEntry + (containing a feed metadata change). Included in every Entry and + InflatedEntry. Should not be included for the very first entry in a feed + (which is an InflatedEntry). +- `feeds`: list of public keys for each writer's feed. Only included in + InflatedEntry, and only when feeds have changed. Does include a + self-reference to the current (local) Feed's key, always as the first + element. + +When serialized on disk in a SLEEP directory: + +- `source/`: the original feed, as created or cloned by this writer +- `local/`: if "local" feed is different from "source", it goes here +- `peers//`: all other writers' feed go under this directory (the + discovery key is lower-case hex-encoded) +- `content//`: if a higher-level protocol is being used that + uses multiple linked hypercore feeds (eg, hyperdrive), the linked "content" + feeds all go under this directory + +## Feeds and Vector Clocks + +At any point in time, each writer has a potentially unique view of the +"current" state of the database as a whole; this is the nature of real-world +distributed systems. For example, a given write might have the most recent +appends from one peer (eg, only seconds old), but be missing much older appends +from another (eg, days or weeks out of date). By having each writer include +metadata about their percieved state of the system as a whole in operations to +their Feed, all writers are able to collectively converge on an "eventually +consistent" picture of the database as whole (this process will be described in +the next section). + +A writer's "current known state" of the database consists of the set of active +Feeds, and for each the most recent entry sequence number ("version"). This +state can be serialized as an array of integers, refered to as a [vector +clock](https://en.wikipedia.org/wiki/Vector_clock). + +Each `put()` operation on the database appends a node to the writer's `local` +feed, and contains the writer's vector clock as observed at that time. +`InflatedEntry` nodes also contain a list of all known authorized Feeds; +inflated nodes only need to be written when the Feed list changes. Every +non-inflated entry contains a pointer back to the most recent inflated entry; +inflated entries themselves contain a pointer back to the previous inflated +entry (the first inflated entry has a null pointer). Elements of a vector clock +are ordered by the Feed list from the corresponding Inflated entry. + +By convention, the order of Feed lists is to start with the writer's local +feed first, then proceed by the order in which Feeds were discovered. Note that +this ordering is not consistent across writers, only within the same feed. + +As an example, if a node (non-inflated entry) had a vector clock of `[0, 2, +5]`, that would mean: + +- when this node was written, the largest seq # in the writer's local fed was 0 +- when this node was written, the largest seq # in the second known feed was 2 +- when this node was written, the largest seq # in the third known feed was 5 + + +## Multi-Feed Aware hyperdb +[multi-aware]: #multi-aware + +The [hyperdb DEP](hyperdb-dep) specifies the procedures for lookup (`get()`) +and append (`put()`) operations to the database, as well as binary encoding +schemes for entry messages. + +Note that the trie encoding specifies pointers in a `(feed, entry)` pair +format. The `feed` integer is an index into the most recent Feed list (found in +the most recent inflated entry; see the last section). When working with a +multi-writer hyperdb database, simply look up entries in the appropriate feed, +instead of only looking in the current feed. The next section ("Consistent +History") describes which entry (or entries) to start with instead of simply +assuming the most recent entry from the local feed. + + +## Consistent History +[consist-history]: #consist-history + +The set of all appended nodes in all feeds of a hyperdb, and all the vector +clock pointers between them, forms a "directed acyclic graph" (DAG). Any node +which does not have another node pointing to it is called a "head" (this +terminology is similar to that used in git). At any point in time, an observed +copy of a database has one or more heads, each representing the top of a +tree-like graph. In the trivial case of a non-multi-writer hyperdb, there is +always only a single head: the most recent entry in the local feed. Just after +appending to the local feed, there is also always a single head, because that +node's vector clock will reference all know most recent entries from other +feeds. It is only when nodes are appended by separate writers who did not know +of the others' concurrent action (and then these changes are replicated) that +there are multiple heads. + +When operating on the database (eg, executing a `get()` operation), all heads +must be considered. The lookup proceedure documented in the [hyperdb +DEP](hyperdb-dep) must be repeated for each head, and nodes returned +representing the set of all unique values. + +The situation where a `get()` operation multiple heads returns different values +for the same key is called a "conflict" and requires a "merge" to resolve. Some +writer (either a human being or application-layer code) must decide on the +correct value for the key and write that value as a new entry (with a vector +clock that includes the previous heads). The procedure for chosing the best +value to use in a given conflict is sometimes easy to determine, but is +impossible to determine algorithmically in the general case. See the "Usage" +section for more details. + + +# Examples +[examples]: #examples + +Let's say Alice starts a new hyperdb and writes two key/value entries to it: + +``` +// Operations +Alice: db.put('/foo/bar', 'baz') +Alice: db.put('/foo/2', '{"json":3}') + +// Alice's Feed +0 (key='/foo/bar', value='baz', + vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) +1 (key='/foo/2', value='{"json":3}', + vector_clock=[0], inflated=0) + +// Graph +Alice: 0 <--- 1 +``` + +The vector clock at `seq=1` points back to `seq=0`. + +Next Alice *authorizes* Bob to write to the database. Internally, this means Alice +writes an Inflated entry to her feed that contains Bob's Feed (identified by his +public key) in her feed list. + +``` +// Operations +Alice: db.authorize('b0b123...') + +// Alice's Feed +0 (key='/foo/bar', value='baz', + vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) +1 (key='/foo/2', value='{"json":3}', + vector_clock=[0], inflated=0) +2 (key=null, value=null, + vector_clock=[1], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry) + +// Graph +Alice: 0 <--- 1 <--- 2 +``` + +Bob writes a value to his feed, and then Alice and Bob sync. The result is: + +``` +// Operations +Bob: db.put('/a/b', '12) + +// Alice's Feed +0 (key='/foo/bar', value='baz', + vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) +1 (key='/foo/2', value='{"json":3}', + vector_clock=[0], inflated=0) +2 (key=null, value=null, + vector_clock=[1], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry) + +// Bob's Feed +0 (key='/a/b', value='12', + vector_clock=[0], inflated=null, feeds=['b0b123...']) (InflatedEntry)) + +// Graph +Alice: 0 <--- 1 <--- 2 +Bob : 0 +``` + +Notice that none of Alice's entries refer to Bob's, and vice versa. Neither has +written any entries to their feeds since the two became aware of each other. +Right now there are two "heads" of the graph: Alice's feed at seq 2, and Bob's +feed at seq 0. Any `get()` operations would need to descend from both heads, +though in this situation there would be no conflicts as the keys in the two +feeds are disjoint. + +Next, Alice writes a new value, and her latest entry will refer to Bob's: + +``` +// Operations +Alice: db.put('/foo/hup', 'beep') + +// Alice's Feed +0 (key='/foo/bar', value='baz', + vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) +1 (key='/foo/2', value='{"json":3}', + vector_clock=[0], inflated=0) +2 (key=null, value=null, + vector_clock=[1, null], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry) +3 (key='/foo/hup', value='beep', + vector_clock=[2,0], inflated=2) + +// Bob's Feed +0 (key='/a/b', value='12', + vector_clock=[0], inflated=null, feeds=['b0b123...']) (InflatedEntry)) + + +// Graph +Alice: 0 <--- 1 <--- 2 <--/ 3 +Bob : 0 <-------------------/ +``` + +Alice's latest feed entry now points to Bob's latest feed entry, and there is +only one "head" in the database. This means that any `get()` operations only +need to run once, starting at `seq=3` in Alice's feed. + + +# Security and Privacy Concerns +[privacy]: #privacy + +As noted above, there is no existing mechanism for removing authorization for a +feed once added, and an authorized feed may recursively authorize additional +feeds. There is also no mechanism to restrict the scope of an authorized feed's +actions (eg, limit to only a specific path prefix). This leaves application +designers and users with few tools to control trust or access ("all or +nothing"). Care must be taken in particular if self-mutating software is being +distributed via hyperdb, or when action may be taken automatically based on the +most recent content of a database (eg, bots or even third-party tools may +publish publicly, or even take real-world action like controlling an electrical +relay). + +There is no mechanism to remove malicious history (or any history for that +matter); if an authorized (but hostile) writer appends a huge number of key +operations (bloating hyperdb metadata size), or posts offensive or illegal +content to a database, there is no way to permanently remove the data without +creating an new database. + +The read semantics of hyperdb are unchanged from hypercore: an actor does not +need to be "authorized" (for writing) to read the full history of a database, +they only need the public key. + +As noted in other DEPs, a malicious writer can potentially execute a denial of +service (DoS) attack by appending hyperdb entries that for a cyclic loop of +references. + + +# Drawbacks +[drawbacks]: #drawbacks + +Mutli-writer capability incurs a non-trivial increase in library, application, +and user experience complexity. For many applications, collaboration is an +essential feature, and the complexity is easily justified. To minimize +complexity for applications which do not need multi-writer features, +implementation authors should consider configuration modes which hide the +complexity of unused features. For example, by having an option to returning a +single node for a `get()` (and throw an error if there is a conflict), or a +flag to throw an error if a database unexpectedly contains more than a single +feed. + +Two tasks (conflict merges and secure key distribution) are left to application +developers. Both of these are Hard Problems. The current design mitigates the +former by reducing the number of merge conflicts that need to be handled by an +application (aka, only the non-trivial ones need to be handled), and +implementation authors are encouraged to provide an ergonomic API for writing +conflict resolvers. The second problem (secure key distribution) is out of +scope for this DEP. It is hoped that at least one pattern or library will +emerge from the Dat ecosystem such that each application author doesn't need to +invent a solution from scratch. + + +# Rationale and alternatives +[alternatives]: #alternatives + +Design goals for hyperdb (including the multi-writer feature) included: + +- ability to execute operations (get, put) with a sparse (partial) replication + of the database, using as few additional network requests as possible +- minimal on-disk and on-wire overhead +- implemented on top of an append-only log (to build on top of hypercore) + +If a solution for core use cases like collaboration and multi-device +synchronization is not provided at a low level (as this DEP provides), each +application will need to invent a solution at a higher level, incuring +duplicated effort and a higher risk of bugs. + +As an alternative to CRDTs, Operational Transformation (OT) has a reputation +for being more difficult to understand and implement. + + +# Unresolved questions +[unresolved]: #unresolved-questions + +What is the actual on-disk layout (folder structure), if not what is documented +here? + +The local feed's sequence number could skipped from vector clocks, because it's +implied by the sequence number of the hypercore entry itself. Same with the key +in the feed list (for inflated entries). In both cases, the redundant data is +retained for simplicity. + +If there are multiple heads, but they all return the same `value` for a `get()` +operation, how is it decided which `node` will be returned? AKA, values are the +same, but node metadata might not be (order of vector clock, etc). + +Suspect that some details are off in the example: shouldn't the InflatedEntry +authorizing a new feed include a vector clock reference to a specific seq in +that feed? Should new local (not yet authorized) feeds reference +their source in an initial InflatedEntry (eg, Bob at seq=0)? Should the first +InflatedEntry in a feed point to itself in it's vector clock? + +# Changelog +[changelog]: #changelog + +As of March 2018, Mathias Buus (@mafintosh) is leading development of a hyperdb +nodejs module on [github](https://github.com/mafintosh/hyperdb), which includes +multi-writer features and is the basis for this DEP. + +Jim Pick (@jimpick) has been an active contributor working out multi-writer details. + +- 2017-12-06: @noffle publishes `ARCHITECTURE.md` overview in the + [hyperdb github repo][arch_md] +- 2018-05-02: First round of public review +- 2018-05-23: hyperdb 3.0.0 node.js implementation released +- 2018-06-10: Second draft submitted for review +- 2018-07-06: Accepted with Draft status (after edits) + +[arch_md]: https://github.com/mafintosh/hyperdb/blob/master/ARCHITECTURE.md -- cgit v1.2.3