From 96eb0d32363deaa085bc133db1b6fda25cff2b55 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Fri, 6 Jul 2018 13:39:56 -0400 Subject: multiwriter as DEP-0008 --- proposals/0008-multiwriter.md | 512 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 512 insertions(+) create mode 100644 proposals/0008-multiwriter.md (limited to 'proposals/0008-multiwriter.md') diff --git a/proposals/0008-multiwriter.md b/proposals/0008-multiwriter.md new file mode 100644 index 0000000..8047ca4 --- /dev/null +++ b/proposals/0008-multiwriter.md @@ -0,0 +1,512 @@ + +Title: **DEP-0008: Multi-Writer** + +Short Name: `0008-multiwriter` + +Type: Standard + +Status: Draft (as of 2018-07-06) + +Github PR: [Draft](https://github.com/datprotocol/DEPs/pull/10) + +Authors: +[Bryan Newbold](https://github.com/bnewbold), +[Stephen Whitmore](https://github.com/noffle), +[Mathias Buus](https://github.com/mafintosh) + + +# Summary +[summary]: #summary + +Multi-Writer is a set of schema, API, and feature extentions to allow multiple +agents (users, devices, or software) to write to the same hyperdb database. By +building on top of this abstraction layer, future versions of hyperdrive and +Dat will gain these features. + +Mechanisms for distributed consistency and granting trust are specified here; +the need for merge conflict algorithms and secure key distribution are +mentioned but specific solutions are not specified. + +This DEP forms the second half of the hyperdb specification; the first half +covered only the key/value database aspects of hyperdb. + + +# Motivation +[motivation]: #motivation + +The current hypercore/Dat ecosystem currently lacks solutions for two +fundamental use cases: + +- individual users should be able to modify distributed archives under their + control from multiple devices, at a minimum to prevent loss of control of + content if a single device (containing secret keys) is lost +- contributions from and collaboration between multiple users on a single + archive or database should be possible, with appropriate trust and access + control semantics + +Access to a single secret key is currently required to make any change to a +hypercore feed, and it is broadly considered best practice not to distribute +secret keys between multiple users or multiple devices. In fact, the current +hypercore implementation has no mechanism to resolve disputes or recover if +multiple agents used the same secret key to append to the same feed. + +Solutions to these two use cases are seen as essential for many current and +future Dat ecosystem applications. + + +# Concepts, Behavior, and Usage +[usage-documentation]: #usage-documentation + +The multi-writer features of hyperdb are implemented by storing and replicating +the contributions of each writer in a separate hypercore feed. This +specification concerns itself with the details of how changes from multiple +feeds (which may be written and replicated concurrently or asynchronously) are +securely combined to present a unified key/value interface. + +The following related concerns are explicitly left to application developers to +design and implement: + +- secure key distribution and authentication (eg, if a friend should be given + write access to a hyperdb database, how is that friend's feed key found and + verified?) +- merge conflict resolution (using the provided API), potentially using + application-layer semantics + +Before we go any further, a few definitions: + +*Feed*: A hypercore feed: an append-only log of *Entries*, which can be +arbitrary data blobs. + +*Database*: in this context, a Hyperdb key/value database. Built from several +Feeds (two Feeds per Writer). + +*Writer*: a user (or user controlled device or software agent) that has a +distinct feed with a public/private key pair, and thus the ability to append +hypercore entries and "write" changes to their version of the database. + +*Original Writer*: the writer who created a given hyperdb database in the form +of the *Original Feed*. The public key of the original feed is the one used to +reference the database as a collection of feeds (eg, for the purpose of +discovery). + +At a high level, multi-writer hyperdb works by having existing authorized +writers (starting with the original writer) include authorization of new +writers by appending metadata to their own feed which points to the new feeds +(by public key). Each entry in each writer's feed contains "clock" metadata +that records the known state of the entire database (all writers) seen from the +perspective of that writer at the time they created the entry, in the form of +"clock" version pointers. This metadata (a "[vector clock][vc]") can be used by +other writers to resolve (or at least identify) conflicting content in the +database. The technical term for this type of system is a "Conflict-free +replicated data type" ([CRDT][crdt]), and specifically an "Operation-based" (as +opposed to "State-based") CRDT. + +[vc]: https://en.wikipedia.org/wiki/Vector_clock +[crdt]: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type + + +## Core API +[api]: #api + +A "node" is a data structure representing a database entry, including the +`key`, `value`, and feed that the entry is commited to. + +`db.get(key)` (as described in the [hyperdb DEP][dep-hyperdb]) +returns an array of nodes. If it is unambiguous what the consistent state of a +key is, the array will have only that one value. If there is a conflict +(ambiguity), multiple nodes will be returned. If a key has unambiguously been +removed from the database, a "null" or empty datatype is returned. If one +branch of a conflict has a deletion (but at least one of the others does not), +a node with the `deleted` flag will be returned; note that such "tombstone" +nodes can still have a `value` field, which may contain application-specific +metadata (such as a self-reported timestamp), which may help resolve the +conflict. + +If multiple nodes are returned from a `get`, the writer should attempt to merge +the values (or chose one over the other) and write the result to the database +with `db.put(key, value)`. Client libraries can make this process more +ergonomic by accepting a helper function to select between multiple nodes. +Libraries can also offer an option to either directly return the value of a +single node (instead of the node itself), or raise an error; this is likely to +be more ergonomic for applications which do not intend to support multiple +writers per database. + +`db.authorize(key)` will write metadata to the local feed authorizing the new +feed (corresponding to `key`) to be included in the database. Once authorized, +a feed may further authorize additional feeds (recursively). + +`db.authorized(key)` (returning a boolean) indicates whether the given `key` is +an authorized writer to the hyperdb database. + +At the time of this DEP there is no mechanism for revoking authorization. + +[dep-hyperdb]: https://github.com/datprotocol/DEPs/blob/master/proposals/0004-hyperdb.md + + +## Scaling +[scaling]: #scaling + +There is some overhead associated with each "writer" added to the feed, +impacting the number of files on disk, memory use, and the computational cost +of some lookup oprations. The design should easily accomodate dozens of +writers, and should scale to 1,000 writers without too much additional +overhead. Note that a large number of writers also implies a larger number and +rate of append operations, and additional network connections, which may cause +scaling issues on their own. More real-world experience and benchmarking is +needed in this area. + + +# Implementation Details +[reference-documentation]: #reference-documentation + +The complete protobuf schemas for the hyperdb "Entry" and "InflatedEntry" +message types (as specified in the hyperdb DEP) are: + + +``` +message Entry { + required string key = 1; + optional bytes value = 2; + optional bool deleted = 3; + required bytes trie = 4; + repeated uint64 clock = 5; + optional uint64 inflate = 6; +} + +message InflatedEntry { + message Feed { + required bytes key = 1; + } + + required string key = 1; + optional bytes value = 2; + optional bool deleted = 3; + required bytes trie = 4; + repeated uint64 clock = 5; + optional uint64 inflate = 6; + repeated Feed feeds = 7; + optional bytes contentFeed = 8; +} +``` + +The fields of interest for multi-writer are: + +- `clock`: a "vector clock" to record observed state at the time of writing the + entry. Included in every Entry and InflatedEntry. +- `inflate`: a back-pointer to the entry index of the most recent InflatedEntry + (containing a feed metadata change). Included in every Entry and + InflatedEntry. Should not be included for the very first entry in a feed + (which is an InflatedEntry). +- `feeds`: list of public keys for each writer's feed. Only included in + InflatedEntry, and only when feeds have changed. Does include a + self-reference to the current (local) Feed's key, always as the first + element. + +When serialized on disk in a SLEEP directory: + +- `source/`: the original feed, as created or cloned by this writer +- `local/`: if "local" feed is different from "source", it goes here +- `peers//`: all other writers' feed go under this directory (the + discovery key is lower-case hex-encoded) +- `content//`: if a higher-level protocol is being used that + uses multiple linked hypercore feeds (eg, hyperdrive), the linked "content" + feeds all go under this directory + +## Feeds and Vector Clocks + +At any point in time, each writer has a potentially unique view of the +"current" state of the database as a whole; this is the nature of real-world +distributed systems. For example, a given write might have the most recent +appends from one peer (eg, only seconds old), but be missing much older appends +from another (eg, days or weeks out of date). By having each writer include +metadata about their percieved state of the system as a whole in operations to +their Feed, all writers are able to collectively converge on an "eventually +consistent" picture of the database as whole (this process will be described in +the next section). + +A writer's "current known state" of the database consists of the set of active +Feeds, and for each the most recent entry sequence number ("version"). This +state can be serialized as an array of integers, refered to as a [vector +clock](https://en.wikipedia.org/wiki/Vector_clock). + +Each `put()` operation on the database appends a node to the writer's `local` +feed, and contains the writer's vector clock as observed at that time. +`InflatedEntry` nodes also contain a list of all known authorized Feeds; +inflated nodes only need to be written when the Feed list changes. Every +non-inflated entry contains a pointer back to the most recent inflated entry; +inflated entries themselves contain a pointer back to the previous inflated +entry (the first inflated entry has a null pointer). Elements of a vector clock +are ordered by the Feed list from the corresponding Inflated entry. + +By convention, the order of Feed lists is to start with the writer's local +feed first, then proceed by the order in which Feeds were discovered. Note that +this ordering is not consistent across writers, only within the same feed. + +As an example, if a node (non-inflated entry) had a vector clock of `[0, 2, +5]`, that would mean: + +- when this node was written, the largest seq # in the writer's local fed was 0 +- when this node was written, the largest seq # in the second known feed was 2 +- when this node was written, the largest seq # in the third known feed was 5 + + +## Multi-Feed Aware hyperdb +[multi-aware]: #multi-aware + +The [hyperdb DEP](hyperdb-dep) specifies the procedures for lookup (`get()`) +and append (`put()`) operations to the database, as well as binary encoding +schemes for entry messages. + +Note that the trie encoding specifies pointers in a `(feed, entry)` pair +format. The `feed` integer is an index into the most recent Feed list (found in +the most recent inflated entry; see the last section). When working with a +multi-writer hyperdb database, simply look up entries in the appropriate feed, +instead of only looking in the current feed. The next section ("Consistent +History") describes which entry (or entries) to start with instead of simply +assuming the most recent entry from the local feed. + + +## Consistent History +[consist-history]: #consist-history + +The set of all appended nodes in all feeds of a hyperdb, and all the vector +clock pointers between them, forms a "directed acyclic graph" (DAG). Any node +which does not have another node pointing to it is called a "head" (this +terminology is similar to that used in git). At any point in time, an observed +copy of a database has one or more heads, each representing the top of a +tree-like graph. In the trivial case of a non-multi-writer hyperdb, there is +always only a single head: the most recent entry in the local feed. Just after +appending to the local feed, there is also always a single head, because that +node's vector clock will reference all know most recent entries from other +feeds. It is only when nodes are appended by separate writers who did not know +of the others' concurrent action (and then these changes are replicated) that +there are multiple heads. + +When operating on the database (eg, executing a `get()` operation), all heads +must be considered. The lookup proceedure documented in the [hyperdb +DEP](hyperdb-dep) must be repeated for each head, and nodes returned +representing the set of all unique values. + +The situation where a `get()` operation multiple heads returns different values +for the same key is called a "conflict" and requires a "merge" to resolve. Some +writer (either a human being or application-layer code) must decide on the +correct value for the key and write that value as a new entry (with a vector +clock that includes the previous heads). The procedure for chosing the best +value to use in a given conflict is sometimes easy to determine, but is +impossible to determine algorithmically in the general case. See the "Usage" +section for more details. + + +# Examples +[examples]: #examples + +Let's say Alice starts a new hyperdb and writes two key/value entries to it: + +``` +// Operations +Alice: db.put('/foo/bar', 'baz') +Alice: db.put('/foo/2', '{"json":3}') + +// Alice's Feed +0 (key='/foo/bar', value='baz', + vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) +1 (key='/foo/2', value='{"json":3}', + vector_clock=[0], inflated=0) + +// Graph +Alice: 0 <--- 1 +``` + +The vector clock at `seq=1` points back to `seq=0`. + +Next Alice *authorizes* Bob to write to the database. Internally, this means Alice +writes an Inflated entry to her feed that contains Bob's Feed (identified by his +public key) in her feed list. + +``` +// Operations +Alice: db.authorize('b0b123...') + +// Alice's Feed +0 (key='/foo/bar', value='baz', + vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) +1 (key='/foo/2', value='{"json":3}', + vector_clock=[0], inflated=0) +2 (key=null, value=null, + vector_clock=[1], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry) + +// Graph +Alice: 0 <--- 1 <--- 2 +``` + +Bob writes a value to his feed, and then Alice and Bob sync. The result is: + +``` +// Operations +Bob: db.put('/a/b', '12) + +// Alice's Feed +0 (key='/foo/bar', value='baz', + vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) +1 (key='/foo/2', value='{"json":3}', + vector_clock=[0], inflated=0) +2 (key=null, value=null, + vector_clock=[1], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry) + +// Bob's Feed +0 (key='/a/b', value='12', + vector_clock=[0], inflated=null, feeds=['b0b123...']) (InflatedEntry)) + +// Graph +Alice: 0 <--- 1 <--- 2 +Bob : 0 +``` + +Notice that none of Alice's entries refer to Bob's, and vice versa. Neither has +written any entries to their feeds since the two became aware of each other. +Right now there are two "heads" of the graph: Alice's feed at seq 2, and Bob's +feed at seq 0. Any `get()` operations would need to descend from both heads, +though in this situation there would be no conflicts as the keys in the two +feeds are disjoint. + +Next, Alice writes a new value, and her latest entry will refer to Bob's: + +``` +// Operations +Alice: db.put('/foo/hup', 'beep') + +// Alice's Feed +0 (key='/foo/bar', value='baz', + vector_clock=[0], inflated=null, feeds=['a11ce...']) (InflatedEntry) +1 (key='/foo/2', value='{"json":3}', + vector_clock=[0], inflated=0) +2 (key=null, value=null, + vector_clock=[1, null], inflated=0, feeds=['a11ce...', 'b0b123...']) (InflatedEntry) +3 (key='/foo/hup', value='beep', + vector_clock=[2,0], inflated=2) + +// Bob's Feed +0 (key='/a/b', value='12', + vector_clock=[0], inflated=null, feeds=['b0b123...']) (InflatedEntry)) + + +// Graph +Alice: 0 <--- 1 <--- 2 <--/ 3 +Bob : 0 <-------------------/ +``` + +Alice's latest feed entry now points to Bob's latest feed entry, and there is +only one "head" in the database. This means that any `get()` operations only +need to run once, starting at `seq=3` in Alice's feed. + + +# Security and Privacy Concerns +[privacy]: #privacy + +As noted above, there is no existing mechanism for removing authorization for a +feed once added, and an authorized feed may recursively authorize additional +feeds. There is also no mechanism to restrict the scope of an authorized feed's +actions (eg, limit to only a specific path prefix). This leaves application +designers and users with few tools to control trust or access ("all or +nothing"). Care must be taken in particular if self-mutating software is being +distributed via hyperdb, or when action may be taken automatically based on the +most recent content of a database (eg, bots or even third-party tools may +publish publicly, or even take real-world action like controlling an electrical +relay). + +There is no mechanism to remove malicious history (or any history for that +matter); if an authorized (but hostile) writer appends a huge number of key +operations (bloating hyperdb metadata size), or posts offensive or illegal +content to a database, there is no way to permanently remove the data without +creating an new database. + +The read semantics of hyperdb are unchanged from hypercore: an actor does not +need to be "authorized" (for writing) to read the full history of a database, +they only need the public key. + +As noted in other DEPs, a malicious writer can potentially execute a denial of +service (DoS) attack by appending hyperdb entries that for a cyclic loop of +references. + + +# Drawbacks +[drawbacks]: #drawbacks + +Mutli-writer capability incurs a non-trivial increase in library, application, +and user experience complexity. For many applications, collaboration is an +essential feature, and the complexity is easily justified. To minimize +complexity for applications which do not need multi-writer features, +implementation authors should consider configuration modes which hide the +complexity of unused features. For example, by having an option to returning a +single node for a `get()` (and throw an error if there is a conflict), or a +flag to throw an error if a database unexpectedly contains more than a single +feed. + +Two tasks (conflict merges and secure key distribution) are left to application +developers. Both of these are Hard Problems. The current design mitigates the +former by reducing the number of merge conflicts that need to be handled by an +application (aka, only the non-trivial ones need to be handled), and +implementation authors are encouraged to provide an ergonomic API for writing +conflict resolvers. The second problem (secure key distribution) is out of +scope for this DEP. It is hoped that at least one pattern or library will +emerge from the Dat ecosystem such that each application author doesn't need to +invent a solution from scratch. + + +# Rationale and alternatives +[alternatives]: #alternatives + +Design goals for hyperdb (including the multi-writer feature) included: + +- ability to execute operations (get, put) with a sparse (partial) replication + of the database, using as few additional network requests as possible +- minimal on-disk and on-wire overhead +- implemented on top of an append-only log (to build on top of hypercore) + +If a solution for core use cases like collaboration and multi-device +synchronization is not provided at a low level (as this DEP provides), each +application will need to invent a solution at a higher level, incuring +duplicated effort and a higher risk of bugs. + +As an alternative to CRDTs, Operational Transformation (OT) has a reputation +for being more difficult to understand and implement. + + +# Unresolved questions +[unresolved]: #unresolved-questions + +What is the actual on-disk layout (folder structure), if not what is documented +here? + +The local feed's sequence number could skipped from vector clocks, because it's +implied by the sequence number of the hypercore entry itself. Same with the key +in the feed list (for inflated entries). In both cases, the redundant data is +retained for simplicity. + +If there are multiple heads, but they all return the same `value` for a `get()` +operation, how is it decided which `node` will be returned? AKA, values are the +same, but node metadata might not be (order of vector clock, etc). + +Suspect that some details are off in the example: shouldn't the InflatedEntry +authorizing a new feed include a vector clock reference to a specific seq in +that feed? Should new local (not yet authorized) feeds reference +their source in an initial InflatedEntry (eg, Bob at seq=0)? Should the first +InflatedEntry in a feed point to itself in it's vector clock? + +# Changelog +[changelog]: #changelog + +As of March 2018, Mathias Buus (@mafintosh) is leading development of a hyperdb +nodejs module on [github](https://github.com/mafintosh/hyperdb), which includes +multi-writer features and is the basis for this DEP. + +Jim Pick (@jimpick) has been an active contributor working out multi-writer details. + +- 2017-12-06: @noffle publishes `ARCHITECTURE.md` overview in the + [hyperdb github repo][arch_md] +- 2018-05-02: First round of public review +- 2018-05-23: hyperdb 3.0.0 node.js implementation released +- 2018-06-10: Second draft submitted for review +- 2018-07-06: Accepted with Draft status (after edits) + +[arch_md]: https://github.com/mafintosh/hyperdb/blob/master/ARCHITECTURE.md -- cgit v1.2.3