-rw-r--r-- | package.json         |   2
-rw-r--r-- | papers/dat-paper.md  | 288
-rw-r--r-- | papers/dat-paper.pdf | bin | 251426 -> 256904 bytes
3 files changed, 173 insertions, 117 deletions
diff --git a/package.json b/package.json
index 83581ab..ccdff0e 100644
--- a/package.json
+++ b/package.json
@@ -14,7 +14,7 @@
    "watch:css": "nodemon -e scss -x \"npm run build:local\"",
    "netlify": "npm run update:token && npm run build:deploy",
    "start": "budo --dir dist --pushstate",
-    "paper": "cd papers && pandoc --variable author=\"Maxwell Ogden, Karissa McKelvey, Mathias Buus\" --variable title=\"Dat - Distributed Dataset Synchronization And Versioning\" --variable date=\"DRAFT, April 2017\" --variable classoption=twocolumn --variable papersize=a4paper -s dat-paper.md -o dat-paper.pdf"
+    "paper": "cd papers && pandoc --variable author=\"Maxwell Ogden, Karissa McKelvey, Mathias Buus Madsen\" --variable title=\"Dat - Distributed Dataset Synchronization And Versioning\" --variable date=\"April 2017, Code for Science\" --variable classoption=twocolumn --variable papersize=a4paper -s dat-paper.md -o dat-paper.pdf"
  },
  "repository": {
    "type": "git",
diff --git a/papers/dat-paper.md b/papers/dat-paper.md
index 8910e57..4b6482b 100644
--- a/papers/dat-paper.md
+++ b/papers/dat-paper.md
@@ -27,11 +27,11 @@ Content integrity means being able to verify the data you received is the exact

Link rot, when links online stop resolving, and content drift, when data changes but the link to the data remains the same, are two common issues in data analysis. For example, one day a file called data.zip might change, but a typical HTTP link to the file does not include a hash of the content, or provide a way to get updated metadata, so clients that only have the HTTP link have no way to check if the file changed without downloading the entire file again. Referring to a file by the hash of its content is called content addressability, and lets users not only verify that the data they receive is the version of the data they want, but also lets people cite specific versions of the data by referring to a specific hash.

-Dat uses BLAKE2b cryptographically secure hashes to address content. Hashes are arranged in a Merkle tree, a tree where each non-leaf node is the hash of all child nodes. Leaf nodes contain pieces of the dataset. Due to the properties of secure cryptographic hashes the top hash can only be produced if all data below it matches exactly. If two trees have matching top hashes then you know that all other nodes in the tree must match as well, and you can conclude that your dataset is synchronized.
+Dat uses BLAKE2b cryptographically secure hashes to address content. Hashes are arranged in a Merkle tree, a tree where each non-leaf node is the hash of all child nodes. Leaf nodes contain pieces of the dataset. Due to the properties of secure cryptographic hashes the top hash can only be produced if all data below it matches exactly. If two trees have matching top hashes then you know that all other nodes in the tree must match as well, and you can conclude that your dataset is synchronized. Trees are chosen as the primary data structure in Dat as they have a number of properties that allow for efficient access to subsets of the metadata, which allows Dat to work efficiently over a network connection.

### Dat Links

-Dat links are Ed25519 public keys which have a length of 32 bytes (64 characters when Base64 encoded). You can represent your Dat link in the following ways and Dat clients will be able to understand them:
+Dat links are Ed25519 public keys which have a length of 32 bytes (64 characters when Hex encoded). You can represent your Dat link in the following ways and Dat clients will be able to understand them:
- The standalone public key:

@@ -87,18 +87,14 @@ In the Dat implementation we implement the above actions on top of three types o

Additional discovery networks can be implemented as needed. We chose the above three as a starting point to have a complementary mix of strategies to increase the probability of source discovery. Additionally you can specify a Dat via HTTPS link, which runs the Dat protocol in "single-source" mode, meaning the above discovery networks are not used, and instead only that one HTTPS server is used as the only peer.

-Our implementation of source discovery is called [discovery-channel](https://npmjs.org/discovery-channel). We also run a [custom DNS server](https://www.npmjs.com/package/dns-discovery) that Dat clients use (in addition to specifying their own if they need to), as well as a [DHT bootstrap](https://github.com/bittorrent/bootstrap-dht) server. These discovery servers are the only centralized infrastructure we need for Dat to work over the Internet, but they are redundant, interchangeable, never see the actual data being shared, anyone can run their own and Dat will still work even if they all are unavailable. If this happens discovery will just be manual (e.g. manually sharing IP/ports).
-
### Peer Connections

-After the discovery phase, Dat should have a list of potential data sources to try and contact. Dat uses either TCP, [UTP](https://en.wikipedia.org/wiki/Micro_Transport_Protocol), or HTTP. UTP is designed to not take up all available bandwidth on a network (e.g. so that other people sharing wifi can still use the Internet). HTTP is support for compatibility with static file servers and web browser clients. Note that these are the protocols we support in the reference Dat implementation, but the Dat protocol itself is transport agnostic.
+After the discovery phase, Dat should have a list of potential data sources to try and contact. Dat uses either TCP, [UTP](https://en.wikipedia.org/wiki/Micro_Transport_Protocol), or HTTP. UTP is designed to not take up all available bandwidth on a network (e.g. so that other people sharing wifi can still use the Internet), and is still based on UDP so works with NAT traversal techniques like UDP hole punching. HTTP is supported for compatibility with static file servers and web browser clients. Note that these are the protocols we support in the reference Dat implementation, but the Dat protocol itself is transport agnostic.

If an HTTP source is specified Dat will prefer that one over other sources. Otherwise when Dat gets the IP and port for a potential TCP or UTP source it tries to connect using both protocols. If one connects first, Dat aborts the other one. If none connect, Dat will try again until it decides that source is offline or unavailable and then stops trying to connect to it. Sources Dat is able to connect to go into a list of known good sources, so that if the Internet connection goes down Dat can use that list to reconnect to known good sources again quickly. If Dat gets a lot of potential sources it picks a handful at random to try and connect to and keeps the rest around as additional sources to use later in case it decides it needs more sources.
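This "race both transports, keep the winner" strategy is straightforward to sketch. A minimal TypeScript illustration, assuming `utp-native` exposes a net-style `connect(port, host)` (treat that exact signature as an assumption, not the authoritative API):

```ts
import * as net from 'net'

// utp-native ships no bundled types; this import and its connect()
// signature are assumptions for the purposes of this sketch.
const utp = require('utp-native')

function connectToSource (host: string, port: number): Promise<net.Socket> {
  return new Promise((resolve, reject) => {
    const tcp = net.connect(port, host)
    const utpSocket = utp.connect(port, host)
    let settled = false
    let failures = 0

    const win = (winner: net.Socket, loser: net.Socket) => {
      if (settled) return winner.destroy() // the other transport already won
      settled = true
      loser.destroy() // abort the transport that lost the race
      resolve(winner)
    }
    const lose = () => {
      if (++failures === 2 && !settled) {
        settled = true
        reject(new Error('source offline or unavailable'))
      }
    }

    tcp.once('connect', () => win(tcp, utpSocket))
    utpSocket.once('connect', () => win(utpSocket, tcp))
    tcp.once('error', lose)
    utpSocket.once('error', lose)
  })
}
```

A resolved socket here would go into the known-good-sources list described above; a rejection marks the source offline.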
-The connection logic is implemented in a module called [discovery-swarm](https://www.npmjs.com/package/discovery-swarm). This builds on discovery-channel and adds connection establishment, management and statistics. It provides statistics such as how many sources are currently connected, how many good and bad behaving sources have been talked to, and it automatically handles connecting and reconnecting to sources. UTP support is implemented in the module [utp-native](https://www.npmjs.com/package/utp-native).
-
Once a duplex binary connection to a remote source is open Dat then layers on the Hypercore protocol, a message based replication protocol that allows two peers to communicate over a stateless channel to request and exchange data. You open separate replication channels with many peers at once which allows clients to parallelize data requests across the entire pool of peers they have established connections with.

## 2.3 Network Privacy

@@ -107,7 +103,7 @@ On the Web today, with SSL, there is a guarantee that the traffic between your c

There is an inherent tradeoff in peer to peer systems of source discovery vs. user privacy. The more sources you contact and ask for some data, the more sources you trust to keep what you asked for private. Our goal is to have Dat be configurable with respect to this tradeoff to allow application developers to meet their own privacy guidelines.

-It is up to client programs to make design decisions around which discovery networks they trust. For example if a Dat client decides to use the BitTorrent DHT to discover peers, and they are searching for a publicly shared Dat key with known contents, then because of the privacy design of the BitTorrent DHT it becomes public knowledge what key that client is searching for.
+It is up to client programs to make design decisions around which discovery networks they trust. For example if a Dat client decides to use the BitTorrent DHT to discover peers, and they are searching for a publicly shared Dat key (e.g. a key cited publicly in a published scientific paper) with known contents, then because of the privacy design of the BitTorrent DHT it becomes public knowledge what key that client is searching for.

A client could choose to only use discovery networks with certain privacy guarantees. For example a client could only connect to an approved list of sources that they trust, similar to SSL. As long as they trust each source, the encryption built into the Dat network protocol will prevent the Dat key they are looking for from being leaked.

@@ -123,47 +119,19 @@ There are two types of versioning performed automatically by Dat. Metadata is st

Dat tries as much as possible to act as a one-to-one mirror of the state of a folder and all its contents. When importing files, Dat uses a sorted depth-first recursion to list all the files in the tree. For each file it finds, it grabs the filesystem metadata (filename, Stat object, etc) and checks if there is already an entry for this filename with this exact metadata already represented in the Dat repository metadata. If the file with this metadata matches exactly the newest version of the file metadata stored in Dat, then this file will be skipped (no change).

-If the metadata differs from the current existing one (or there are no entries for this filename at all in the history), then this new metadata entry will be appended as the new 'latest' version for this file in the append-only SLEEP metadata content feed (described below).
+If the metadata differs from the current existing one (or there are no entries for this filename at all in the history), then this new metadata entry will be appended as the new 'latest' version for this file in the append-only SLEEP metadata content register (described below).
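A rough sketch of that import scan, where `latestEntry` and `appendEntry` are hypothetical stand-ins for the metadata register API and the metadata comparison is simplified to size and mtime:

```ts
import * as fs from 'fs'
import * as path from 'path'

// Hypothetical metadata register accessors, standing in for the real API.
declare function latestEntry (name: string): fs.Stats | null
declare function appendEntry (name: string, stat: fs.Stats): void

// Sorted depth-first walk: append a new 'latest' entry only when the
// filesystem metadata differs from the newest version already stored.
function importFiles (dir: string) {
  for (const name of fs.readdirSync(dir).sort()) {
    const full = path.join(dir, name)
    const stat = fs.statSync(full)
    if (stat.isDirectory()) {
      importFiles(full) // depth-first recursion
      continue
    }
    const prev = latestEntry(full)
    const unchanged = prev !== null &&
      prev.size === stat.size &&
      prev.mtimeMs === stat.mtimeMs
    if (!unchanged) appendEntry(full, stat) // new 'latest' version, no deletes
  }
}
```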
### Content Versioning

-In addition to storing a historical record of filesystem metadata, the content of the files themselves are also capable of being stored in a version controlled manner. The default storage system used in Dat simply stores the files as files. This has the advantage of being very straightforward for users to understand, but the downside of not storing old versions of content by default.
+In addition to storing a historical record of filesystem metadata, the content of the files themselves is also capable of being stored in a version controlled manner. The default storage system used in Dat stores the files as files. This has the advantage of being very straightforward for users to understand, but the downside of not storing old versions of content by default.

In contrast to other version control systems like Git, Dat by default only stores the current set of checked out files on disk in the repository folder, not old versions. It does store all previous metadata for old versions in `.dat`. Git for example stores all previous content versions and all previous metadata versions in the `.git` folder. Because Dat is designed for larger datasets, if it stored all previous file versions in `.dat`, then the `.dat` folder could easily fill up the user's hard drive inadvertently. Therefore Dat has multiple storage modes based on usage.

-The default behavior is to store the current files only, but you can also run Dat using other storage provider modules. If you want to run an 'archival' node that keeps all previous versions, you can use the `content-addressable-blob-store` module instead of the default `dat-storage` module. For example, on a shared server with lots of storage you probably want to store all versions. However on a workstation machine that is only accessing a subset of one version, the default mode of storing all metadata plus the current set of downloaded files is acceptable, because you know the server has the full history.
-
-## 3. SLEEP
-
-SLEEP is the the on-disk format that Dat produces and uses. It is a set of 9 files that hold all of the metadata needed to list the contents of a Dat repository and verify the integrity of the data you receive. SLEEP is designed to work with REST, allowing servers to be plain HTTP file servers serving the static SLEEP files, meaning you can implement a Dat protocol client using HTTP with a static HTTP file server as the backend.
-
-SLEEP files contain metadata about the data inside a Dat repository, including cryptographic hashes, cryptographic signatures, filenames and file permissions. The SLEEP format is specifically engineered to allow efficient access to subsets of the metadat and/or data in the repository, even on very large repositories, which enables Dat's peer to peer networking to be fast.
-
-The acronym SLEEP is a slumber related pun on REST and stands for Syncable Lightweight Event Emitting Persistence. The Event Emitting part refers to how SLEEP files are append-only in nature, meaning they grow over time and new updates can be subscribed to as a realtime feed of events through the Dat protocol.
-
-The SLEEP version described here, used in Dat as of 2017 is SLEEP V2. SLEEP V1 is documented at http://specs.okfnlabs.org/sleep.
-
-### SLEEP Files
-
-SLEEP is a set of 9 files that should be stored with the following names. In Dat, the files are stored in a folder called `.dat` in the top level of the repository.
-
-```
-metadata.key
-metadata.signatures
-metadata.bitfield
-metadata.tree
-metadata.data
-content.key
-content.signatures
-content.bitfield
-content.tree
-```
-
-The files prefixed with `content` store metadata about the primary data in a Dat repository, for example the raw binary contents of the files. The files prefixed with `metadata` store metadata about the files in the repository, for example the filenames, file sizes, and file permissions. The `content` and `metadata` files are both Hypercore feeds, making SLEEP simply a set of two Hypercore feeds.
+Hypercore registers include an optional `data` file that stores all chunks of data. In Dat, only the `metadata.data` file is used; the `content.data` file is not. The default behavior is to store the current files only as normal files. If you want to run an 'archival' node that keeps all previous versions, you can configure Dat to use the `content.data` file instead. For example, on a shared server with lots of storage you probably want to store all versions. However on a workstation machine that is only accessing a subset of one version, the default mode of storing all metadata plus the current set of downloaded files is acceptable, because you know the server has the full history.

### Merkle Trees

-Registers in Dat use a specific method of encoding a Merkle tree where hashes are positioned by a scheme called binary in-order interval numbering or just simply "bin" numbering. This is just a specific, deterministic way of laying out the nodes in a tree. For example a tree with 7 nodes will always be arranged like this:
+Registers in Dat use a specific method of encoding a Merkle tree where hashes are positioned by a scheme called binary in-order interval numbering or just "bin" numbering. This is just a specific, deterministic way of laying out the nodes in a tree. For example a tree with 7 nodes will always be arranged like this:

```
0

@@ -201,6 +169,120 @@ For example a register with two data entries would look something like this (pse

2. hash(value1)
```

+It is possible for the in-order Merkle tree to have multiple roots at once. A root is defined as a parent node with a full set of child node slots filled below it.
+
+For example, this tree has 2 roots (1 and 4):
+
+```
+0
+  1
+2
+
+4
+```
+
+This tree has one root (3):
+
+```
+0
+  1
+2
+    3
+4
+  5
+6
+```
+
+This one has one root (1):
+
+```
+0
+  1
+2
+```
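These root positions fall out of simple arithmetic on the number of chunks. A short sketch in the spirit of the `flat-tree` module used by the reference implementation (names here are illustrative):

```ts
// Roots of a register with `leafCount` chunks: repeatedly peel off the
// largest perfect subtree that fits, working left to right.
function fullRoots (leafCount: number): number[] {
  const roots: number[] = []
  let offset = 0            // bin number where the next subtree starts
  let remaining = leafCount
  while (remaining > 0) {
    let span = 1            // leaves in the largest perfect subtree that fits
    while (span * 2 <= remaining) span *= 2
    roots.push(offset + span - 1) // its root sits span - 1 past the offset
    offset += 2 * span
    remaining -= span
  }
  return roots
}

console.log(fullRoots(2)) // [ 1 ]        -- the third example above
console.log(fullRoots(3)) // [ 1, 4 ]     -- the first example above
console.log(fullRoots(4)) // [ 3 ]        -- the 7-node tree above
console.log(fullRoots(7)) // [ 3, 9, 12 ] -- the replication example below
```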
+### Replication Example
+
+This section describes at a high level the replication flow of a Dat. Note that the low level details are available by reading the SLEEP section below. For the sake of illustrating how this works in practice in a networked replication scenario, consider a folder with two files:
+
+```
+bat.jpg
+cat.jpg
+```
+
+To send these files to another machine using Dat, you would first add them to a Dat repository by splitting them into chunks and constructing SLEEP files representing the chunks and filesystem metadata.
+
+Let's assume `cat.jpg` produces three chunks, and `bat.jpg` produces four chunks, each around 64KB. Dat stores these in a representation called SLEEP, but here we will show a pseudo-representation for the purposes of illustrating the replication process. The seven chunks get sorted into a list like this:
+
+```
+bat-1
+bat-2
+bat-3
+bat-4
+cat-1
+cat-2
+cat-3
+```
+
+These chunks then each get hashed, and the hashes get arranged into a Merkle tree (the content register):
+
+```
+0 - hash(bat-1)
+  1 - hash(0 + 2)
+2 - hash(bat-2)
+    3 - hash(1 + 5)
+4 - hash(bat-3)
+  5 - hash(4 + 6)
+6 - hash(bat-4)
+8 - hash(cat-1)
+  9 - hash(8 + 10)
+10 - hash(cat-2)
+12 - hash(cat-3)
+```
+
+Next we calculate the root hashes of our tree, in this case 3, 9, and 12. We then hash them together, and cryptographically sign the hash. This signed hash now can be used to verify all nodes in the tree, and the signature proves it was produced by us, the holder of the private key for this Dat.
+
+This tree is for the hashes of the contents of the photos. There is also a second Merkle tree that Dat generates that represents the list of files and their metadata and looks something like this (the metadata register):
+
+```
+0 - hash({contentRegister: '9e29d624...'})
+  1 - hash(0 + 2)
+2 - hash({"bat.jpg", first: 0, last: 3})
+4 - hash({"cat.jpg", first: 4, last: 6})
+```
+
+The first entry in this register is a special metadata entry that tells Dat the address of the second register (the content register). Note that node 3 is not included yet, because 3 is the hash of `1 + 5`, but 5 does not exist yet, so will be written at a later update.
+
+Now we're ready to send our metadata to the other peer. The first message is a `Register` message with the key that was shared for this Dat. Let's call ourselves Alice and the other peer Bob. Alice sends Bob a `Want` message that declares they want all nodes in the file list (the metadata register). Bob replies with a single `Have` message that indicates he has 3 chunks of data. Alice sends three `Request` messages, one for each leaf node (`0, 2, 4`). Bob sends back three `Data` messages. The first `Data` message contains the content register key, the hash of the sibling, in this case node `2`, the hash of the uncle root `4`, as well as a signature for the root hashes (in this case `1, 4`). Alice verifies the integrity of this first `Data` message by hashing the metadata entry received for the content register to produce the hash for node `0`. They then hash the hash `0` with the hash `2` that was included to reproduce hash `1`, and hash their `1` with the value for `4` they received, checking the result against the signature they received to verify it was the same data. When the next `Data` message is received, a similar process is performed to verify the content.
+
+Now Alice has the full list of files in the Dat, but decides they only want to download `cat.jpg`. Alice knows they want chunks 4 through 6 from the content register. First Alice sends another `Register` message with the content key to open a new replication channel over the connection. Then Alice sends three `Request` messages, one for each of chunks `4, 5, 6`. Bob sends back three `Data` messages with the data for each chunk, as well as the hashes needed to verify the content in a way similar to the process described above for the metadata register.
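That verification step is just a few hashes plus a signature check. A structural sketch, assuming plain BLAKE2b; the real Hypercore also mixes node types and child byte lengths into every hash, and the signature check is Ed25519 (both elided here):

```ts
import { createHash } from 'crypto'

// Truncate blake2b512 to 32 bytes purely for illustration.
const blake2b = (...parts: Buffer[]): Buffer =>
  createHash('blake2b512').update(Buffer.concat(parts)).digest().subarray(0, 32)

// Verifying the first Data message above: leaf 0 arrives together with the
// hash of sibling node 2 and of root node 4.
function verifyFirstData (
  leaf0: Buffer,          // raw data for node 0 (the content register entry)
  hash2: Buffer,          // sibling hash sent by Bob
  hash4: Buffer,          // root hash sent by Bob
  checkSignature: (rootsHash: Buffer) => boolean // assumed Ed25519 verifier
): boolean {
  const hash0 = blake2b(leaf0)        // produce node 0's hash ourselves
  const hash1 = blake2b(hash0, hash2) // combine with sibling -> parent 1
  return checkSignature(blake2b(hash1, hash4)) // roots 1 and 4 were signed
}
```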
+## 3. SLEEP Specification
+
+This section is a technical description of the SLEEP format intended for implementers. SLEEP is the on-disk format that Dat produces and uses. It is a set of 9 files that hold all of the metadata needed to list the contents of a Dat repository and verify the integrity of the data you receive. SLEEP is designed to work with REST, allowing servers to be plain HTTP file servers serving the static SLEEP files, meaning you can implement a Dat protocol client using HTTP with a static HTTP file server as the backend.
+
+SLEEP files contain metadata about the data inside a Dat repository, including cryptographic hashes, cryptographic signatures, filenames and file permissions. The SLEEP format is specifically engineered to allow efficient access to subsets of the metadata and/or data in the repository, even on very large repositories, which enables Dat's peer to peer networking to be fast.
+
+The acronym SLEEP is a slumber related pun on REST and stands for Syncable Lightweight Event Emitting Persistence. The Event Emitting part refers to how SLEEP files are append-only in nature, meaning they grow over time and new updates can be subscribed to as a realtime feed of events through the Dat protocol.
+
+The SLEEP version described here, used in Dat as of 2017, is SLEEP V2. SLEEP V1 is documented at http://specs.okfnlabs.org/sleep.
+
+### SLEEP Files
+
+SLEEP is a set of 9 files that should be stored with the following names. In Dat, the files are stored in a folder called `.dat` in the top level of the repository.
+
+```
+metadata.key
+metadata.signatures
+metadata.bitfield
+metadata.tree
+metadata.data
+content.key
+content.signatures
+content.bitfield
+content.tree
+```
+
+The files prefixed with `content` store metadata about the primary data in a Dat repository, for example the raw binary contents of the files. The files prefixed with `metadata` store metadata about the files in the repository, for example the filenames, file sizes, and file permissions. The `content` and `metadata` files are both Hypercore registers, making SLEEP a set of two Hypercore registers.
+
### SLEEP File Headers

The following structured binary format is used for `signatures`, `bitfield`, and `tree` files. The header contains metadata as well as information needed to decode the rest of the files after the header. SLEEP files are designed to be easy to append new data to at the end, easy to read arbitrary byte offsets in the middle, and are relatively flat, simple files that rely on the filesystem for the heavy lifting.

@@ -231,7 +313,7 @@ BLAKE2b

To calculate the offset of some entry position, first read the header and get the entry size, then do `32 + entrySize * entryIndex`. To calculate how many entries are in a file, you can use the entry size and the filesize on disk and do `(fileSize - 32) / entrySize`.

-As mentioned above, `signatures`, `bitfield` and `tree` are the three SLEEP files. There are two additional files, `key`, and `data`, which do not contain SLEEP file headers and store plain serialized data for easy access. `key` stores the public key that is described by the `signatures` file, and `data` stores the raw block data that the `tree` file contains the hashes and metadata for.
+As mentioned above, `signatures`, `bitfield` and `tree` are the three SLEEP files. There are two additional files, `key`, and `data`, which do not contain SLEEP file headers and store plain serialized data for easy access. `key` stores the public key that is described by the `signatures` file, and `data` stores the raw chunk data that the `tree` file contains the hashes and metadata for.
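That offset math translates directly into code. A minimal sketch, assuming the 2 byte big-endian entry size sits at byte offset 5 of the 32 byte header (after the 4 byte magic number and 1 byte version):

```ts
import { promises as fs } from 'fs'

// Read entry `entryIndex` out of a SLEEP file such as metadata.tree.
async function sleepEntry (path: string, entryIndex: number): Promise<Buffer> {
  const fd = await fs.open(path, 'r')
  try {
    const header = Buffer.alloc(32)
    await fd.read(header, 0, 32, 0)
    const entrySize = header.readUInt16BE(5) // e.g. 40 for a tree file

    const { size: fileSize } = await fd.stat()
    const entryCount = (fileSize - 32) / entrySize
    if (entryIndex >= entryCount) throw new RangeError('no such entry')

    const entry = Buffer.alloc(entrySize)
    await fd.read(entry, 0, entrySize, 32 + entrySize * entryIndex)
    return entry
  } finally {
    await fd.close()
  }
}
```

Because the files are flat and append-only, the same two formulas work equally well against a local file or an HTTP range request.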
### File Descriptions

@@ -315,47 +397,13 @@ Then we take this 32 byte hash and write it to the tree as 40 bytes like this:

left child length + right child length
```

-The reason the tree entries contain data lengths is to allow for sparse mode replication. Encoding lengths (and including lengths in all hashes) means you can verify the merkle subtrees independent of the rest of the tree, which happens during sparse replication scenarios (diagram would be useful here).
+The reason the tree entries contain data lengths is to allow for sparse mode replication. Encoding lengths (and including lengths in all hashes) means you can verify the Merkle subtrees independent of the rest of the tree, which happens during sparse replication scenarios. The tree file corresponds directly to the `data` file.

-##### Merkle roots
-
-It is possible for the in-order Merkle tree to have multiple roots at once. A root is defined as a parent node with a full set of child node slots filled below it.
-
-For example, this tree hash 2 roots (1 and 4)
-
-```
-0
-  1
-2
-
-4
-```
-
-This tree hash one root (3):
-
-```
-0
-  1
-2
-    3
-4
-  5
-6
-```
-
-This one has one root (1):
-
-```
-0
-  1
-2
-```
-

#### data

-The `data` file is only included in the SLEEP format for the `metadata.*` prefixed files which contains filesystem metadata and not actual file data. For the `content.*` files, the data is stored externally (in Dat it is stored as normal files on the filesystem and not in a SLEEP file). However you can configure Dat to use a `content.data` file if you want and it will still work.
+The `data` file is only included in the SLEEP format for the `metadata.*` prefixed files which contain filesystem metadata and not actual file data. For the `content.*` files, the data is stored externally (in Dat it is stored as normal files on the filesystem and not in a SLEEP file). However you can configure Dat to use a `content.data` file if you want and it will still work. If you want to store the full history of all versions of all files, using the `content.data` file would provide that guarantee, but would have the disadvantage of storing files as chunks merged into one huge file (not as user friendly).

The `data` file does not contain a SLEEP file header. It just contains a bunch of concatenated data entries. Entries are written in the same order as they appear in the `tree` file. To read a `data` file, first decode the `tree` file and for every leaf in the `tree` file you can calculate a data offset for the data described by that leaf node in the `data` file.
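For illustration, here is a naive way to derive a chunk's starting offset in `data` from the lengths stored in `tree`, using the 40 byte entry layout (32 byte hash, then an 8 byte Uint64BE length). It is O(index); the real implementation instead uses the aggregate lengths stored in parent nodes, as the Byte Lookup steps below describe:

```ts
import { promises as fs } from 'fs'

// Where does chunk `index` start inside the data file? Sum the lengths of
// all earlier leaves. Leaf i lives at tree entry 2i (bin numbering).
async function chunkOffset (treePath: string, index: number): Promise<number> {
  const fd = await fs.open(treePath, 'r')
  try {
    const entry = Buffer.alloc(40)
    let offset = 0
    for (let i = 0; i < index; i++) {
      await fd.read(entry, 0, 40, 32 + 40 * (2 * i)) // skip 32 byte header
      offset += Number(entry.readBigUInt64BE(32))    // length follows the hash
    }
    return offset
  } finally {
    await fd.close()
  }
}
```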
@@ -373,16 +421,16 @@ For example, if we wanted to seek to a specific entry offset (say entry 42):

##### Byte Lookup

-The above method illustrates how to resolve a block position index to a byte offset. You can also do the reverse operation, resolving a byte offset to a block position index. This is used to stream arbitrary random access regions of files in sparse replication scenarios.
+The above method illustrates how to resolve a chunk position index to a byte offset. You can also do the reverse operation, resolving a byte offset to a chunk position index. This is used to stream arbitrary random access regions of files in sparse replication scenarios.

- First, you start by calculating the current Merkle roots
- Each node in the tree (including these root nodes) stores the aggregate file size of all byte sizes of the nodes below it. So the roots cumulatively will describe all possible byte ranges for this repository.
- Find the root that contains the byte range of the offset you are looking for and get the node information for all of that nodes children using the Index Lookup method, and recursively repeat this step until you find the lowest down child node that describes this byte range.
-- The block described by this child node will contain the byte range you are looking for. You can use the `byteOffset` property in the `Stat` metadata object to seek into the right position in the content for the start of this block.
+- The chunk described by this child node will contain the byte range you are looking for. You can use the `byteOffset` property in the `Stat` metadata object to seek into the right position in the content for the start of this chunk.

##### Metadata Overhead

-Using this scheme, if you write 4GB of data using on average 64KB data chunks (note: chunks can be variable length and do not need to be the same size), your tree file will be around 5MB (0.00125% overhead).
+Using this scheme, if you write 4GB of data using on average 64KB data chunks (note: chunks can be variable length and do not need to be the same size), your tree file will be around 5MB (0.125% overhead).

#### signatures

@@ -407,7 +455,7 @@ Ed25519 sign(
  BLAKE2b(
    <1 byte>
      2 // root type
-    for (every root node) {
+    for (every root node left-to-right) {
      <32 byte root hash>
      <8 byte Uint64BE root tree index>
      <8 byte Uint64BE child byte lengths>

@@ -443,7 +491,7 @@ Each entry contains three objects:

- Tree Bitfield (2048 bytes) - 1 bit for every tree entry (all nodes in `tree`)
- Bitfield Index (256 bytes) - This is an index of the Data Bitfield that makes it efficient to figure out which pieces of data are missing from the Data Bitfield without having to do a linear scan.

-The Data Bitfield is 1Kb somewhat arbitrarily, but the idea is that because most filesystems work in 4Kb block sizes, we can fit the Data, Tree and Index in less then 4Kb of data for efficient writes to the filesystem. The Tree and Index sizes are based on the Data size (the Tree has twice the entries as the Data, odd and even nodes vs just even nodes in `tree`, and Index is always 1/4th the size).
+The Data Bitfield is 1Kb somewhat arbitrarily, but the idea is that because most filesystems work in 4Kb chunk sizes, we can fit the Data, Tree and Index in less than 4Kb of data for efficient writes to the filesystem. The Tree and Index sizes are based on the Data size (the Tree has twice the entries as the Data, odd and even nodes vs just even nodes in `tree`, and Index is always 1/4th the size).

To generate the Index, you take pairs of 2 bytes at a time from the Data Bitfield, check if all bits in the 2 bytes are the same, and generate 4 bits of Index metadata for every 2 bytes of Data (hence how 1024 bytes of Data ends up as 256 bytes of Index).

@@ -455,8 +503,6 @@ if (data is all 0's) then [0,0]

if (data is not all the same) then [1, 0]
```

-In the above scheme, the first bit means all bits in the data byte are the same, second bit states which bit they are. Note that `[0, 1]` is unused/reserved for future use.
-
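A sketch of just this leaf-level tuple encoding (arranging the tuples into the Index's tree shape is described next):

```ts
// Every 2 bytes of the Data Bitfield compress into a 2-bit tuple using the
// scheme above: [1,1] all ones, [0,0] all zeroes, [1,0] not all the same.
function indexTuple (a: number, b: number): number {
  if (a === 0xff && b === 0xff) return 0b11
  if (a === 0x00 && b === 0x00) return 0b00
  return 0b10
}

// 1024 bytes of Data Bitfield -> 256 bytes of Index leaves (4:1 ratio).
function buildIndexLeaves (dataBitfield: Buffer): Buffer {
  const index = Buffer.alloc(dataBitfield.length / 4)
  for (let i = 0; i < dataBitfield.length; i += 2) {
    const tuple = indexTuple(dataBitfield[i], dataBitfield[i + 1])
    const byte = i >> 2            // two 2-bit tuples pack into each nibble
    const shift = (i & 2) ? 0 : 4  // high nibble first
    index[byte] |= tuple << shift
  }
  return index
}
```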
The Index itself is an in-order binary tree, not a traditional bitfield. To generate the tree, you take the tuples you generate above and then write them into a tree like the following example, where non-leaf nodes are generated using the above scheme by looking at the results of the relative even child tuples for each odd parent tuple:

```

@@ -489,7 +535,7 @@ If you write 4GB of data using on average 64KB data chunk size, your bitfield wi

This file is used to store content described by the rest of the `metadata.*` hypercore SLEEP files. Whereas the `content.*` SLEEP files describe the data stored in the actual data cloned in the Dat repository filesystem, the `metadata` data feed is stored inside the `.dat` folder along with the rest of the SLEEP files.

-The contents of this file is a series of versions of the Dat filesystem tree. As this is a hypercore data feed, it's just an append only log of binary data entries. The challenge is representing a tree in an one dimensional way (append only log). For example, imagine three files:
+The contents of this file are a series of versions of the Dat filesystem tree. As this is a hypercore data feed, it's just an append only log of binary data entries. The challenge is representing a tree in a one dimensional way to make it representable as a Hypercore register. For example, imagine three files:

```
~/dataset $ ls

@@ -525,13 +571,13 @@ To do this, we convert the filesystem metadata into entries in a feed like this:

##### Filename Resolution

-Each sequence represents adding one of the files to the feed, so at sequence 0 the filesystem state only has a single file, `results.csv` in it. At sequence 1, there are only 2 files added to the feed, and at sequence 3 all files are finally added. The `children` field represents a shorthand way of declaring which other files at every level of the directory hierarchy exist alongside the file being added at that revision. For example at the time of sequence 1, children is `[[0], []]`. The first sub-array, `[0]`, represents the first folder in the `path`, which is the root folder `/`. In this case `[0]` means the root folder at this point in time only has a single file, the file that is the subject of sequence `0`. The second subarray is empty `[]` because there are no other existing files in the second folder in the `path`, `figures`.
+Each sequence represents adding one of the files to the register, so at sequence 0 the filesystem state only has a single file, `results.csv` in it. At sequence 1, there are only 2 files added to the register, and at sequence 2 all files are finally added. The `children` field represents a shorthand way of declaring which other files at every level of the directory hierarchy exist alongside the file being added at that revision. For example at the time of sequence 1, children is `[[0], []]`. The first sub-array, `[0]`, represents the first folder in the `path`, which is the root folder `/`. In this case `[0]` means the root folder at this point in time only has a single file, the file that is the subject of sequence `0`. The second subarray is empty `[]` because there are no other existing files in the second folder in the `path`, `figures`.

To look up a file by filename, you fetch the latest entry in the log, then use the `children` metadata in that entry to look up the longest common ancestor based on the parent folders of the filename you are querying. You can then recursively repeat this operation until you find the `path` entry you are looking for (or you exhaust all options which means the file does not exist). This is an `O(number of slashes in your path)` operation.
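Sketched in code, with `Entry` and `getEntry` as hypothetical stand-ins for decoded Node messages read out of the metadata register:

```ts
interface Entry { path: string; children: number[][] }
declare function getEntry (seq: number): Entry // hypothetical register read

function lookup (target: string, headSeq: number): Entry | null {
  let entry = getEntry(headSeq) // start from the latest entry
  while (entry.path !== target) {
    const a = entry.path.split('/')
    const b = target.split('/')
    // depth of the longest common ancestor folder of the two paths
    let depth = 0
    const max = Math.min(a.length, b.length) - 1
    while (depth + 1 < max && a[depth + 1] === b[depth + 1]) depth++
    // follow a sibling recorded at that depth that shares more of the path
    const prefix = b.slice(0, depth + 2).join('/')
    const closer = (entry.children[depth] ?? [])
      .map(getEntry)
      .find(e => e.path === prefix || e.path.startsWith(prefix + '/'))
    if (!closer) return null // exhausted all options: file does not exist
    entry = closer
  }
  return entry
}
```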
-For example, if you wanted to look up `/results.csv` given the above feed, you would start by grabbing the metadata at sequence 2. The longest common ancestor between `/results.csv` and `/figures/graph2` is `/`. You then grab the corresponding entry in the children array for `/`, which in this case is the first entry, `[0]`. You then repeat this with all of the chilren entries until you find a child that is closer to the entry you are looking for. In this example, the first entry happens to be the match we are looking for.
+For example, if you wanted to look up `/results.csv` given the above register, you would start by grabbing the metadata at sequence 2. The longest common ancestor between `/results.csv` and `/figures/graph2` is `/`. You then grab the corresponding entry in the children array for `/`, which in this case is the first entry, `[0]`. You then repeat this with all of the children entries until you find a child that is closer to the entry you are looking for. In this example, the first entry happens to be the match we are looking for.

-You can also perform lookups relative to a point in time by starting from a specific sequence number in the feed. For example to get the state of some file relative to an old sequence number, similar to checking out an old version of a repository in Git.
+You can also perform lookups relative to a point in time by starting from a specific sequence number in the register, for example to get the state of some file relative to an old sequence number, similar to checking out an old version of a repository in Git.

##### Data Serialization

The format of the `metadata.data` file is as follows:

```
<Header>
<Node 1>
<Node 2>
...
<Node N>
```

-Each entry in the feed is encoded using Protocol Buffers.
+Each entry in the file is encoded using Protocol Buffers.

-The first message we write to the feed is of a type called Header which uses this schema:
+The first message we write to the file is of a type called Header which uses this schema:

```
message Header {

@@ -556,9 +602,9 @@
}
```

-This is used to declare two pieces of metadata used by Dat. It includes a `type` string with the value `hyperdrive` and `content` binary value that holds the public key of the content feed that this metadata feed represents. When you share a Dat, the metadata key is the main key that gets used, and the content feed key is linked from here in the metadata.
+This is used to declare two pieces of metadata used by Dat. It includes a `type` string with the value `hyperdrive` and a `content` binary value that holds the public key of the content register that this metadata register represents. When you share a Dat, the metadata key is the main key that gets used, and the content register key is linked from here in the metadata.
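Encoding that first entry is short with any Protocol Buffers library, for example `protobufjs`; the schema literal below is reconstructed from this section's prose, so treat the field numbers as assumptions:

```ts
import * as protobuf from 'protobufjs'

// Header message as described above: a type string plus the public key of
// the content register (field numbering is an assumption of this sketch).
const { root } = protobuf.parse(`
  message Header {
    required string type = 1;
    optional bytes content = 2;
  }
`)
const Header = root.lookupType('Header')

function encodeHeader (contentRegisterKey: Uint8Array): Uint8Array {
  const msg = Header.create({ type: 'hyperdrive', content: contentRegisterKey })
  return Header.encode(msg).finish()
}
```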
-After the header the feed will contain many filesystem `Node` entries:
+After the header the file will contain many filesystem `Node` entries:

```
message Node {
  required string path = 1;
  optional Stat value = 2;
  optional bytes children = 3;
}

@@ -609,9 +655,9 @@ These are the field definitions:

- `uid` - posix user id
- `gid` - posix group id
- `size` - file size in bytes
-- `blocks` - number of data feed entries that make up this file
-- `offset` - the data feed entry index for the first block in this file
-- `byteOffset` - the data feed file byte offset for the first block in this file
+- `blocks` - number of data chunks that make up this file
+- `offset` - the data feed entry index for the first chunk in this file
+- `byteOffset` - the data feed file byte offset for the first chunk in this file
- `mtime` - posix modified_at time
- `ctime` - posix created_at time

@@ -619,7 +665,7 @@

The SLEEP format is designed to allow for sparse replication, meaning you can efficiently download only the metadata and data required to resolve a single byte region of a single file, which makes Dat suitable for a wide variety of streaming, real time and large dataset use cases.

-To take advantage of this, Dat includes a network protocol. It is message based and stateless, making it possible to implement on a variety of network transport protocols including UDP and TCP. Both metadata and content feeds in SLEEP share the exact same replication protocol.
+To take advantage of this, Dat includes a network protocol. It is message based and stateless, making it possible to implement on a variety of network transport protocols including UDP and TCP. Both metadata and content registers in SLEEP share the exact same replication protocol.

Individual messages are encoded using Protocol Buffers and there are ten message types using the following schema:

@@ -637,15 +683,15 @@ The `header` value is a single varint that has two pieces of information, the in

To generate this varint, you bitshift the 4-bit type integer onto the end of the channel identifier, e.g. `channel << 4 | <4-bit-type>`.
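In code the framing math is trivial (the varint encoding itself and the encryption layer are out of scope here):

```ts
// Pack a channel id and a 4-bit message type into one integer, and back.
function encodeMessageHeader (channel: number, type: number): number {
  return channel << 4 | type
}

function decodeMessageHeader (header: number): { channel: number; type: number } {
  return { channel: header >> 4, type: header & 0b1111 }
}

// channel 1, type 3 (Have) => 19
console.log(encodeMessageHeader(1, 3), decodeMessageHeader(19))
```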
-### Feed
+### Register

Type 0, should be the first message sent on a channel.

- `discoveryKey` - A BLAKE2b keyed hash of the string 'hypercore' using the public key of the metadata register as the key.
- `nonce` - 32 bytes of random binary data, used in our encryption scheme

```
-message Feed {
+message Register {
  required bytes discoveryKey = 1;
  optional bytes nonce = 2;
}

@@ -653,7 +699,7 @@

### Handshake

-Type 1. Overall connection handshake. Should be sent just after the feed message on the first channel only (metadata).
+Type 1. Overall connection handshake. Should be sent just after the register message on the first channel only (metadata).

- `id` - 32 byte random data used as an identifier for this peer on the network, useful for checking if you are connected to yourself or another peer more than once
- `live` - Whether or not you want to operate in live (continuous) replication mode or end after the initial sync

@@ -680,9 +726,9 @@ message Status {

### Have

-Type 3. How you tell the other peer what blocks of data you have or don't have. You should only send Have messages to peers who have expressed interest in this region with Want messages.
+Type 3. How you tell the other peer what chunks of data you have or don't have. You should only send Have messages to peers who have expressed interest in this region with Want messages.

-- `start` - If you only specify `start`, it means you are telling the other side you only have 1 block at the position at the value in `start`.
+- `start` - If you only specify `start`, it means you are telling the other side you only have 1 chunk at the position at the value in `start`.
- `length` - If you specify length, you can describe a range of values that you have all of, starting from `start`.
- `bitfield` - If you would like to send a range of sparse data about haves/don't haves via bitfield, relative to `start`.

@@ -715,7 +761,7 @@ uncompressed-sequence = varint(

### Unhave

-Type 4. How you communicate that you deleted or removed a block you used to have.
+Type 4. How you communicate that you deleted or removed a chunk you used to have.

```
message Unhave {

@@ -727,7 +773,7 @@

### Want

-Type 5. How you ask the other peer to subscribe you to Have messages for a region of blocks. The `length` value defaults to Infinity or feed.length (if not live).
+Type 5. How you ask the other peer to subscribe you to Have messages for a region of chunks. The `length` value defaults to Infinity or feed.length (if not live).

```
message Want {

@@ -738,7 +784,7 @@

### Unwant

-Type 6. How you ask to unsubscribe from Have messages for a region of blocks from the other peer. You should only Unwant previously Wanted regions, but if you do Unwant something that hasn't been Wanted it won't have any effect. The `length` value defaults to Infinity or feed.length (if not live).
+Type 6. How you ask to unsubscribe from Have messages for a region of chunks from the other peer. You should only Unwant previously Wanted regions, but if you do Unwant something that hasn't been Wanted it won't have any effect. The `length` value defaults to Infinity or feed.length (if not live).

```
message Unwant {

@@ -749,11 +795,11 @@

### Request

-Type 7. Request a single block of data.
+Type 7. Request a single chunk of data.

-- `index` - The block index for the block you want. You should only ask for indexes that you have received the Have messages for.
-- `bytes` - You can also optimistically specify a byte offset, and in the case the remote is able to resolve the block for this byte offset depending on their Merkle tree state, they will ignore the `index` and send the block that resolves for this byte offset instead. But if they cannot resolve the byte request, `index` will be used.
-- `hash` - If you only want the hash of the block and not the block data itself.
+- `index` - The chunk index for the chunk you want. You should only ask for indexes that you have received the Have messages for.
+- `bytes` - You can also optimistically specify a byte offset, and in the case the remote is able to resolve the chunk for this byte offset depending on their Merkle tree state, they will ignore the `index` and send the chunk that resolves for this byte offset instead. But if they cannot resolve the byte request, `index` will be used.
+- `hash` - If you only want the hash of the chunk and not the chunk data itself.
- `nodes` - A 64 bit long bitfield representing which parent nodes you have.

The `nodes` bitfield is an optional optimization to reduce the amount of duplicate nodes exchanged during the replication lifecycle. It indicates which parents you have or don't have. You have a maximum of 64 parents you can specify. Because `uint64` in Protocol Buffers is implemented as a varint, over the wire this does not take up 64 bits in most cases. The first bit is reserved to signify whether or not you need a signature in response. The rest of the bits represent whether or not you have (`1`) or don't have (`0`) the information at this node already. The ordering is determined by walking parent, then sibling, up the tree all the way to the root.
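One plausible reading of that ordering rule, sketched with flat-tree style helpers; `haveNode` is a hypothetical accessor for local tree state, and the exact bit order in the reference implementation may differ:

```ts
// Bin numbering helpers (see the bin numbering sketch earlier).
const depth = (i: number) => { let d = 0; while (i & 1) { i >>= 1; d++ } return d }
const treeIndex = (d: number, o: number) => ((2 * o + 1) << d) - 1
const treeOffset = (i: number) => (((i + 1) >> depth(i)) - 1) / 2
const parent = (i: number) => treeIndex(depth(i) + 1, treeOffset(i) >> 1)
const sibling = (i: number) => treeIndex(depth(i), treeOffset(i) ^ 1)

declare function haveNode (index: number): boolean // hypothetical local state

// Build the Request.nodes value: bit 0 asks for a signature, later bits
// record have/don't-have while walking parent, then sibling, up to the root.
function requestNodes (chunkIndex: number, rootIndex: number, wantSignature: boolean): bigint {
  let bits = wantSignature ? 1n : 0n
  let bit = 1n
  for (let node = parent(2 * chunkIndex); node !== rootIndex; node = parent(node)) {
    if (haveNode(node)) bits |= 1n << bit
    bit++
    if (haveNode(sibling(node))) bits |= 1n << bit
    bit++
  }
  return bits // sent as a protobuf uint64, i.e. a varint on the wire
}
```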
@@ -781,13 +827,13 @@ message Cancel {

### Data

-Type 9. Sends a single block of data to the other peer. You can send it in response to a Request or unsolicited on it's own as a friendly gift. The data includes all of the Merkle tree parent nodes needed to verify the hash chain all the way up to the Merkle roots for this block. Because you can produce the direct parents by hashing the block, only the roots and 'uncle' hashes are included (the siblings to all of the parent nodes).
+Type 9. Sends a single chunk of data to the other peer. You can send it in response to a Request or unsolicited on its own as a friendly gift. The data includes all of the Merkle tree parent nodes needed to verify the hash chain all the way up to the Merkle roots for this chunk. Because you can produce the direct parents by hashing the chunk, only the roots and 'uncle' hashes are included (the siblings to all of the parent nodes).

-- `index` - The block position for this block.
-- `value` - The block binary data. Empty if you are sending only the hash.
-- `Node.index` - The index for this block in in-order
-- `Node.hash` - The hash of this block
-- `Node.size`- The aggregate block size for all children below this node (The sum of all block sizes of all children)
+- `index` - The chunk position for this chunk.
+- `value` - The chunk binary data. Empty if you are sending only the hash.
+- `Node.index` - The index for this chunk in in-order notation
+- `Node.hash` - The hash of this chunk
+- `Node.size` - The aggregate chunk size for all children below this node (the sum of all chunk sizes of all children)
- `signature` - If you are sending a root node, all root nodes must have the signature included.

@@ -856,4 +902,14 @@ IPFS is a family of application and network protocols that have peer to peer fil

The UK Government Digital Service have developed the concept of a register which they define as a digital public ledger you can trust. In the UK government registers are beginning to be piloted as a way to expose essential open data sets in a way where consumers can verify the data has not been tampered with, and allows the data publishers to update their data sets over time.

-The design of registers was inspired by the infrastructure backing the Certificate Transparency project, initated at Google, which provides a service on top of SSL certificates that enables service providers to write certificates to a distributed public ledger. Anyone client or service provider can verify if a certificate they received is in the ledger, which protects against so called "rogue certificates".
\ No newline at end of file
+The design of registers was inspired by the infrastructure backing the Certificate Transparency project, initiated at Google, which provides a service on top of SSL certificates that enables service providers to write certificates to a distributed public ledger. Any client or service provider can verify if a certificate they received is in the ledger, which protects against so called "rogue certificates".
+
+## 6. Reference Implementation
+
+Our implementation of source discovery is called [discovery-channel](https://npmjs.org/discovery-channel). We also run a [custom DNS server](https://www.npmjs.com/package/dns-discovery) that Dat clients use (in addition to specifying their own if they need to), as well as a [DHT bootstrap](https://github.com/bittorrent/bootstrap-dht) server. These discovery servers are the only centralized infrastructure we need for Dat to work over the Internet, but they are redundant, interchangeable, never see the actual data being shared, anyone can run their own, and Dat will still work even if they all are unavailable. If this happens discovery will just be manual (e.g. manually sharing IP/ports).
+
+The connection logic is implemented in a module called [discovery-swarm](https://www.npmjs.com/package/discovery-swarm). This builds on discovery-channel and adds connection establishment, management and statistics. It provides statistics such as how many sources are currently connected, how many good and bad behaving sources have been talked to, and it automatically handles connecting and reconnecting to sources. UTP support is implemented in the module [utp-native](https://www.npmjs.com/package/utp-native).
+
+## Acknowledgements
+
+This work was made possible through grants from the John S. and James L. Knight and Alfred P. Sloan Foundations.
diff --git a/papers/dat-paper.pdf b/papers/dat-paper.pdf
index 03fc9c4..673d52c 100644
--- a/papers/dat-paper.pdf
+++ b/papers/dat-paper.pdf
Binary files differ