From 08bc8948d52ca749cb4fcf0f68f76bd18fa335b4 Mon Sep 17 00:00:00 2001 From: Karissa McKelvey Date: Fri, 1 Apr 2016 15:36:32 -0700 Subject: Update api.md --- api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/api.md b/api.md index e076819..3ecd884 100644 --- a/api.md +++ b/api.md @@ -21,7 +21,7 @@ Create a new link for the contents of a directory and begin automatically servin * `--foreground`: run the server in the foreground instead. -#### `dat download LINK DIR` +#### `dat LINK DIR` Download a link to a directory and begin automatically serving the data to a swarm in the background. -- cgit v1.2.3 From 20e67d4c4ffed610481b80ff559dcf7326e1771c Mon Sep 17 00:00:00 2001 From: Max Ogden Date: Mon, 4 Apr 2016 15:19:28 -0700 Subject: update how-dat-works --- how-dat-works.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/how-dat-works.md b/how-dat-works.md index 8f46cdb..2c4f849 100644 --- a/how-dat-works.md +++ b/how-dat-works.md @@ -8,7 +8,7 @@ When someone starts downloading data with the [Dat command-line tool](https://gi Dat links look like this: `dat.land/c3fcbcdcf03360529b47df32ccfb9bc1d7f64aaaa41cca43ca9ac7f6778db8da`. The domain, dat.land, is there so if someone opens the link in a browser we can provide them with download instructions, and as an easy way for people to visually distinguish and remember Dat links. Dat itself doesn't actually use the dat.land part, it just needs the last part of the link which is a fingerprint of the data that is being shared. The first thing that happens when you go to download data using one of these links is you ask various discovery networks if they can tell you where to find sources that have a copy of the data you need. -The discovery step itself is a simple query: you supply a Dat link, and receive back the IP and port of all the known data sources online that have a copy of that data you are looking for. You can then connect to them and begin exchanging data. By introducing this discovery phase we are able to create a network where data can be discovered even if the original data source disappears. +Source discovery means finding the IP and port of all the known data sources online that have a copy of that data you are looking for. You can then connect to them and begin exchanging data. By introducing this discovery phase we are able to create a network where data can be discovered even if the original data source disappears. The discovery protocols we use are [DNS name servers](https://en.wikipedia.org/wiki/Name_server), [Multicast DNS](https://en.wikipedia.org/wiki/Multicast_DNS) and the [Kademlia Mainline Distributed Hash Table](https://en.wikipedia.org/wiki/Mainline_DHT) (DHT). Each one has pros and cons, so we combine all three to increase the speed and reliability of discovering data sources. @@ -18,17 +18,17 @@ The discovery logic itself is handled by a module that we wrote called [discover ## Phase 2: Source connections -Up until this point we have just done searches to find who has the data we need. Now that we know who should talk to, we have to connect to them. We currently use TCP sockets are our primary transport protocol, and layer on our own file sharing protocol on top. We also have experimental support for [UTP](https://en.wikipedia.org/wiki/Micro_Transport_Protocol) which is designed to *not* take up all available bandwidth on a network (e.g. so that other people sharing your wifi can still use the Internet). We also are working on WebRTC support so we can incorporate Browser and Electron clients for some really open web use cases. +Up until this point we have just done searches to find who has the data we need. Now that we know who should talk to, we have to connect to them. We use either [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol) or [UTP](https://en.wikipedia.org/wiki/Micro_Transport_Protocol) sockets for the actual peer to peer connections. UTP is nice because it is designed to *not* take up all available bandwidth on a network (e.g. so that other people sharing your wifi can still use the Internet). We then layer on our own file sharing protocol on top, called [Hypercore](https://github.com/mafintosh/hypercore). We also are working on WebRTC support so we can incorporate Browser and Electron clients for some really open web use cases. When we get the IP and port for a potential source we try to connect using all available protocols (currently TCP and sometimes UTP) and hope one works. If one connects first, we abort the other ones. If none connect, we try again until we decide that source is offline or unavailable to use and we stop trying to connect to them. Sources we are able to connect to go into a list of known good source, so that if our Internet connection goes down we can use that list to reconnect to our good sources again quickly. If we get a lot of potential sources we pick a handful at random to try and connect to and keep the rest around as additional sources to use later in case we decide we need more sources. A lot of these are parameters that we can tune for different scenarios later, but have started with some best guesses as defaults. -The connection logic is implemented in a module called [discovery-swarm](https://www.npmjs.com/package/discovery-swarm). This builds on discovery-channel and adds connection establishment, management and statistics. You can see stats like how many sources are currently connected, how many good and bad behaving sources you've talked to, and it automatically handles connecting and reconnecting to sources for you. Our experimental UTP support is implemented in the module [utp-native](https://www.npmjs.com/package/utp-native), which you can manually install if you want to try it out with Dat. +The connection logic is implemented in a module called [discovery-swarm](https://www.npmjs.com/package/discovery-swarm). This builds on discovery-channel and adds connection establishment, management and statistics. You can see stats like how many sources are currently connected, how many good and bad behaving sources you've talked to, and it automatically handles connecting and reconnecting to sources for you. Our UTP support is implemented in the module [utp-native](https://www.npmjs.com/package/utp-native). ## Phase 3: Data exchange -So now we have found data sources, have connected to them, but we havent yet figured out if they *actually* have the data we need. This is where our file transfer protocol [Hyperdrive](https://www.npmjs.com/package/hyperdrive) comes in. You can read a much longer description of how hyperdrive works in the [Hyperdrive Specification](https://github.com/mafintosh/hyperdrive/blob/master/SPECIFICATION.md). +So now we have found data sources, have connected to them, but we havent yet figured out if they *actually* have the data we need. This is where our file transfer protocol [Hyperdrive](https://www.npmjs.com/package/hyperdrive) comes in. You can read a much longer description of how hyperdrive works in the [Hyperdrive Specification](hyperdrive.md). The short version of how Hyperdrive works is that it breaks file contents up in to pieces, hashes each piece and then constructs a [Merkle tree](https://en.wikipedia.org/wiki/Merkle_tree) out of all of the pieces. This ultimately gives us the Dat link, which is the top level hash of the Merkle tree. -- cgit v1.2.3 From 35d4251dead3335525244c237af81b7aec67f0b2 Mon Sep 17 00:00:00 2001 From: Max Ogden Date: Tue, 5 Apr 2016 12:48:33 -0700 Subject: test githubs gpg signed commits --- how-dat-works.md | 22 +++++++++++++++++++--- hyperdrive.md | 4 ++-- 2 files changed, 21 insertions(+), 5 deletions(-) diff --git a/how-dat-works.md b/how-dat-works.md index 2c4f849..71dba53 100644 --- a/how-dat-works.md +++ b/how-dat-works.md @@ -28,14 +28,30 @@ The connection logic is implemented in a module called [discovery-swarm](https:/ ## Phase 3: Data exchange -So now we have found data sources, have connected to them, but we havent yet figured out if they *actually* have the data we need. This is where our file transfer protocol [Hyperdrive](https://www.npmjs.com/package/hyperdrive) comes in. You can read a much longer description of how hyperdrive works in the [Hyperdrive Specification](hyperdrive.md). +So now we have found data sources, have connected to them, but we havent yet figured out if they *actually* have the data we need. This is where our file transfer protocol [Hyperdrive](https://www.npmjs.com/package/hyperdrive) comes in. -The short version of how Hyperdrive works is that it breaks file contents up in to pieces, hashes each piece and then constructs a [Merkle tree](https://en.wikipedia.org/wiki/Merkle_tree) out of all of the pieces. This ultimately gives us the Dat link, which is the top level hash of the Merkle tree. +The short version of how Hyperdrive works is: It breaks file contents up in to pieces, hashes each piece and then constructs a [Merkle tree](https://en.wikipedia.org/wiki/Merkle_tree) out of all of the pieces. This ultimately gives us the Dat link, which is the top level hash of the Merkle tree. -We use a technique called Rabin fingerprinting to break files up into pieces. Rabin fingerprints are a specific strategy for what is called Content Defined Chunking. Here's an example: +Here's the long version: + +Hyperdrive shares and synchronizes a set of files, similar to rsync or Dropbox. For each file in the drive we use a technique called Rabin fingerprinting to break the file up into pieces. Rabin fingerprints are a specific strategy for what is called Content Defined Chunking. Here's an example: ![cdc diagram](meta/cdc.png) +We have configured our Rabin chunker to produce chunks that are around 16KB on average. So if you share a folder containing a single 1MB JPG you will get around 64 chunks. The idea of CDC is that your chunks only change if + +After feeding the file contents through the chunker, we take the chunks and calculate the SHA256 hash of each one. We then arrange these hashes into a special data structure we developed that we call the Flat In-Order Merkle Tree. + +### Flat In-Order Merkle Tree + +``` + 3 + 1 5 +0 2 4 6 +``` + +Want to go lower level? Check out [How Hypercore Works?](hyperdrive.md#how-hypercore-works) + When two peers connect to each other and begin speaking the Hyperdrive protocol they can efficiently determine if they have chunks the other one wants, and begin exchanging those chunks directly. Hyperdrive gives us the flexibility to have random access to any portion of a file while still verifying the other side isnt sending us bad data. We can also download different sections of files in parallel across all of the sources simultaneously, which increases overall download speed dramatically. ## Phase 4: Data archiving diff --git a/hyperdrive.md b/hyperdrive.md index 999a03c..1636d77 100644 --- a/hyperdrive.md +++ b/hyperdrive.md @@ -18,9 +18,9 @@ The protocol itself draws heavy inspiration from existing file sharing systems s ## How Hypercore works -### Flat Trees +### Flat In-Order Trees -A flat tree is a simple way represent a binary tree as a list. It also allows you to identify every node of a binary tree with a numeric index. Both of these properties makes it useful in distributed applications to simplify wire protocols that uses tree structures. +A Flat In-Order Tree is a simple way represent a binary tree as a list. It also allows you to identify every node of a binary tree with a numeric index. Both of these properties makes it useful in distributed applications to simplify wire protocols that uses tree structures. Flat trees are described in [PPSP RFC 7574 as "Bin numbers"](https://datatracker.ietf.org/doc/rfc7574/?include_text=1) and a node version is available through the [flat-tree](https://github.com/mafintosh/flat-tree) module. -- cgit v1.2.3 From 2d12228213baa9af81f4e4b21c59d2597b9a39d8 Mon Sep 17 00:00:00 2001 From: Max Ogden Date: Tue, 5 Apr 2016 12:49:23 -0700 Subject: try again --- how-dat-works.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/how-dat-works.md b/how-dat-works.md index 71dba53..0e76667 100644 --- a/how-dat-works.md +++ b/how-dat-works.md @@ -38,7 +38,7 @@ Hyperdrive shares and synchronizes a set of files, similar to rsync or Dropbox. ![cdc diagram](meta/cdc.png) -We have configured our Rabin chunker to produce chunks that are around 16KB on average. So if you share a folder containing a single 1MB JPG you will get around 64 chunks. The idea of CDC is that your chunks only change if +We have configured our Rabin chunker to produce chunks that are around 16KB on average. So if you share a folder containing a single 1MB JPG you will get around 64 chunks. After feeding the file contents through the chunker, we take the chunks and calculate the SHA256 hash of each one. We then arrange these hashes into a special data structure we developed that we call the Flat In-Order Merkle Tree. -- cgit v1.2.3 From e74d154fee728437e4a0cffb457509ec847e1003 Mon Sep 17 00:00:00 2001 From: Mathias Buus Date: Fri, 8 Apr 2016 04:44:17 +0200 Subject: Update hyperdrive.md --- hyperdrive.md | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 82 insertions(+) diff --git a/hyperdrive.md b/hyperdrive.md index 1636d77..a1b5ae5 100644 --- a/hyperdrive.md +++ b/hyperdrive.md @@ -257,3 +257,85 @@ So using this digest the person can easily figure out that they only need to sen The bit vector `1` (only contains a single one) means that we already have all the hashes we need so just send us the block. These digests are very compact in size, only `(log2(number-of-blocks) + 2) / 8` bytes needed in the worst case. For example if you are sharing one trillion blocks of data the digest would be `(log2(1000000000000) + 2) / 8 ~= 6` bytes long. + +### Bitfield Run length Encoding + +(talk about rle) + +### Basic Privacy + +(talk about the privacy features + public id here) + +## Hypercore Feeds + +(talk about how we use the above concepts to create a feed of data) + +## Hypercore Replication Protocol + +Lets assume two peers have the identifier for a hypercore feed. This could either be the hash of the merkle tree roots described above or a public key if they want to share a signed merkle tree. The two peers wants to exchange the data verified by this tree. Lets assume the two peers have somehow connected to each other. + +Hypercore uses a message based protocol to exchange data. All messages sent are encoded to binary values using Protocol Buffers. Protocol Buffers are a widely supported schema based encoding support. A Protocol Buffers implementation is available on npm through the [protocol-buffers](https://github.com/mafintosh/protocol-buffers) module. + +These are the types of messages the peers send to each other + +#### Open + +This should be the first message sent and is also the only message without a type. It looks like this + +``` proto +message Open { + required bytes publicId = 1; + required bytes nonce = 2; +} +``` + +The `publicId` should be set to the public id of the Merkle Tree as specified above. The `nonce` should be set to 24 bytes of random data. When running in encrypted mode this is the only message sent unencrypted. + +#### `0` Handshake + +The message contains the protocol handshake. It has type `0`. + +``` proto +message Handshake { + optional uint64 version = 1; + required bytes peerId = 2; + repeated string extensions = 3; +} +``` + +You should send this message after sending an open message. By sending it after an open message it will be encrypted and we wont expose our peer id to a third party. The current protocol version is 0. + +#### `1` Have + +You can send a have message to give the other peer information about which blocks of data you have. It has type `1`. + +``` proto +message Have { + required uint64 start = 1; + optional uint64 end = 2; + optional bytes bitfield = 3; +} +``` + +If using a bitfield it should be encoded using a run length encoding described above. It is a good idea to send a have message soon as possible if you have blocks to share to reduce latency. + +#### `2` Want + +You can send a have message to give the other peer information about which blocks of data you want to have. It has type `2`. + +``` proto +message Want { + required uint64 start = 1; + optional uint64 end = 2; +} +``` + +You should only send the want message if you are interesting in a section of the feed that the other peer has not told you about. + +#### `3` Request +#### `4` Response +#### `5` Cancel +#### `6` Pause +#### `7` Resume +#### `8` Close + -- cgit v1.2.3 From 39a0791dfd350b10f7ff8540762431cf993e8e3a Mon Sep 17 00:00:00 2001 From: Karissa McKelvey Date: Mon, 11 Apr 2016 14:44:58 -0700 Subject: Update api.md --- api.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/api.md b/api.md index 3ecd884..a92e20f 100644 --- a/api.md +++ b/api.md @@ -33,6 +33,10 @@ Starts serving dats in the background progress. * `--foreground`: run the server in the foreground instead. +#### `dat status` + +View a static list of the current dat links that are served. + #### `dat stop` Stops serving dats. @@ -41,10 +45,6 @@ Stops serving dats. Remove a link from the list, stops serving it. -#### `dat ls` - -View a static list of the current dat links that are served. - #### `dat mon` Opens up real-time monitoring panel for viewing progress of running dats. Can provide optional parameter `dat mon LINK` to filter the monitor and logs for a given dat link. -- cgit v1.2.3 From 4ae288b201d29fce777a2b0d4bd67fcbd2b733da Mon Sep 17 00:00:00 2001 From: Karissa McKelvey Date: Mon, 11 Apr 2016 14:45:38 -0700 Subject: Update api.md --- api.md | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-) diff --git a/api.md b/api.md index a92e20f..e3bcbb3 100644 --- a/api.md +++ b/api.md @@ -1,12 +1,9 @@ ## 1.0 Architecture Design -![dat-arch.001.jpg](arch.png) - * dat: command-line * dat-desk: desktop application - * dat-server: http and ui frontend - * dat-js: JS api + * dat-server: daemon dat manager * hyperdrive: storage layer * discovery-swarm: swarm @@ -83,9 +80,9 @@ linker.on('progress', function (progress) { }) ``` -#### `dat.download(link, dir, cb)` +#### `dat.join(link, dir, cb)` -Download the given link to a given location. Get progress events from the stream. Progress events are the same as emitted by the `dat` object. +Joins the swarm for a given link to a given location. Get progress events from the stream. Progress events are the same as emitted by the `dat` object. ```js var done = function (err) { @@ -100,10 +97,6 @@ downloader.on('progress', function (progress) { }) ``` -#### `dat.join(link, cb)` - -Join a swarm for the given link. Should be called after `link` or `download`. Throws error if data has not been downloaded or linked. - #### `dat.leave(link, cb)` Leave the swarm for the given link. -- cgit v1.2.3 From f1f30ba7678cb08f8b3ffcf2b92f623732c946b0 Mon Sep 17 00:00:00 2001 From: Max Ogden Date: Fri, 22 Apr 2016 19:40:02 -0700 Subject: add first draft of meta.dat --- diy-dat.md | 2 +- meta.dat.md | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 61 insertions(+), 1 deletion(-) create mode 100644 meta.dat.md diff --git a/diy-dat.md b/diy-dat.md index 77a644d..f95eb11 100644 --- a/diy-dat.md +++ b/diy-dat.md @@ -15,7 +15,7 @@ var Level = require('level') // the dat link someone sent us, we want to download the data from it var link = new Buffer(process.argv[2], 'hex') -// here are the default config dat uses: +// here is the default config dat uses // used for MDNS and also as the dns 'app name', you prob shouldnt change this var DAT_DOMAIN = 'dat.local' // dat will try this first and pick the first open port if its taken diff --git a/meta.dat.md b/meta.dat.md new file mode 100644 index 0000000..7cc8db6 --- /dev/null +++ b/meta.dat.md @@ -0,0 +1,60 @@ +# meta.dat + +Dat uses a simple metadata file called `meta.dat`. The purpose of this file is to store the fingerprints of the files in a Dat repository. If you create a `meta.dat` file for a set of files, you can host it on a static HTTP server along with the files and Dat clients will be able to download and verify your files, even if you aren't running a Dat server! + +# File format + +``` +
+``` + +The format is a header followed by many entries. Entry order is based on the indexing determined by the [Flat In-Order Tree](hyperdrive.md#flat-in-order-trees) algorithm we use in Dat. + +For example, given a tree like this you might want to look up in a `meta.dat` file the metadata for a specific node: + +``` +0─┐ + 1─┐ +2─┘ │ + 3 +4─┐ │ + 5─┘ +6─┘ +``` + +If you wanted to look up the metadata for 3, you could read the third (or any!) entry from meta.dat with the following formula (assuming you've already decoded the header and hash lengths from the beginning of the file): + +``` +header-length + entry-number * (8 + hash-length) +``` + +### Header format + +``` +
+``` + +The header protobuf has this schema: + +``` proto +message Header { + dat link + is-signed-feed + hash-type + hash-length +} +``` + +### Entry format + +For non-signed entries: + +``` +<8-byte-chunk-offset> +``` + +For signed entries (in live feeds): + +``` +<8-byte-chunk-offset><64-byte-signature> +``` -- cgit v1.2.3 From 9697d0e83af88e1331ffc36a1d42289163bf9d75 Mon Sep 17 00:00:00 2001 From: Mathias Buus Date: Sat, 23 Apr 2016 12:28:04 +0200 Subject: add types to proto file --- meta.dat.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/meta.dat.md b/meta.dat.md index 7cc8db6..5600f95 100644 --- a/meta.dat.md +++ b/meta.dat.md @@ -38,10 +38,10 @@ The header protobuf has this schema: ``` proto message Header { - dat link - is-signed-feed - hash-type - hash-length + required bytes datLink = 1; + optional isLive = 2; + optional string hashType = 3 [default = "sha256"]; + optional uint32 hashLength = 4 [default = 32]; } ``` -- cgit v1.2.3 From 78ec3b96e4d993f71f7033c729ee802cebd2362d Mon Sep 17 00:00:00 2001 From: Mathias Buus Date: Sat, 23 Apr 2016 12:28:42 +0200 Subject: missing bool --- meta.dat.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/meta.dat.md b/meta.dat.md index 5600f95..81bf06c 100644 --- a/meta.dat.md +++ b/meta.dat.md @@ -39,7 +39,7 @@ The header protobuf has this schema: ``` proto message Header { required bytes datLink = 1; - optional isLive = 2; + optional bool isLive = 2; optional string hashType = 3 [default = "sha256"]; optional uint32 hashLength = 4 [default = 32]; } -- cgit v1.2.3 From c6826d4d1603c5a57c8be0f1b53d1710f466aac7 Mon Sep 17 00:00:00 2001 From: Mathias Buus Date: Sat, 23 Apr 2016 20:37:47 +0200 Subject: live -> signed --- meta.dat.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/meta.dat.md b/meta.dat.md index 81bf06c..51ab30d 100644 --- a/meta.dat.md +++ b/meta.dat.md @@ -39,7 +39,7 @@ The header protobuf has this schema: ``` proto message Header { required bytes datLink = 1; - optional bool isLive = 2; + optional bool isSigned = 2; optional string hashType = 3 [default = "sha256"]; optional uint32 hashLength = 4 [default = 32]; } -- cgit v1.2.3 From 26ad91fe7851620b6884e6680fe54a9b26b5d48f Mon Sep 17 00:00:00 2001 From: Mathias Buus Date: Sat, 23 Apr 2016 21:04:25 +0200 Subject: add javascript examples --- meta.dat.md | 82 +++++++++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 61 insertions(+), 21 deletions(-) diff --git a/meta.dat.md b/meta.dat.md index 51ab30d..b54af6b 100644 --- a/meta.dat.md +++ b/meta.dat.md @@ -10,24 +10,6 @@ Dat uses a simple metadata file called `meta.dat`. The purpose of this file is t The format is a header followed by many entries. Entry order is based on the indexing determined by the [Flat In-Order Tree](hyperdrive.md#flat-in-order-trees) algorithm we use in Dat. -For example, given a tree like this you might want to look up in a `meta.dat` file the metadata for a specific node: - -``` -0─┐ - 1─┐ -2─┘ │ - 3 -4─┐ │ - 5─┘ -6─┘ -``` - -If you wanted to look up the metadata for 3, you could read the third (or any!) entry from meta.dat with the following formula (assuming you've already decoded the header and hash lengths from the beginning of the file): - -``` -header-length + entry-number * (8 + hash-length) -``` - ### Header format ``` @@ -42,6 +24,8 @@ message Header { optional bool isSigned = 2; optional string hashType = 3 [default = "sha256"]; optional uint32 hashLength = 4 [default = 32]; + optional string signatureType = 5 [default = "ed25519"]; + optional uint32 signatureLength = 6 [default = 64]; } ``` @@ -50,11 +34,67 @@ message Header { For non-signed entries: ``` -<8-byte-chunk-offset> +<8-byte-chunk-end> ``` -For signed entries (in live feeds): +The 8-byte-chunk-end is an unsigned big endian 64 bitd integer that should be the absolute position in the file for the **end of the chunk**.. +For signed entries in live feeds (only applies to even numbered nodes e.g. leaf nodes): + +``` +<8-byte-chunk-end> ``` -<8-byte-chunk-offset><64-byte-signature> + +For any odd nodes, in either a live or a non-live feed, the non-signed entry format will be used. + +## Example + +Given a tree like this you might want to look up in a `meta.dat` file the metadata for a specific node: + +``` +0─┐ + 1─┐ +2─┘ │ + 3 +4─┐ │ + 5─┘ +6─┘ +``` + +If you wanted to look up the metadata for 3, you could read the third (or any!) entry from meta.dat: + +First you have to read the varint at the beginning of the file so you know how big the header is: + +``` js +var varint = require('varint') // https://github.com/chrisdickinson/varint +var headerLength = varint.decode(firstChunkOfFile) +``` + +Now you can read the header from the file + +``` js +var headerOffset = varint.encodingLength(headerLength) +var headerEndOffset = headerOffset + headerLength +var headerBytes = firstChunkOfFile.slice(headerOffset, headerEndOffset) +``` + +To decode the header use the protobuf schema. We can use the [protocol-buffers](https://github.com/mafintosh/protocol-buffers) module to do that. + +``` js +var messages = require('protocol-buffers')(fs.readFileSync('meta.dat.proto')) +var header = messages.Header.decode(headerBytes) +``` + +Now we have all the configuration required to calculate an entry offset. + +``` js +var entryNumber = 42 +var entryOffset = headerEndOffset + entryNumber * (8 + header.hashLength) +``` + +If you have a signed feed, you have to take into account the extra space required for the signatures in the even nodes. + +``` js +var entryOffset = headerLength + entryNumber * (8 + header.hashLength) + + Math.floor(entryNumber / 2) * header.signatureLength ``` -- cgit v1.2.3 From 3f25fb9d483d415d78f4945b2863c5d99bca6229 Mon Sep 17 00:00:00 2001 From: Mathias Buus Date: Sat, 23 Apr 2016 22:43:10 +0200 Subject: update with entries inlined --- meta.dat.md | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/meta.dat.md b/meta.dat.md index b54af6b..6d51634 100644 --- a/meta.dat.md +++ b/meta.dat.md @@ -5,10 +5,10 @@ Dat uses a simple metadata file called `meta.dat`. The purpose of this file is t # File format ``` -
+
``` -The format is a header followed by many entries. Entry order is based on the indexing determined by the [Flat In-Order Tree](hyperdrive.md#flat-in-order-trees) algorithm we use in Dat. +The format is a header followed by an index of many entries. Entry order is based on the indexing determined by the [Flat In-Order Tree](hyperdrive.md#flat-in-order-trees) algorithm we use in Dat. After the entry index, a concatinated list of entries follows. ### Header format @@ -21,15 +21,16 @@ The header protobuf has this schema: ``` proto message Header { required bytes datLink = 1; - optional bool isSigned = 2; - optional string hashType = 3 [default = "sha256"]; - optional uint32 hashLength = 4 [default = 32]; - optional string signatureType = 5 [default = "ed25519"]; - optional uint32 signatureLength = 6 [default = 64]; + required uint64 entries = 2; + optional bool isSigned = 3; + optional string hashType = 4 [default = "sha256"]; + optional uint32 hashLength = 5 [default = 32]; + optional string signatureType = 6 [default = "ed25519"]; + optional uint32 signatureLength = 7 [default = 64]; } ``` -### Entry format +### Entry index format For non-signed entries: @@ -37,7 +38,7 @@ For non-signed entries: <8-byte-chunk-end> ``` -The 8-byte-chunk-end is an unsigned big endian 64 bitd integer that should be the absolute position in the file for the **end of the chunk**.. +The 8-byte-chunk-end is an unsigned big endian 64 bit integer that should be the absolute position in the file for the **end of the chunk**. For signed entries in live feeds (only applies to even numbered nodes e.g. leaf nodes): -- cgit v1.2.3 From b9341322b043a445c9eeea194895b79fc75443ba Mon Sep 17 00:00:00 2001 From: Max Ogden Date: Sat, 7 May 2016 11:15:04 +0200 Subject: start dat paper --- paper.md | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) create mode 100644 paper.md diff --git a/paper.md b/paper.md new file mode 100644 index 0000000..869f021 --- /dev/null +++ b/paper.md @@ -0,0 +1,39 @@ +# Dat + +## Distributed Dataset Synchronization And Versioning + +Draft 1 +Maxwell Ogden +max@maxogden.com +2016 + +## ABSTRACT + +Dat is a swarm based version control system designed for sharing large datasets over networks such that their contents can be accessed randomly, be updated incrementally, and have the integrity of their contents be trusted. Every Dat user is simultaneously a server and a client exchanging pieces of data with other peers in a swarm on demand. As data is added to a Dat repository updated files are split into pieces based on Rabin fingerprinting and deduplicated against known pieces to avoid retransmission of data. File contents are automatically verified using secure hashes meaning you do not need to trust other nodes. + +## 1. INTRODUCTION + +There are countless ways to share share datasets over the Internet today. The simplest and most widely used approach, sharing files over HTTP, is subject to dead links when files are moved or deleted, as HTTP has no concept of history or versioning built in. E-mailing datasets as attachments is also widely used, and has the concept of history built in, but many email providers limit the maximum attachment size which makes it impractical for many datasets. + +Cloud storage services like S3 ensure availability of data, but as they have a centralized hub-and-spoke networking model tend to be limited by their bandwidth, meaning popular files can be come very expensive to share. Services like Dropbox and Google Drive provide version control and synchronization on top of cloud storage services which fixes many issues with broken links but rely on proprietary code and infrastructure requiring users to store their data on cloud infrastructure which has implications on cost, transfer speeds, and user privacy. + +Distributed file sharing tools like BitTorrent become faster as files become more popular, removing the bandwidth bottleneck and making file distribution effectively free. They also implement discovery systems which fix the broken link issue meaning if the original source goes offline other backup sources can be automatically discovered. However P2P file sharing tools today are not supported by Web browsers and do not provide a mechanism for updating files without redistributing a new dataset which could mean entire redownloading data you already have. + +Decentralized version control tools for source code like Git provide a protocol for efficiently downloading changes to a set of files, but are optimized for text files and have issues with large files. Solutions like Git-LFS solve this by using HTTP to download large files, rather than the Git protocol. GitHub offers Git-LFS hosting but charges repository owners for bandwidth on popular files. Building a peer to peer distribution layer for files in a Git repository is difficult due to design of Git Packfiles which are delta compressed repository states that do not support random access to byte ranges in previous file versions. + +One case study is science. Increasingly scientific datasets are being provided online using one of the above approaches, and cited in published literature. Broken links and systems that do not provide version checking or content addressability of data directly limit the reproducibility of scientific analyses based on shared datasets. Services that charge a premium for bandwidth cause monetary and data transfer strain on the users sharing the data, who are often on fast public university networks with effectively unlimited bandwidth. Version control tool designed for text files do not keep up with the demands of large data analysis in science today. + +## 2. INSPIRATION + +- git +- lbfs +- bittorrent +- webtorrent +- ipfs + +## 3. DESIGN + +- mirroring +- reproducibility +- parallel downloading +- incremental updates -- cgit v1.2.3 From f8f12c6de61270d587089256f62b3114cf29aa1f Mon Sep 17 00:00:00 2001 From: Max Ogden Date: Sun, 8 May 2016 11:35:56 +0200 Subject: inspiration section --- paper.md | 52 ++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 46 insertions(+), 6 deletions(-) diff --git a/paper.md b/paper.md index 869f021..9365b09 100644 --- a/paper.md +++ b/paper.md @@ -21,15 +21,55 @@ Distributed file sharing tools like BitTorrent become faster as files become mor Decentralized version control tools for source code like Git provide a protocol for efficiently downloading changes to a set of files, but are optimized for text files and have issues with large files. Solutions like Git-LFS solve this by using HTTP to download large files, rather than the Git protocol. GitHub offers Git-LFS hosting but charges repository owners for bandwidth on popular files. Building a peer to peer distribution layer for files in a Git repository is difficult due to design of Git Packfiles which are delta compressed repository states that do not support random access to byte ranges in previous file versions. -One case study is science. Increasingly scientific datasets are being provided online using one of the above approaches, and cited in published literature. Broken links and systems that do not provide version checking or content addressability of data directly limit the reproducibility of scientific analyses based on shared datasets. Services that charge a premium for bandwidth cause monetary and data transfer strain on the users sharing the data, who are often on fast public university networks with effectively unlimited bandwidth. Version control tool designed for text files do not keep up with the demands of large data analysis in science today. +Science is an example of an important community that would benefit from better approaches in this area. Increasingly scientific datasets are being provided online using one of the above approaches, and cited in published literature. Broken links and systems that do not provide version checking or content addressability of data directly limit the reproducibility of scientific analyses based on shared datasets. Services that charge a premium for bandwidth cause monetary and data transfer strain on the users sharing the data, who are often on fast public university networks with effectively unlimited bandwidth. Version control tool designed for text files do not keep up with the demands of large data analysis in science today. ## 2. INSPIRATION -- git -- lbfs -- bittorrent -- webtorrent -- ipfs +Dat is inspired by a number of features from existing systems. + +### Git + +Git popularized the idea of a Merkle DAG, a way to represent changes to data where each change is addressed by the secure hash of the change plus all previous hashes. This provides a way to trust data integrity, as the only way a specific hash could be derived by another peer is if they have the same data and change history required to reproduce that hash. This is important for reproducibility as it lets you trust that a specific git commit hash refers to a specific source code state. + +### LBFS + +LBFS is a networked file system that avoids transferring redundant data by deduplicating common regions of files and only transferring unique regions once. The deduplication algorithm they use is called Rabin fingerprinting and works by hashing the contents of the file using a sliding window and looking for content defined chunk boundaries that probabilistically appear at the desired byte offsets (e.g. every 1kb). + +Content defined chunking has the benefit of being shift resistant, meaning if you insert a byte into the middle of a file only the first chunk boundary to the right of the insert will change, but all other boundaries will remain the same. With a fixed size chunking strategy, such as the one used by rsync, all chunk boundaries to the right of the insert will be shifted by one byte, meaning half of the chunks of the file would need to be retransmitted. + +### BitTorrent + +BitTorrent implements a swarm based file sharing protocol for static datasets. Data is split into fixed sized chunks, hashed, and then that hash is used to discover peers that have the same data. An advantage of using BitTorrent for dataset transfers is that download speeds can be saturated. Since the file is split into pieces, and peers can efficiently discover which pieces each of the peers they are connected to have, it means one peer can download non-overlapping regions of the dataset from many peers at the same time in parallel, maximizing network throughput. + +Fixed sized chunking has drawbacks for data that changes (see LBFS above). Additionally, BitTorrent always divides data into 1024 pieces, meaning large datasets could have a very large chunk size which impacts random access performance (e.g. for streaming video over BitTorrent). + +### Kademlia Distributed Hash Table + +Kademlia is a distributed hash table, in other words a distributed key/value store that can serve a similar purpose to DNS servers but has no hard coded server addresses. All clients in Kademlia are also servers. As long as you know at least one address of another peer in the network, you can ask them for the key you are trying to find and they will either have it or give you some other people to talk to that are more likely to have it. + +If you don't have an initial peer to talk to you have to use something like a bootstrap server that just randomly gives you a peer in the network to start with. If the bootstrap server goes down, the network still functions, and other methods can be used to bootstrap new peers (such as sending them peer addresses through side channels like how .torrent files include tracker addresses to try in case Kademlia finds no peers). + +Kademlia is distinct from previous DHT designs such as Chord due to its simplicity. It uses a very simple XOR operation between two keys as it's distance metric to decide which peers are closer to the data being searched for. On paper it seems like it wouldn't work, as it doesn't take into account things like ping speed or bandwidth. Instead it's design is very simple on purpose, to minimize the amount of control/gossip messages, and to minimize the amount of complexity required to implement it. In practice Kademlia has been extremely successful and is widely deployed as the "Mainline DHT" for BitTorrent, with support in all popular BitTorrent clients today. + +### Peer to Peer Streaming Peer Protocol (PPSPP) + +PPSPP ([IETF RFC 7574](https://datatracker.ietf.org/doc/rfc7574/?include_text=1)) is a protocol for live streaming content over a peer to peer network. In it they define a specific type of Merkle Tree that allows for subsets of the hashes to be requested by a peer in order to reduce the time-till-playback for end users. BitTorrent for example transfers all hashes up front, which is not suitable for live streaming. + +Their Merkle trees are ordered using a scheme they call "bin numbering", which is a method for deterministically arranging an append-only log of leaf nodes into an in-order layout tree where non-leaf nodes are derived hashes. If you want to verify a specific node, you only need to request its sibling's hash and all its uncle hashes. PPSPP is very concerned with reducing round trip time and time-till-playback by allowing for many kinds of optimizations to pack as many hashes into datagrams as possible when exchanging tree information with peers. + +The ability to request a subset of metadata from a large and/or streaming dataset is very desirable for the Dat use case. + +### WebTorrent + +With WebRTC, browsers can now make peer to peer connections directly to other browsers. BitTorrent uses UDP sockets which aren't available to browser JavaScript, so can't be used as-is on the Web. + +WebTorrent implements the BitTorrent protocol in JavaScript using WebRTC as the transport. This includes the BitTorrent block exchange protocol as well as the tracker protocol implemented in a way that can enable hybrid nodes, talking simultaneously to both BitTorrent and WebTorrent swarms (if a peer is capable of making both UDP sockets as well as WebRTC sockets). Trackers are exposed to web clients over HTTP or WebSockets. In a normal web browser you can only use WebRTC to exchange data with peers. + +### InterPlanetary File System + +IPFS also builds on many of the concepts from this section and presents a new platform similar in scope to the Web that has content integrity, peer to peer file sharing, version history and data permanence baked in as a sort of upgrade to the current Web. Whereas Dat is one application of these ideas that is specifically focused on sharing datasets but is agnostic to what platform it is built on, IPFS goes lower level and abstracts network sockets and naming systems so that any application built on the Web can alternatively be built on IPFS to inherit it's properties, as long as their hyperlinks can be expressed as content addressed addresses to the IPFS global Merkle DAG. + +The research behind IPFS has coalesced many of these ideas into a more accessible format. ## 3. DESIGN -- cgit v1.2.3 From 22c75baea185200cb0032adc76f0f33448a3f2aa Mon Sep 17 00:00:00 2001 From: Max Ogden Date: Sun, 8 May 2016 13:53:37 +0200 Subject: more sections of paper --- how-dat-works.md | 2 +- paper.md | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++------ 2 files changed, 65 insertions(+), 8 deletions(-) diff --git a/how-dat-works.md b/how-dat-works.md index 0e76667..9921f62 100644 --- a/how-dat-works.md +++ b/how-dat-works.md @@ -20,7 +20,7 @@ The discovery logic itself is handled by a module that we wrote called [discover Up until this point we have just done searches to find who has the data we need. Now that we know who should talk to, we have to connect to them. We use either [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol) or [UTP](https://en.wikipedia.org/wiki/Micro_Transport_Protocol) sockets for the actual peer to peer connections. UTP is nice because it is designed to *not* take up all available bandwidth on a network (e.g. so that other people sharing your wifi can still use the Internet). We then layer on our own file sharing protocol on top, called [Hypercore](https://github.com/mafintosh/hypercore). We also are working on WebRTC support so we can incorporate Browser and Electron clients for some really open web use cases. -When we get the IP and port for a potential source we try to connect using all available protocols (currently TCP and sometimes UTP) and hope one works. If one connects first, we abort the other ones. If none connect, we try again until we decide that source is offline or unavailable to use and we stop trying to connect to them. Sources we are able to connect to go into a list of known good source, so that if our Internet connection goes down we can use that list to reconnect to our good sources again quickly. +When we get the IP and port for a potential source we try to connect using all available protocols (currently TCP and sometimes UTP) and hope one works. If one connects first, we abort the other ones. If none connect, we try again until we decide that source is offline or unavailable to use and we stop trying to connect to them. Sources we are able to connect to go into a list of known good sources, so that if our Internet connection goes down we can use that list to reconnect to our good sources again quickly. If we get a lot of potential sources we pick a handful at random to try and connect to and keep the rest around as additional sources to use later in case we decide we need more sources. A lot of these are parameters that we can tune for different scenarios later, but have started with some best guesses as defaults. diff --git a/paper.md b/paper.md index 9365b09..455bfc2 100644 --- a/paper.md +++ b/paper.md @@ -9,7 +9,7 @@ max@maxogden.com ## ABSTRACT -Dat is a swarm based version control system designed for sharing large datasets over networks such that their contents can be accessed randomly, be updated incrementally, and have the integrity of their contents be trusted. Every Dat user is simultaneously a server and a client exchanging pieces of data with other peers in a swarm on demand. As data is added to a Dat repository updated files are split into pieces based on Rabin fingerprinting and deduplicated against known pieces to avoid retransmission of data. File contents are automatically verified using secure hashes meaning you do not need to trust other nodes. +Dat is a swarm based version control system designed for sharing large datasets over networks such that their contents can be accessed randomly, be updated incrementally, and have the integrity of their contents be trusted. Every Dat user is simultaneously a server and a client exchanging pieces of data with other peers in a swarm on demand. As data is added to a Dat repository updated files are split into pieces based on Rabin fingerprinting and deduplicated against known pieces to avoid retransmission of data. File contents are automatically verified using secure hashes meaning you do not need to trust other nodes. ## 1. INTRODUCTION @@ -41,7 +41,7 @@ Content defined chunking has the benefit of being shift resistant, meaning if yo BitTorrent implements a swarm based file sharing protocol for static datasets. Data is split into fixed sized chunks, hashed, and then that hash is used to discover peers that have the same data. An advantage of using BitTorrent for dataset transfers is that download speeds can be saturated. Since the file is split into pieces, and peers can efficiently discover which pieces each of the peers they are connected to have, it means one peer can download non-overlapping regions of the dataset from many peers at the same time in parallel, maximizing network throughput. -Fixed sized chunking has drawbacks for data that changes (see LBFS above). Additionally, BitTorrent always divides data into 1024 pieces, meaning large datasets could have a very large chunk size which impacts random access performance (e.g. for streaming video over BitTorrent). +Fixed sized chunking has drawbacks for data that changes (see LBFS above). Additionally, BitTorrent assumes all metadata will be transferred up front, and most clients divide data into 1024 pieces, meaning large datasets could have a very large chunk size which impacts random access performance (e.g. for streaming video over BitTorrent). ### Kademlia Distributed Hash Table @@ -69,11 +69,68 @@ WebTorrent implements the BitTorrent protocol in JavaScript using WebRTC as the IPFS also builds on many of the concepts from this section and presents a new platform similar in scope to the Web that has content integrity, peer to peer file sharing, version history and data permanence baked in as a sort of upgrade to the current Web. Whereas Dat is one application of these ideas that is specifically focused on sharing datasets but is agnostic to what platform it is built on, IPFS goes lower level and abstracts network sockets and naming systems so that any application built on the Web can alternatively be built on IPFS to inherit it's properties, as long as their hyperlinks can be expressed as content addressed addresses to the IPFS global Merkle DAG. -The research behind IPFS has coalesced many of these ideas into a more accessible format. +The research behind IPFS has coalesced many of these ideas into a more accessible format. We are still exploring how to best implement the Dat protocol on top of the IPFS platform. ## 3. DESIGN -- mirroring -- reproducibility -- parallel downloading -- incremental updates +Dat is a file sharing protocol that does not assume a dataset is static or that the entire dataset will be downloaded. The protocol is agnostic to the underlying transport, e.g. you could implement Dat over carrier pigeon. The key properties of the Dat design are explained in this section. + +- 1. **Mirroring** - All participants in the network simultaneously share and consume data. +- 2. **Content Integrity** - Data and publisher integrity is verified through use of signed content addressable hashes +- 3. **Parallel transfer** - Subsets of the data can be accessed from multiple peers simultaneously, improving transfer speeds +- 4. **Streaming updates** - Datasets can be updated and distributed in real time to downstream peers +- 5. **Secure Metadata** - Dat employs a capability system whereby anyone with a Dat link can connect to the swarm, but the link itself is a secure hash that is nearly impossible to guess + +## 3.1 Mirroring + +Dat is a peer to peer protocol designed to exchange pieces of a dataset amongst a swarm of peers. When a peer acquires their first piece of data in the dataset, they are now a partial mirror for the dataset. If someone else contacts them and needs the piece they have, they can share it. This can happen simultaneously while the peer is still downloading the pieces they want. + +### 3.1.1 Source Discovery + +An important aspect of mirring is source discovery, the techniques that peers use to find each other. Source discovery means finding the IP and port of data sources online that have a copy of that data you are looking for. You can then connect to them and begin exchanging data using the Dat file exchange protocol, Hypercore. By using source discovery techniques we are able to create a network where data can be discovered even if the original data source disappears. + +Source discovery can happen over many kinds of networks, as long as you can model the following actions: + +- `join(key, [port])` - Begin performing regular lookups on an interval for `key`. Specify `port` if you want to announce that you share `key` as well. +- `leave(key, [port])` - Stop looking for `key`. Specify `port` to stop announcing that you share `key` as well. +- `foundpeer(key, ip, port)` - Called when a peer is found by a lookup + +In the Dat implementation we implement the above actions on top of three types of discovery networks: + +- DNS name servers - An Internet standard mechanism for resolving keys to addresses +- Multicast DNS - Useful for discovering peers on local networks +- Kademlia Mainline Distributed Hash Table - Zero point of failure, increases probability of Dat working even if DNS servers are unreachable + +Additional discovery networks can be implemented as needed. We chose the above three as a starting point to have a complementary mix of strategies to increase the probability of source discovery. + +Our implementation of peer discovery is called discovery-channel. We also run a [custom DNS server](https://www.npmjs.com/package/dns-discovery) that Dat clients use (in addition to specifying their own if they need to), as well as a [DHT bootstrap](https://github.com/bittorrent/bootstrap-dht) server. These discovery servers are the only centralized infrastructure we need for Dat to work over the Internet, but they are redundant, interchangeable, never see the actual data being shared, anyone can run their own and Dat will still work even if they all are unavailable. If this happens discovery will just be manual (e.g. manually sharing IP/ports). Every data source that has a copy of the data also advertises themselves across these discovery networks. + +### 3.1.2 Peer Connections + +Up until this point we have just done searches to find who has the data we need. Now that we know who should talk to, we have to connect to them. Once we have a duplex binary connection to a peer we then layer on our own file sharing protocol on top, called [Hypercore](https://github.com/mafintosh/hypercore). + +In our implementation, we use either [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol), [UTP](https://en.wikipedia.org/wiki/Micro_Transport_Protocol) or WebRTC sockets for the actual peer to peer connections. UTP is nice because it is designed to *not* take up all available bandwidth on a network (e.g. so that other people sharing your wifi can still use the Internet). WebRTC support makes Dat work in modern web browsers using peer to peer connections. + +When we get the IP and port for a potential source we try to connect using all available protocols and hope one works. If one connects first, we abort the other ones. If none connect, we try again until we decide that source is offline or unavailable to use and we stop trying to connect to them. Sources we are able to connect to go into a list of known good sources, so that if our Internet connection goes down we can use that list to reconnect to our good sources again quickly. + +If we get a lot of potential sources we pick a handful at random to try and connect to and keep the rest around as additional sources to use later in case we decide we need more sources. A lot of these are parameters that we can tune for different scenarios later, but have started with some best guesses as defaults. + +The connection logic is implemented in a module called [discovery-swarm](https://www.npmjs.com/package/discovery-swarm). This builds on discovery-channel and adds connection establishment, management and statistics. You can see stats like how many sources are currently connected, how many good and bad behaving sources you've talked to, and it automatically handles connecting and reconnecting to sources for you. Our UTP support is implemented in the module [utp-native](https://www.npmjs.com/package/utp-native). + +So now we have found data sources, have connected to them, but we havent yet figured out if they *actually* have the data we need. This is where our file transfer protocol [Hyperdrive](https://www.npmjs.com/package/hyperdrive) comes in. This is explained in a later section. + +Peer connections types are outside the scope of the Dat protocol, but in the Dat implementation we make a best effort to make as successful connections using our default types as possible. This means employing peer to peer connection techniques like UDP hole punching. Our approach to hole punching is to use a central known server, in our case it is our DNS server, which is accessible on the public Internet. + +In a scenario where two peers A and B want to connect, and both know the central server: + +1. Peer A creates a local UDP socket and messages the central server that it is interested in connecting to people. +2. Central server messages Peer A back with a token that is a `hash(Peer A's remote IP + a local secret)`. The UDP packet contains the remote IP. +3. Peer A messages the central server with the token (this way you cannot spoof your IP and DDOS a remote peer) +4. Peer B does the same. +5. When the central server receives Peer B's message that it wants to connect to peers it forwards Peer B's message to Peer A and Peer A's message to Peer B. +6. Both peers now send a message to each other on their public IP and port. If UDP hole punching is supported by the routers of both peers, one of the messages should get through. +7. At this point we reuse the UDP socket to run UTP on top to get a streaming reliable interface. + +## 3.2 Content Integrity + +Content integrity means being able to verify the data you received is the exact same version of the data that you expected. This is important for reproducibility, as -- cgit v1.2.3 From 956a5e17b03b4d779e72d3c430126af9817b16b2 Mon Sep 17 00:00:00 2001 From: Max Ogden Date: Mon, 9 May 2016 12:31:25 +0200 Subject: add pdf generation --- dat-paper.md | 135 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ dat-paper.pdf | Bin 0 -> 197850 bytes package.json | 20 +++++++++ paper.md | 136 ---------------------------------------------------------- 4 files changed, 155 insertions(+), 136 deletions(-) create mode 100644 dat-paper.md create mode 100644 dat-paper.pdf create mode 100644 package.json delete mode 100644 paper.md diff --git a/dat-paper.md b/dat-paper.md new file mode 100644 index 0000000..99a08ef --- /dev/null +++ b/dat-paper.md @@ -0,0 +1,135 @@ +# Abstract + +Dat is a swarm based version control system designed for sharing large datasets over networks such that their contents can be accessed randomly, be updated incrementally, and have the integrity of their contents be trusted. Every Dat user is simultaneously a server and a client exchanging pieces of data with other peers in a swarm on demand. As data is added to a Dat repository updated files are split into pieces based on Rabin fingerprinting and deduplicated against known pieces to avoid retransmission of data. File contents are automatically verified using secure hashes meaning you do not need to trust other nodes. + +# 1. Introduction + +There are countless ways to share share datasets over the Internet today. The simplest and most widely used approach, sharing files over HTTP, is subject to dead links when files are moved or deleted, as HTTP has no concept of history or versioning built in. E-mailing datasets as attachments is also widely used, and has the concept of history built in, but many email providers limit the maximum attachment size which makes it impractical for many datasets. + +Cloud storage services like S3 ensure availability of data, but as they have a centralized hub-and-spoke networking model tend to be limited by their bandwidth, meaning popular files can be come very expensive to share. Services like Dropbox and Google Drive provide version control and synchronization on top of cloud storage services which fixes many issues with broken links but rely on proprietary code and infrastructure requiring users to store their data on cloud infrastructure which has implications on cost, transfer speeds, and user privacy. + +Distributed file sharing tools like BitTorrent become faster as files become more popular, removing the bandwidth bottleneck and making file distribution effectively free. They also implement discovery systems which prevents broken links meaning if the original source goes offline other backup sources can be automatically discovered. However P2P file sharing tools today are not supported by Web browsers and do not provide a mechanism for updating files without redistributing a new dataset which could mean entire redownloading data you already have. + +Decentralized version control tools for source code like Git provide a protocol for efficiently downloading changes to a set of files, but are optimized for text files and have issues with large files. Solutions like Git-LFS solve this by using HTTP to download large files, rather than the Git protocol. GitHub offers Git-LFS hosting but charges repository owners for bandwidth on popular files. Building a peer to peer distribution layer for files in a Git repository is difficult due to design of Git Packfiles which are delta compressed repository states that do not support random access to byte ranges in previous file versions. + +Science is an example of an important community that would benefit from better approaches in this area. Increasingly scientific datasets are being provided online using one of the above approaches and cited in published literature. Broken links and systems that do not provide version checking or content addressability of data directly limit the reproducibility of scientific analyses based on shared datasets. Services that charge a premium for bandwidth cause monetary and data transfer strain on the users sharing the data, who are often on fast public university networks with effectively unlimited bandwidth. Version control tools designed for text files do not keep up with the demands of large data analysis in science today. + +# 2. Inspiration + +Dat is inspired by a number of features from existing systems. + +## 2.1 Git + +Git popularized the idea of a Merkle DAG, a way to represent changes to data where each change is addressed by the secure hash of the change plus all previous hashes. This provides a way to trust data integrity, as the only way a specific hash could be derived by another peer is if they have the same data and change history required to reproduce that hash. This is important for reproducibility as it lets you trust that a specific git commit hash refers to a specific source code state. + +## 2.2 LBFS + +LBFS is a networked file system that avoids transferring redundant data by deduplicating common regions of files and only transferring unique regions once. The deduplication algorithm they use is called Rabin fingerprinting and works by hashing the contents of the file using a sliding window and looking for content defined chunk boundaries that probabilistically appear at the desired byte offsets (e.g. every 1kb). + +Content defined chunking has the benefit of being shift resistant, meaning if you insert a byte into the middle of a file only the first chunk boundary to the right of the insert will change, but all other boundaries will remain the same. With a fixed size chunking strategy, such as the one used by rsync, all chunk boundaries to the right of the insert will be shifted by one byte, meaning half of the chunks of the file would need to be retransmitted. + +## 2.3 BitTorrent + +BitTorrent implements a swarm based file sharing protocol for static datasets. Data is split into fixed sized chunks, hashed, and then that hash is used to discover peers that have the same data. An advantage of using BitTorrent for dataset transfers is that download bandwidth can be fully used. Since the file is split into pieces, and peers can efficiently discover which pieces each of the peers they are connected to have, it means one peer can download non-overlapping regions of the dataset from many peers at the same time in parallel, maximizing network throughput. + +Fixed sized chunking has drawbacks for data that changes (see LBFS above). BitTorrent assumes all metadata will be transferred up front which makes it impractical for streaming or updating content. Most BitTorrent clients divide data into 1024 pieces meaning large datasets could have a very large chunk size which impacts random access performance (e.g. for streaming video). + +## 2.4 Kademlia Distributed Hash Table + +Kademlia is a distributed hash table, in other words a distributed key/value store that can serve a similar purpose to DNS servers but has no hard coded server addresses. All clients in Kademlia are also servers. As long as you know at least one address of another peer in the network, you can ask them for the key you are trying to find and they will either have it or give you some other people to talk to that are more likely to have it. + +If you don't have an initial peer to talk to you have to use something like a bootstrap server that just randomly gives you a peer in the network to start with. If the bootstrap server goes down, the network still functions, and other methods can be used to bootstrap new peers (such as sending them peer addresses through side channels like how .torrent files include tracker addresses to try in case Kademlia finds no peers). + +Kademlia is distinct from previous DHT designs such as Chord due to its simplicity. It uses a very simple XOR operation between two keys as its distance metric to decide which peers are closer to the data being searched for. On paper it seems like it wouldn't work as it doesn't take into account things like ping speed or bandwidth. Instead its design is very simple on purpose to minimize the amount of control/gossip messages and to minimize the amount of complexity required to implement it. In practice Kademlia has been extremely successful and is widely deployed as the "Mainline DHT" for BitTorrent, with support in all popular BitTorrent clients today. + +## 2.5 Peer to Peer Streaming Peer Protocol (PPSPP) + +PPSPP ([IETF RFC 7574](https://datatracker.ietf.org/doc/rfc7574/?include_text=1)) is a protocol for live streaming content over a peer to peer network. In it they define a specific type of Merkle Tree that allows for subsets of the hashes to be requested by a peer in order to reduce the time-till-playback for end users. BitTorrent for example transfers all hashes up front, which is not suitable for live streaming. + +Their Merkle trees are ordered using a scheme they call "bin numbering", which is a method for deterministically arranging an append-only log of leaf nodes into an in-order layout tree where non-leaf nodes are derived hashes. If you want to verify a specific node, you only need to request its sibling's hash and all its uncle hashes. PPSPP is very concerned with reducing round trip time and time-till-playback by allowing for many kinds of optimizations, such as to pack as many hashes into datagrams as possible when exchanging tree information with peers. + +Although PPSPP was designed with streaming video in mind, the ability to request a subset of metadata from a large and/or streaming dataset is very desirable for many other types of datasets. + +## 2.6 WebTorrent + +With WebRTC browsers can now make peer to peer connections directly to other browsers. BitTorrent uses UDP sockets which aren't available to browser JavaScript, so can't be used as-is on the Web. + +WebTorrent implements the BitTorrent protocol in JavaScript using WebRTC as the transport. This includes the BitTorrent block exchange protocol as well as the tracker protocol implemented in a way that can enable hybrid nodes, talking simultaneously to both BitTorrent and WebTorrent swarms (if a client is capable of making both UDP sockets as well as WebRTC sockets, such as Node.js). Trackers are exposed to web clients over HTTP or WebSockets. + +## 2.7 InterPlanetary File System + +IPFS also builds on many of the concepts from this section and presents a new platform similar in scope to the Web that has content integrity, peer to peer file sharing, version history and data permanence baked in as a sort of upgrade to the current Web. Whereas Dat is one application of these ideas that is specifically focused on sharing datasets but is agnostic to what platform it is built on, IPFS goes lower level and abstracts network protocols and naming systems so that any application built on the Web can alternatively be built on IPFS to inherit it's properties, as long as their hyperlinks can be expressed as content addressed addresses to the IPFS global Merkle DAG. + +The research behind IPFS has coalesced many of these ideas into a more accessible format. We are still exploring how to best implement the Dat protocol on top of the IPFS platform. + +# 3. DESIGN + +Dat is a file sharing protocol that does not assume a dataset is static or that the entire dataset will be downloaded. The protocol is agnostic to the underlying transport e.g. you could implement Dat over carrier pigeon. The key properties of the Dat design are explained in this section. + +- 1. **Mirroring** - All participants in the network simultaneously share and consume data. +- 2. **Content Integrity** - Data and publisher integrity is verified through use of signed hashes of the content. +- 3. **Parallel Transfer** - Subsets of the data can be accessed from multiple peers simultaneously, improving transfer speeds. +- 4. **Streaming Updates** - Datasets can be updated and distributed in real time to downstream peers. +- 5. **Secure Metadata** - Dat employs a capability system whereby anyone with a Dat link can connect to the swarm, but the link itself is a secure hash that is nearly impossible to guess and is never leaked by Dat itself. + +## 3.1 Mirroring + +Dat is a peer to peer protocol designed to exchange pieces of a dataset amongst a swarm of peers. As soon as a peer acquires their first piece of data in the dataset they become a partial mirror for the dataset. If someone else contacts them and needs the piece they have, they can share it. This can happen simultaneously while the peer is still downloading the pieces they want. + +### 3.1.1 Source Discovery + +An important aspect of mirroring is source discovery, the techniques that peers use to find each other. Source discovery means finding the IP and port of data sources online that have a copy of that data you are looking for. You can then connect to them and begin exchanging data using the Dat file exchange protocol, Hypercore. By using source discovery techniques we are able to create a network where data can be discovered even if the original data source disappears. + +Source discovery can happen over many kinds of networks, as long as you can model the following actions: + +- `join(key, [port])` - Begin performing regular lookups on an interval for `key`. Specify `port` if you want to announce that you share `key` as well. +- `leave(key, [port])` - Stop looking for `key`. Specify `port` to stop announcing that you share `key` as well. +- `foundpeer(key, ip, port)` - Called when a peer is found by a lookup + +In the Dat implementation we implement the above actions on top of three types of discovery networks: + +- DNS name servers - An Internet standard mechanism for resolving keys to addresses +- Multicast DNS - Useful for discovering peers on local networks +- Kademlia Mainline Distributed Hash Table - Zero point of failure, increases probability of Dat working even if DNS servers are unreachable + +Additional discovery networks can be implemented as needed. We chose the above three as a starting point to have a complementary mix of strategies to increase the probability of source discovery. + +Our implementation of peer discovery is called discovery-channel. We also run a [custom DNS server](https://www.npmjs.com/package/dns-discovery) that Dat clients use (in addition to specifying their own if they need to), as well as a [DHT bootstrap](https://github.com/bittorrent/bootstrap-dht) server. These discovery servers are the only centralized infrastructure we need for Dat to work over the Internet, but they are redundant, interchangeable, never see the actual data being shared, anyone can run their own and Dat will still work even if they all are unavailable. If this happens discovery will just be manual (e.g. manually sharing IP/ports). Every data source that has a copy of the data also advertises themselves across these discovery networks. + +### 3.1.2 Peer Connections + +Up until this point we have just done searches to find who has the data we need. Now that we know who should talk to, we have to connect to them. Once we have a duplex binary connection to a peer we then layer on our own file sharing protocol on top, called [Hypercore](https://github.com/mafintosh/hypercore). + +In our implementation, we use either [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol), [UTP](https://en.wikipedia.org/wiki/Micro_Transport_Protocol) or WebRTC sockets for the actual peer to peer connections. UTP is nice because it is designed to *not* take up all available bandwidth on a network (e.g. so that other people sharing your wifi can still use the Internet). WebRTC support makes Dat work in modern web browsers using peer to peer connections. + +When we get the IP and port for a potential source we try to connect using all available protocols and hope one works. If one connects first, we abort the other ones. If none connect, we try again until we decide that source is offline or unavailable to use and we stop trying to connect to them. Sources we are able to connect to go into a list of known good sources, so that if our Internet connection goes down we can use that list to reconnect to our good sources again quickly. + +If we get a lot of potential sources we pick a handful at random to try and connect to and keep the rest around as additional sources to use later in case we decide we need more sources. A lot of these are parameters that we can tune for different scenarios later, but have started with some best guesses as defaults. + +The connection logic is implemented in a module called [discovery-swarm](https://www.npmjs.com/package/discovery-swarm). This builds on discovery-channel and adds connection establishment, management and statistics. You can see stats like how many sources are currently connected, how many good and bad behaving sources you've talked to, and it automatically handles connecting and reconnecting to sources for you. Our UTP support is implemented in the module [utp-native](https://www.npmjs.com/package/utp-native). + +So now we have found data sources, connected to them, but we haven't yet figured out if they *actually* have the data we need. This is where our file transfer protocol [Hyperdrive](https://www.npmjs.com/package/hyperdrive) comes in. This is explained in a later section. + +Peer connections types are outside the scope of the Dat protocol, but in the Dat implementation we make a best effort to make as many successful connections using our default types as possible. This means employing peer to peer connection techniques like UDP hole punching [?]. Our approach for UDP hole punching is to use a central known hole punching server which is accessible on the public Internet. In our implementation we re-use our custom DNS server by adding to it special functionality to facilitate peer message exchange for the purpose of hole punching. + +In a scenario where two peers A and B want to connect, and both know the central server, this is how we perform UDP hole punching: + +1. Peer A creates a local UDP socket and messages the central server that it is interested in connecting to people. +2. Central server messages Peer A back with a token that is a `hash(Peer A's remote IP + a local secret)`. The UDP packet contains the remote IP. +3. Peer A messages the central server with the token (this way you cannot spoof your IP and DDOS a remote peer) +4. Peer B does the same. +5. When the central server receives Peer B's message that it wants to connect to peers it forwards Peer B's message to Peer A and Peer A's message to Peer B. +6. Both peers now send a message to each other on their public IP and port. If UDP hole punching is supported by the routers of both peers at least one of the messages should get through. +7. At this point we reuse the UDP socket to run UTP on top to get a streaming reliable interface. + +## 3.2 Content Integrity + +Content integrity means being able to verify the data you received is the exact same version of the data that you expected. This is imporant in a distributed system as this mechanism will catch incorrect data sent by bad peers. It also has implications for reproducibility as it lets you refer to a specific version of the dataset you want. + +A common issue in data analysis is when data changes but the link to the data remains the same. For example, one day a file called data.zip might change, but a simple HTTP link to the file does not include a hash of the content, so clients that only have the HTTP link have no way to check if the file changed. Looking up a file by the hash of its content is called content addressability, and lets users not only verify that the data they receive is the version of the data they want, but also lets people cite specific versions of the data by referring to a specific hash. + +## 3.3 Parallel Transfer + +## 3.4 Streaming Updates + +## 3.5 Secure Metadata diff --git a/dat-paper.pdf b/dat-paper.pdf new file mode 100644 index 0000000..e6ef10c Binary files /dev/null and b/dat-paper.pdf differ diff --git a/package.json b/package.json new file mode 100644 index 0000000..beb8c10 --- /dev/null +++ b/package.json @@ -0,0 +1,20 @@ +{ + "name": "dat-docs", + "version": "1.0.0", + "description": "Documentation for Dat and the surrounding ecosystem.", + "main": "index.js", + "scripts": { + "paper": "pandoc --variable author=\"Maxwell Ogden, Karissa McKelvey, Mathias Buus\" --variable title=\"Dat - Distributed Dataset Synchronization And Versioning\" --variable date=\"Version 1.0.0, May 2016\" --variable classoption=twocolumn --variable papersize=a4paper -s dat-paper.md -o dat-paper.pdf" + }, + "repository": { + "type": "git", + "url": "git+https://github.com/datproject/docs.git" + }, + "keywords": [], + "author": "", + "license": "ISC", + "bugs": { + "url": "https://github.com/datproject/docs/issues" + }, + "homepage": "https://github.com/datproject/docs#readme" +} diff --git a/paper.md b/paper.md deleted file mode 100644 index 455bfc2..0000000 --- a/paper.md +++ /dev/null @@ -1,136 +0,0 @@ -# Dat - -## Distributed Dataset Synchronization And Versioning - -Draft 1 -Maxwell Ogden -max@maxogden.com -2016 - -## ABSTRACT - -Dat is a swarm based version control system designed for sharing large datasets over networks such that their contents can be accessed randomly, be updated incrementally, and have the integrity of their contents be trusted. Every Dat user is simultaneously a server and a client exchanging pieces of data with other peers in a swarm on demand. As data is added to a Dat repository updated files are split into pieces based on Rabin fingerprinting and deduplicated against known pieces to avoid retransmission of data. File contents are automatically verified using secure hashes meaning you do not need to trust other nodes. - -## 1. INTRODUCTION - -There are countless ways to share share datasets over the Internet today. The simplest and most widely used approach, sharing files over HTTP, is subject to dead links when files are moved or deleted, as HTTP has no concept of history or versioning built in. E-mailing datasets as attachments is also widely used, and has the concept of history built in, but many email providers limit the maximum attachment size which makes it impractical for many datasets. - -Cloud storage services like S3 ensure availability of data, but as they have a centralized hub-and-spoke networking model tend to be limited by their bandwidth, meaning popular files can be come very expensive to share. Services like Dropbox and Google Drive provide version control and synchronization on top of cloud storage services which fixes many issues with broken links but rely on proprietary code and infrastructure requiring users to store their data on cloud infrastructure which has implications on cost, transfer speeds, and user privacy. - -Distributed file sharing tools like BitTorrent become faster as files become more popular, removing the bandwidth bottleneck and making file distribution effectively free. They also implement discovery systems which fix the broken link issue meaning if the original source goes offline other backup sources can be automatically discovered. However P2P file sharing tools today are not supported by Web browsers and do not provide a mechanism for updating files without redistributing a new dataset which could mean entire redownloading data you already have. - -Decentralized version control tools for source code like Git provide a protocol for efficiently downloading changes to a set of files, but are optimized for text files and have issues with large files. Solutions like Git-LFS solve this by using HTTP to download large files, rather than the Git protocol. GitHub offers Git-LFS hosting but charges repository owners for bandwidth on popular files. Building a peer to peer distribution layer for files in a Git repository is difficult due to design of Git Packfiles which are delta compressed repository states that do not support random access to byte ranges in previous file versions. - -Science is an example of an important community that would benefit from better approaches in this area. Increasingly scientific datasets are being provided online using one of the above approaches, and cited in published literature. Broken links and systems that do not provide version checking or content addressability of data directly limit the reproducibility of scientific analyses based on shared datasets. Services that charge a premium for bandwidth cause monetary and data transfer strain on the users sharing the data, who are often on fast public university networks with effectively unlimited bandwidth. Version control tool designed for text files do not keep up with the demands of large data analysis in science today. - -## 2. INSPIRATION - -Dat is inspired by a number of features from existing systems. - -### Git - -Git popularized the idea of a Merkle DAG, a way to represent changes to data where each change is addressed by the secure hash of the change plus all previous hashes. This provides a way to trust data integrity, as the only way a specific hash could be derived by another peer is if they have the same data and change history required to reproduce that hash. This is important for reproducibility as it lets you trust that a specific git commit hash refers to a specific source code state. - -### LBFS - -LBFS is a networked file system that avoids transferring redundant data by deduplicating common regions of files and only transferring unique regions once. The deduplication algorithm they use is called Rabin fingerprinting and works by hashing the contents of the file using a sliding window and looking for content defined chunk boundaries that probabilistically appear at the desired byte offsets (e.g. every 1kb). - -Content defined chunking has the benefit of being shift resistant, meaning if you insert a byte into the middle of a file only the first chunk boundary to the right of the insert will change, but all other boundaries will remain the same. With a fixed size chunking strategy, such as the one used by rsync, all chunk boundaries to the right of the insert will be shifted by one byte, meaning half of the chunks of the file would need to be retransmitted. - -### BitTorrent - -BitTorrent implements a swarm based file sharing protocol for static datasets. Data is split into fixed sized chunks, hashed, and then that hash is used to discover peers that have the same data. An advantage of using BitTorrent for dataset transfers is that download speeds can be saturated. Since the file is split into pieces, and peers can efficiently discover which pieces each of the peers they are connected to have, it means one peer can download non-overlapping regions of the dataset from many peers at the same time in parallel, maximizing network throughput. - -Fixed sized chunking has drawbacks for data that changes (see LBFS above). Additionally, BitTorrent assumes all metadata will be transferred up front, and most clients divide data into 1024 pieces, meaning large datasets could have a very large chunk size which impacts random access performance (e.g. for streaming video over BitTorrent). - -### Kademlia Distributed Hash Table - -Kademlia is a distributed hash table, in other words a distributed key/value store that can serve a similar purpose to DNS servers but has no hard coded server addresses. All clients in Kademlia are also servers. As long as you know at least one address of another peer in the network, you can ask them for the key you are trying to find and they will either have it or give you some other people to talk to that are more likely to have it. - -If you don't have an initial peer to talk to you have to use something like a bootstrap server that just randomly gives you a peer in the network to start with. If the bootstrap server goes down, the network still functions, and other methods can be used to bootstrap new peers (such as sending them peer addresses through side channels like how .torrent files include tracker addresses to try in case Kademlia finds no peers). - -Kademlia is distinct from previous DHT designs such as Chord due to its simplicity. It uses a very simple XOR operation between two keys as it's distance metric to decide which peers are closer to the data being searched for. On paper it seems like it wouldn't work, as it doesn't take into account things like ping speed or bandwidth. Instead it's design is very simple on purpose, to minimize the amount of control/gossip messages, and to minimize the amount of complexity required to implement it. In practice Kademlia has been extremely successful and is widely deployed as the "Mainline DHT" for BitTorrent, with support in all popular BitTorrent clients today. - -### Peer to Peer Streaming Peer Protocol (PPSPP) - -PPSPP ([IETF RFC 7574](https://datatracker.ietf.org/doc/rfc7574/?include_text=1)) is a protocol for live streaming content over a peer to peer network. In it they define a specific type of Merkle Tree that allows for subsets of the hashes to be requested by a peer in order to reduce the time-till-playback for end users. BitTorrent for example transfers all hashes up front, which is not suitable for live streaming. - -Their Merkle trees are ordered using a scheme they call "bin numbering", which is a method for deterministically arranging an append-only log of leaf nodes into an in-order layout tree where non-leaf nodes are derived hashes. If you want to verify a specific node, you only need to request its sibling's hash and all its uncle hashes. PPSPP is very concerned with reducing round trip time and time-till-playback by allowing for many kinds of optimizations to pack as many hashes into datagrams as possible when exchanging tree information with peers. - -The ability to request a subset of metadata from a large and/or streaming dataset is very desirable for the Dat use case. - -### WebTorrent - -With WebRTC, browsers can now make peer to peer connections directly to other browsers. BitTorrent uses UDP sockets which aren't available to browser JavaScript, so can't be used as-is on the Web. - -WebTorrent implements the BitTorrent protocol in JavaScript using WebRTC as the transport. This includes the BitTorrent block exchange protocol as well as the tracker protocol implemented in a way that can enable hybrid nodes, talking simultaneously to both BitTorrent and WebTorrent swarms (if a peer is capable of making both UDP sockets as well as WebRTC sockets). Trackers are exposed to web clients over HTTP or WebSockets. In a normal web browser you can only use WebRTC to exchange data with peers. - -### InterPlanetary File System - -IPFS also builds on many of the concepts from this section and presents a new platform similar in scope to the Web that has content integrity, peer to peer file sharing, version history and data permanence baked in as a sort of upgrade to the current Web. Whereas Dat is one application of these ideas that is specifically focused on sharing datasets but is agnostic to what platform it is built on, IPFS goes lower level and abstracts network sockets and naming systems so that any application built on the Web can alternatively be built on IPFS to inherit it's properties, as long as their hyperlinks can be expressed as content addressed addresses to the IPFS global Merkle DAG. - -The research behind IPFS has coalesced many of these ideas into a more accessible format. We are still exploring how to best implement the Dat protocol on top of the IPFS platform. - -## 3. DESIGN - -Dat is a file sharing protocol that does not assume a dataset is static or that the entire dataset will be downloaded. The protocol is agnostic to the underlying transport, e.g. you could implement Dat over carrier pigeon. The key properties of the Dat design are explained in this section. - -- 1. **Mirroring** - All participants in the network simultaneously share and consume data. -- 2. **Content Integrity** - Data and publisher integrity is verified through use of signed content addressable hashes -- 3. **Parallel transfer** - Subsets of the data can be accessed from multiple peers simultaneously, improving transfer speeds -- 4. **Streaming updates** - Datasets can be updated and distributed in real time to downstream peers -- 5. **Secure Metadata** - Dat employs a capability system whereby anyone with a Dat link can connect to the swarm, but the link itself is a secure hash that is nearly impossible to guess - -## 3.1 Mirroring - -Dat is a peer to peer protocol designed to exchange pieces of a dataset amongst a swarm of peers. When a peer acquires their first piece of data in the dataset, they are now a partial mirror for the dataset. If someone else contacts them and needs the piece they have, they can share it. This can happen simultaneously while the peer is still downloading the pieces they want. - -### 3.1.1 Source Discovery - -An important aspect of mirring is source discovery, the techniques that peers use to find each other. Source discovery means finding the IP and port of data sources online that have a copy of that data you are looking for. You can then connect to them and begin exchanging data using the Dat file exchange protocol, Hypercore. By using source discovery techniques we are able to create a network where data can be discovered even if the original data source disappears. - -Source discovery can happen over many kinds of networks, as long as you can model the following actions: - -- `join(key, [port])` - Begin performing regular lookups on an interval for `key`. Specify `port` if you want to announce that you share `key` as well. -- `leave(key, [port])` - Stop looking for `key`. Specify `port` to stop announcing that you share `key` as well. -- `foundpeer(key, ip, port)` - Called when a peer is found by a lookup - -In the Dat implementation we implement the above actions on top of three types of discovery networks: - -- DNS name servers - An Internet standard mechanism for resolving keys to addresses -- Multicast DNS - Useful for discovering peers on local networks -- Kademlia Mainline Distributed Hash Table - Zero point of failure, increases probability of Dat working even if DNS servers are unreachable - -Additional discovery networks can be implemented as needed. We chose the above three as a starting point to have a complementary mix of strategies to increase the probability of source discovery. - -Our implementation of peer discovery is called discovery-channel. We also run a [custom DNS server](https://www.npmjs.com/package/dns-discovery) that Dat clients use (in addition to specifying their own if they need to), as well as a [DHT bootstrap](https://github.com/bittorrent/bootstrap-dht) server. These discovery servers are the only centralized infrastructure we need for Dat to work over the Internet, but they are redundant, interchangeable, never see the actual data being shared, anyone can run their own and Dat will still work even if they all are unavailable. If this happens discovery will just be manual (e.g. manually sharing IP/ports). Every data source that has a copy of the data also advertises themselves across these discovery networks. - -### 3.1.2 Peer Connections - -Up until this point we have just done searches to find who has the data we need. Now that we know who should talk to, we have to connect to them. Once we have a duplex binary connection to a peer we then layer on our own file sharing protocol on top, called [Hypercore](https://github.com/mafintosh/hypercore). - -In our implementation, we use either [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol), [UTP](https://en.wikipedia.org/wiki/Micro_Transport_Protocol) or WebRTC sockets for the actual peer to peer connections. UTP is nice because it is designed to *not* take up all available bandwidth on a network (e.g. so that other people sharing your wifi can still use the Internet). WebRTC support makes Dat work in modern web browsers using peer to peer connections. - -When we get the IP and port for a potential source we try to connect using all available protocols and hope one works. If one connects first, we abort the other ones. If none connect, we try again until we decide that source is offline or unavailable to use and we stop trying to connect to them. Sources we are able to connect to go into a list of known good sources, so that if our Internet connection goes down we can use that list to reconnect to our good sources again quickly. - -If we get a lot of potential sources we pick a handful at random to try and connect to and keep the rest around as additional sources to use later in case we decide we need more sources. A lot of these are parameters that we can tune for different scenarios later, but have started with some best guesses as defaults. - -The connection logic is implemented in a module called [discovery-swarm](https://www.npmjs.com/package/discovery-swarm). This builds on discovery-channel and adds connection establishment, management and statistics. You can see stats like how many sources are currently connected, how many good and bad behaving sources you've talked to, and it automatically handles connecting and reconnecting to sources for you. Our UTP support is implemented in the module [utp-native](https://www.npmjs.com/package/utp-native). - -So now we have found data sources, have connected to them, but we havent yet figured out if they *actually* have the data we need. This is where our file transfer protocol [Hyperdrive](https://www.npmjs.com/package/hyperdrive) comes in. This is explained in a later section. - -Peer connections types are outside the scope of the Dat protocol, but in the Dat implementation we make a best effort to make as successful connections using our default types as possible. This means employing peer to peer connection techniques like UDP hole punching. Our approach to hole punching is to use a central known server, in our case it is our DNS server, which is accessible on the public Internet. - -In a scenario where two peers A and B want to connect, and both know the central server: - -1. Peer A creates a local UDP socket and messages the central server that it is interested in connecting to people. -2. Central server messages Peer A back with a token that is a `hash(Peer A's remote IP + a local secret)`. The UDP packet contains the remote IP. -3. Peer A messages the central server with the token (this way you cannot spoof your IP and DDOS a remote peer) -4. Peer B does the same. -5. When the central server receives Peer B's message that it wants to connect to peers it forwards Peer B's message to Peer A and Peer A's message to Peer B. -6. Both peers now send a message to each other on their public IP and port. If UDP hole punching is supported by the routers of both peers, one of the messages should get through. -7. At this point we reuse the UDP socket to run UTP on top to get a streaming reliable interface. - -## 3.2 Content Integrity - -Content integrity means being able to verify the data you received is the exact same version of the data that you expected. This is important for reproducibility, as -- cgit v1.2.3 From b91d4d18bebd74c99ef1d112d912d5823efed1a9 Mon Sep 17 00:00:00 2001 From: Mathias Buus Date: Sun, 22 May 2016 20:08:27 +0000 Subject: more docs --- hyperdrive.md | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 58 insertions(+), 10 deletions(-) diff --git a/hyperdrive.md b/hyperdrive.md index a1b5ae5..daba5d6 100644 --- a/hyperdrive.md +++ b/hyperdrive.md @@ -181,7 +181,7 @@ This means that two datasets share a similar sequence of data the merkle tree he As described above the top hash of a merkle tree is the hash of all its content. This has both advantages and disadvanteges. -An advantage is that you can always reproduce a merkle tree simply by having the data contents of a merkle tree. +An advantage is that you can always reproduce a merkle tree simply by having the data contents of a merkle tree. A disadvantage is every time you add content to your data set your merkle tree hash changes and you'll need to re-distribute the new hash. @@ -264,7 +264,7 @@ These digests are very compact in size, only `(log2(number-of-blocks) + 2) / 8` ### Basic Privacy -(talk about the privacy features + public id here) +(talk about the privacy features + discovery key here) ## Hypercore Feeds @@ -284,12 +284,14 @@ This should be the first message sent and is also the only message without a typ ``` proto message Open { - required bytes publicId = 1; + required bytes feed = 1; required bytes nonce = 2; } ``` -The `publicId` should be set to the public id of the Merkle Tree as specified above. The `nonce` should be set to 24 bytes of random data. When running in encrypted mode this is the only message sent unencrypted. +The `feed` should be set to the discovery key of the Merkle Tree as specified above. The `nonce` should be set to 24 bytes of high entropy random data. When running in encrypted mode this is the only message sent unencrypted. + +When you are done using a channel send an empty message to indicate end-of-channel. #### `0` Handshake @@ -297,9 +299,8 @@ The message contains the protocol handshake. It has type `0`. ``` proto message Handshake { - optional uint64 version = 1; - required bytes peerId = 2; - repeated string extensions = 3; + required bytes id = 1; + repeated string extensions = 2; } ``` @@ -330,12 +331,59 @@ message Want { } ``` -You should only send the want message if you are interesting in a section of the feed that the other peer has not told you about. +You should only send the want message if you are interested in a section of the feed that the other peer has not told you about. #### `3` Request -#### `4` Response + +Send this message to request a block of data. You can request a block by block index or byte offset. If you are only interested +in the hash of a block you can set the hash property to true. The nodes property can be set to a tree digest of the tree nodes you already +have for this block or byte range. A request message has type `3`. + +``` proto +message Request { + optional uint64 block = 1; + optional uint64 bytes = 2; + optional bool hash = 3; + optional uint64 nodes = 4; +} +``` + +#### `4` Data + +Send a block of data to the other peer. You can use this message to reply to a request or optimistically send other blocks of data to the other client. It has type `4`. + +``` proto +message Data { + message Node { + required uint64 index = 1; + required uint64 size = 2; + required bytes hash = 3; + } + + required uint64 block = 1; + optional bytes value = 2; + repeated Node nodes = 3; + optional bytes signature = 4; +} +```` + #### `5` Cancel + +Cancel a previous sent request. It has type `5`. + +``` proto +message Cancel { + optional uint64 block = 1; + optional uint64 bytes = 2; +} +``` + #### `6` Pause + +An empty message that tells the other peer that they should stop requesting new blocks of data. It has type `6`. + #### `7` Resume -#### `8` Close + +An empty message that tells the other peer that they can continue requesting new blocks of data. It has type `7`. + -- cgit v1.2.3 From cb8fc2ed0a7cc12966447044922d1fd0684f9f71 Mon Sep 17 00:00:00 2001 From: Julian Gruber Date: Mon, 23 May 2016 15:44:09 +0200 Subject: typo --- dat-paper.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dat-paper.md b/dat-paper.md index 99a08ef..b4598dd 100644 --- a/dat-paper.md +++ b/dat-paper.md @@ -4,7 +4,7 @@ Dat is a swarm based version control system designed for sharing large datasets # 1. Introduction -There are countless ways to share share datasets over the Internet today. The simplest and most widely used approach, sharing files over HTTP, is subject to dead links when files are moved or deleted, as HTTP has no concept of history or versioning built in. E-mailing datasets as attachments is also widely used, and has the concept of history built in, but many email providers limit the maximum attachment size which makes it impractical for many datasets. +There are countless ways to share datasets over the Internet today. The simplest and most widely used approach, sharing files over HTTP, is subject to dead links when files are moved or deleted, as HTTP has no concept of history or versioning built in. E-mailing datasets as attachments is also widely used, and has the concept of history built in, but many email providers limit the maximum attachment size which makes it impractical for many datasets. Cloud storage services like S3 ensure availability of data, but as they have a centralized hub-and-spoke networking model tend to be limited by their bandwidth, meaning popular files can be come very expensive to share. Services like Dropbox and Google Drive provide version control and synchronization on top of cloud storage services which fixes many issues with broken links but rely on proprietary code and infrastructure requiring users to store their data on cloud infrastructure which has implications on cost, transfer speeds, and user privacy. -- cgit v1.2.3