aboutsummaryrefslogtreecommitdiffstats
path: root/hyperdrive.md
blob: 999a03c81e00355e8bbe3056dffeeb4d4b2006a0 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
# Hyperdrive + Hypercore Specification

## DRAFT Version 1

Hyperdrive is the peer-to-peer data distribution protocol that powers Dat. It consists of two parts. First there is hypercore which is the core protocol and swarm that handles distributing append-only logs of any binary data. The second part is hyperdrive which adds a filesystem specific protocol on top of hypercore.

## Hypercore

The goal of hypercore is to distribute append-only logs across a network of peers. Peers download parts of the logs from other peers and can choose to only download the parts of a log they care about. Logs can contain arbitrary binary data payloads.

A core goal is to be as simple and pragmatic as possible. This allows for easier implementations of clients which is an often overlooked property when implementing distributed systems. First class browser support is also an important goal as p2p data sharing in browsers is becoming more viable every day as WebRTC matures.

It also tries to be modular and export responsibilities to external modules whenever possible. Peer discovery is a good example of this as it handled by 3rd party modules that wasn't written with hyperdrive in mind. A benefit of this is a much smaller core implementation that can focus on smaller and simpler problems.

Prioritized synchronization of parts of a feed is also at the heart of hyperdrive as this allows for fast streaming with low latency of data such as structured datasets (wikipedia, genomic datasets), linux containers, audio, videos, and much more. To allow for low latency streaming another goal is also to keep verifiable block sizes as small as possible - even with huge data feeds.

The protocol itself draws heavy inspiration from existing file sharing systems such as BitTorrent and [PPSP](https://datatracker.ietf.org/doc/rfc7574/?include_text=1)

## How Hypercore works

### Flat Trees

A flat tree is a simple way represent a binary tree as a list. It also allows you to identify every node of a binary tree with a numeric index. Both of these properties makes it useful in distributed applications to simplify wire protocols that uses tree structures.

Flat trees are described in [PPSP RFC 7574 as "Bin numbers"](https://datatracker.ietf.org/doc/rfc7574/?include_text=1) and a node version is available through the [flat-tree](https://github.com/mafintosh/flat-tree) module.

A sample flat tree spanning 4 blocks of data looks like this:

```
0
  1
2
    3
4
  5
6
```

The even numbered entries represent data blocks (leaf nodes) and odd numbered entries represent parent nodes that have two children.

The depth of an tree node can be calculated by counting the number of trailing 1s a node has in binary notation.

```
5 in binary = 101 (one trailing 1)
3 in binary = 011 (two trailing 1s)
4 in binary = 100 (zero trailing 1s)
```

1 is the parent of (0, 2), 5 is the parent of (4, 6), and 3 is the parent of (1, 5).

If the number of leaf nodes is a power of 2 the flat tree will only have a single root.

Otherwise it'll have more than one. As an example here is a tree with 6 leafs:

```
0
  1
2
    3
4
  5
6

8
  9
10
```

The roots spanning all the above leafs are 3 an 9. Throughout this document we'll use following tree terminology:

* `parent` - a node that has two children (odd numbered)
* `leaf` - a node with no children (even numbered)
* `sibling` - the other node with whom a node has a mutual parent
* `uncle` - a parent's sibling

## Merkle Trees

A merkle tree is a binary tree where every leaf is a hash of a data block and every parent is the hash of both of its children.

Merkle trees are useful for ensuring the integrity of content.

Let's look at an example. Assume we have 4 data blocks, `(a, b, c, d)` and let `h(x)` be a hash function (the hyperdrive stack uses sha256 per default).

Using flat-tree notation the merkle tree spanning these data blocks looks like this:

```
0 = h(a)
  1 = h(0 + 2)
2 = h(b)
    3 = h(1 + 5)
4 = h(c)
  5 = h(4 + 6)
6 = h(d)
```

An interesting property of merkle trees is that the node 3 hashes the entire data set. Therefore we only need to trust node 3 to verify all data. However as we learned above there will only be a single root if there is a power of two data blocks.

Again lets expand our data set to contain 6 items `(a, b, c, d, e, f)`:

```
0 = h(a)
  1 = h(0 + 2)
2 = h(b)
    3 = h(1 + 5)
4 = h(c)
  5 = h(4 + 6)
6 = h(d)

8 = h(e)
  9 = h(8 + 10)
10 = h(f)
```

To ensure always have only a single root we'll simply hash all the roots together again. At most there will be `log2(number of data blocks)`.

In addition to hashing the roots we'll also include a bin endian uint64 binary representation of the corresponding node index.

Using the two above examples the final hashes would be:

```
hash1 = h(uint64be(#3) + 3)
hash2 = h(uint64be(#9) + 9 + uint64be(#3) + 3)
```

Each of these hashes can be used to fully verify each of the trees. Let's look at another example. Assume we trust `hash1` and another person wants to send block `0` to us. To verify block `0` the other person would also have to send the sibling hash and uncles until it reaches a root and the other missing root hashes. For the first tree that would mean hashes `(2, 5)`.

Using these hashes we can reproduce `hash1` in the following way:

```
0 = h(block received)
  1 = h(0 + 2)
2 = (hash received)
    3 = h(1 + 5)
  5 = (hash received)
```

If `h(uint64be(#3) + 3) == hash1` then we know that data we received from the other person is correct. They sent us `a` and the corresponding hashes.

Since we only need uncle hashes to verify the block the amount of hashes we need is at worst `log2(number-of-blocks)` and the roots of the merkle trees which has the same complexity.

A merkle tree generator is available on npm through the [merkle-tree-stream](https://github.com/mafintosh/merkle-tree-stream) module.

## Merkle Tree Deduplication

Merkle trees have another great property. They make it easy to deduplicate content that is similar.

Assume we have two similar datasets:

```
(a, b, c, d, e)
(a, b, c, d, f)
```

These two datasets are the same except their last element is different. When generating merkle trees for the two data sets you'd get two different root hashes out.

However if we look a the flat-tree notation for the two trees:

```
0
  1
2
    3
4
  5
6

8
```

We'll notice that the hash stored at 3 will be the same for both trees since the first four blocks are the same. Since we also send uncle hashes when sending a block of data we'll receive the hash for 3 when we request any block. If we maintain a simple index that maps a hash into the range of data it covers we can detect that we already have the data spanning 3 and we won't need to re-download that from another person.

```
1 -> (a, b)
3 -> (a, b, c, d)
5 -> (c, d)
```

This means that two datasets share a similar sequence of data the merkle tree helps you detect that.

## Signed Merkle Trees

As described above the top hash of a merkle tree is the hash of all its content. This has both advantages and disadvanteges.

An advantage is that you can always reproduce a merkle tree simply by having the data contents of a merkle tree. 

A disadvantage is every time you add content to your data set your merkle tree hash changes and you'll need to re-distribute the new hash.

Using a bit of cryptography however we can make our merkle tree appendable. First generate a cryptographic key pair that can be used to sign data using [ed25519](https://ed25519.cr.yp.to/) keys, as they are compact in size (32 byte public keys). A key pair (public key, secret key) can be used to sign data. Signing data means that if you trust a public key and you receive data and a signature for that data you can verify that a signature was generated with the corresponding secret key.

How does this relate to merkle trees? Instead of distributing the hash of a merkle tree we can distribute our public key instead. We then use our secret key to continously sign the merkle trees of our data set every time we append to it.

Assume we have a data set with only a single item in it `(a)` and a key pair `(secret, public)`:

```
(a)
```

We generate a merkle tree for this data set which will have the roots `0` and sign the hash of these roots (see the merkle tree section) with our secret key.

If we want to send `a` to another person and they trust our public key we simply send `a` and the uncles needed to generate the roots plus our signature.

If we append a new item to our data set we simply do the same thing:

```
(a, b)
```

Notice that all new signatures verify the entire dataset since they all sign a merkle tree that spans all data. This serves two purposes. First of all it makes sure that the dataset publisher cannot change old data. It also ensures that the publisher cannot share different versions of the same dataset to different persons without the other people noticing it (at some point they'll get a signature for the same node index that has different hashes if they talk to multiple people).

This technique has the added benefit that you can always convert a signed merkle tree to a normal unsigned one if you wish (or turn an unsigned tree into a signed tree).

In general you should send as wide as possible signed tree back when using signed merkle trees as that lowers the amount of signatures the other person needs to verify which has a positive performance impact for some platforms. It will also allow other users to more quickly detect if a tree has duplicated content.

## Block Tree Digest

When asking for a block of data we want to reduce the amount of duplicate hashes that are sent back.

In the merkle tree example for from earlier we ended up sending two hashes `(2, 5)` to verify block `0`.

```
// If we trust 3 then 2 and 5 are needed to verify 0

0
  1
2
    3
4
  5
6
```

Now if we ask for block `1` afterwards (`2` in flat tree notation) the other person doesn't need to send us any new hashes since we already received the hash for `2` when fetching block `0`.

If we only use non-signed merkle trees the other person can easily calculate which hashes we already have if we tell them which blocks we've got.

This however isn't always possible if we use a signed merkle tree since the roots are changing. In general it also useful to be able to communicate that you have some hashes already without disclosing all the blocks you have.

To communicate which hashes we have just have to communicate two things: which uncles we have and whether or not we have any parent node that can verify the tree.

Looking at the above tree that means if we want to fetch block `0` we need to communicate whether of not we already have the uncles `(2, 5)` and the parent `3`. This information can be compressed into very small bit vector using the following scheme.

Let the trailing bit donate whether or not the leading bit is a parent and not a uncle. Let the previous trailing bits denote wheather or not we have the next uncle.

For example for block `0` the following bit vector `1011` is decoded the following way

```
// for block 0

101(1) <-- tell us that the last bit is a parent and not an uncle
10(1)1 <-- we already have the first uncle, 2 so don't send us that
1(0)11 <-- we don't have the next uncle, 5
(1)000 <-- the final bit so this is parent. we have the next parent, 3
```

So using this digest the person can easily figure out that they only need to send us one hash, `5`, for us to verify block `0`.

The bit vector `1` (only contains a single one) means that we already have all the hashes we need so just send us the block.

These digests are very compact in size, only `(log2(number-of-blocks) + 2) / 8` bytes needed in the worst case. For example if you are sharing one trillion blocks of data the digest would be `(log2(1000000000000) + 2) / 8 ~= 6` bytes long.