1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
|
Title: **DEP-0004: Hyperdb**
Short Name: `0004-hyperdb`
Type: Standard
Status: Draft (as of 2018-05-06)
Github PR: [Draft](https://github.com/datprotocol/DEPs/pull/3)
Authors:
[Bryan Newbold](https://github.com/bnewbold),
[Stephen Whitmore](https://github.com/noffle),
[Mathias Buus](https://github.com/mafintosh)
# Summary
[summary]: #summary
Hyperdb is an abstraction layer providing a general purpose distributed
key/value store over the hypercore protocol. It is an iteration on the
hyperdrive directory tree implementation, building on top of the hypercore
append-only log abstraction layer. Keys are path-like strings (e.g.,
`/food/fruit/kiwi`), and values are arbitrary binary blobs (generally under a
megabyte).
Hyperdrive (used by the Dat application) is expected to be re-implemented on
top of hyperdb for improved performance with many files (e.g., millions). The
hyperdrive API should be largely unchanged, but the `metadata` format will be
backwards-incompatible.
# Motivation
[motivation]: #motivation
Hyperdb is expected to drastically improve performance of dat clients when
working with archives containing tens of thousands of files in single
directories. This is a real-world bottleneck for several current users, with
basic local actions such as adding a directory taking an unacceptably long time
to complete.
A secondary benefit is to refactor the [trie][trie]-structured key/value API
out of hyperdrive, allowing third party code to build applications directly on
this abstraction layer.
[trie]: https://en.wikipedia.org/wiki/Trie
# Usage Documentation
[usage-documentation]: #usage-documentation
*This section describes Hyperdb's interface and behavior in the abstract for
application programmers. It is not intended to be exact documentation of any
particular implementation (including the reference Javascript module).*
Hyperdb is structured to be used much like a traditional hierarchical
filesystem. A value can be written and read at locations like `/foo/bar/baz`,
and the API supports querying or tracking values at subpaths, like how watching
for changes on `/foo/bar` will report both changes to `/foo/bar/baz` and also
`/foo/bar/19`.
Lower-level details of the hypercore append-only log, disk serialization, and
networked synchronization features that Hyperdb builds on top of are not
described in detail here; see the [DEP repository][deps]. Multi-writer
hypercore semantics are also not discussed in this DEP.
[deps]: https://github.com/datprotocol/DEPs
A Hyperdb database instance can be represented by a single hypercore feed (or
several feeds in a multi-writer context), and is named, referenced, and
discovered using the public and discovery keys of the hypercore feed (or the
original feed if there are several). In a single-writer configuration, only a
single node (holding the secret key) can mutate the database (e.g., via `put` or
`delete` actions).
**Keys** can be any UTF-8 string. Path segments are separated by the forward
slash character (`/`). Repeated slashes (`//`) are disallowed. Leading and
trailing `/` are optional in application code: `/hello` and `hello` are
equivalent. A key can be both a "path segment" and key at the same time; e.g.,
`/a/b/c` and `/a/b` can both be keys at the same time.
**Values** can be any binary blob, including empty (of zero length). For
example, values could be UTF-8 encoded strings, JSON encoded objects, protobuf
messages, or a raw `uint64` integer (of either endian-ness). Length is the only
form of type or metadata stored about the value; deserialization and validation
are left to library and application developers.
## Core API Semantics
[core_api]: #core_api
A `db` is instantiated by opening an existing hypercore with hyperdb content
(read-only, or optionally read-write if the secret key is available), or
creating a new one. A handle could represent any specific revision in history,
or the "latest" revision.
`db.put(key, value)`: inserts `value` (arbitrary bytes) under the path `key`.
Requires read-write access. Returns an error (e.g., via callback) if there was a
problem.
`db.get(key)`: Reading a non-existent `key` is an error. Read-only.
`db.delete(key)`: Removes the key from the database. Deleting a non-existent
key is an error. Requires read-write access.
`db.list(prefix)`: returns a flat (not nested) list of all keys currently in
the database under the given prefix. Prefixes operate on a path-segment basis:
`/ab` is not a valid prefix for key `/abcd`, but is valid for `/ab/cd`. If the
prefix does not exist, returns an empty list. The order of returned keys is
implementation (or configuration) specific. Default listing is recursive
(implementations may have a flag to control this behavior). Read-only.
If the hypercore underlying a hyperdb is only partially replicated, behavior is
implementation-specific. For example, a `get()` call could block until the
relevant value is replicated, or the implementation could return an error.
An example pseudo-code session working with a database might be:
db.put('/life/animal/mammal/kitten', '{"cuteness": 500.3}')
db.put('/life/plant/bush/banana', '{"delicious": 103.4}')
db.delete('/life/plant/bush/banana')
db.put('/life/plant/tree/banana', '{"delicious": 103.4}')
db.get('/life/animal/mammal/kitten')
=> {"cuteness": 500.3}
db.list('/life/')
=> ['/life/animal/mammal/kitten', '/life/plant/tree/banana']
# Reference Documentation
[reference-documentation]: #reference-documentation
A hyperdb hypercore feed typically consists of a sequence of protobuf-encoded
messages of "Entry" type. Higher-level protocols may make exception to this,
for example by prepending an application-specific metadata message as the first
entry in the feed. There is sometimes a second "content" feed associated with
the primary hyperdb key/value feed, to store data that does not fit in the
(limited) `value` size constraint.
The sequence of entries includes an incremental index: the most recent entry in
the feed contains metadata pointers that can be followed to efficiently look up
any key in the database without needing to linear scan the entire history or
generate an independent index data structure. Implementations are, of course,
free to maintain their own index if they prefer.
The Entry protobuf message schema is:
message Entry {
message Feed {
required bytes key = 1;
}
required string key = 1;
optional bytes value = 2;
required bytes trie = 3;
repeated uint64 clock = 4;
optional uint64 inflate = 5;
repeated Feed feeds = 6;
optional bytes contentFeed = 7;
}
Some fields are are specific to the multi-writer features described in their
own DEP and mentioned only partially here. The fields common to both message
types are:
- `key`: UTF-8 key that this node describes. Leading and trailing forward
slashes (`/`) are always stripped before storing in protobuf.
- `value`: arbitrary byte array. A non-existent `value` entry indicates that
this Entry indicates a deletion of the key; this is distinct from specifying
an empty (zero-length) value.
- `trie`: a structured array of pointers to other Entry entries in the feed,
used for navigating the tree of keys.
- `clock`: reserved for use in the forthcoming `multi-writer` standard. An
empty list is the safe (and expected) value for `clock` in single-writer use
cases.
- `inflate`: a "pointer" (reference to a feed index number) to the most recent
entry in the feed that has the optional `feeds` and `contentFeed` fields set.
Not defined if `feeds` and `contentFeed` are defined in the same message.
- `feeds`: reserved for use with `multi-writer`. The safe single-writer value is
to use the current feed's hypercore public key.
- `contentFeed`: for applications which require a parallel "content" hypercore
feed for larger data, this field can be used to store the 32-byte public key
for that feed. This field should have a single value for the entire history
of the feed (aka, it is not mutable).
For the case of a single-writer feed, not using multi-writer features, it is
sufficient to write a single Entry message as the first entry in the hypercore
feed, with `feeds` containing a single entry (a pointer to the current feed
itself), and `contentFeed` optionally set to a pointer to a paired content
feed.
If *either* `feeds` *or* `contentFeed` are defined in an entry, *both* fields
must be defined, so the new message can be referred to via `inflated`. In this
case the entry is refereed to as an "inflated entry".
## Path Hashing
[path_hashing]: #path_hashing
Every key path has component-wise fixed-size hash representation that is used
by the trie. The concatenation of all path hashes yields a "path hash array"
for the entire key. Note that analogously to a hash map data structure, there
can be more than one key (string) with the same key hash in the same database
with no problems: the hash points to a linked-list "bucket" of Entries, which
can be iterated over linearly to find the correct value.
The path hash is represented by an array of bytes. Elements are 2-bit encoded
(values 0, 1, 2, 3), except for an optional terminating element which has value
4. Each path element consists of 32 values, representing a 64-bit hash of that
path element. For example, the key `/tree/willow` has two path segments (`tree`
and `willow`), and will be represented by a 65 element path hash array (two 32
element hashes plus a terminator).
The hash algorithm used is `SipHash-2-4`, with an 8-byte output and
16-byte key; the input is the UTF-8 encoded path string segment, without any
`/` separators or terminating null bytes. An implementation of this hash
algorithm is included in the libsodium library in the form of the
`crypto_shorthash()` function. A 16-byte "secret" key is required; for this use
case we use all zeros.
When converting the 8-bytehash to an array of 2-bit bytes, the ordering is
proceed byte-by-byte, and for each take the two lowest-value bits (aka, `hash &
0x3`) as byte index `0`, the next two bits (aka, `hash & 0xC`) as byte index
`1`, etc. When concatenating path hashes into a longer array, the first
("left-most") path element hash will correspond to byte indexes 0 through 31;
the terminator (`4`) will have the highest byte index.
For example, consider the key `/tree/willow`. `tree` has a hash `[0xAC, 0xDC,
0x05, 0x6C, 0x63, 0x9D, 0x87, 0xCA]`, which converts into the array:
```
[ 0, 3, 2, 2, 0, 3, 1, 3, 1, 1, 0, 0, 0, 3, 2, 1, 3, 0, 2, 1, 1, 3, 1, 2, 3, 1, 0, 2, 2, 2, 0, 3 ]
```
`willow` has a 64-bit hash `[0x72, 0x30, 0x34, 0x39, 0x35, 0xA8, 0x21, 0x44]`,
which converts into the array:
```
[ 2, 0, 3, 1, 0, 0, 3, 0, 0, 1, 3, 0, 1, 2, 3, 0, 1, 1, 3, 0, 0, 2, 2, 2, 1, 0, 2, 0, 0, 1, 0, 1 ]
```
These combine into the unified byte array with 65 elements:
```
[ 0, 3, 2, 2, 0, 3, 1, 3, 1, 1, 0, 0, 0, 3, 2, 1, 3, 0, 2, 1, 1, 3, 1, 2, 3, 1, 0, 2, 2, 2, 0, 3,
2, 0, 3, 1, 0, 0, 3, 0, 0, 1, 3, 0, 1, 2, 3, 0, 1, 1, 3, 0, 0, 2, 2, 2, 1, 0, 2, 0, 0, 1, 0, 1,
4 ]
```
As another example, the key `/a/b/c` converts into the 97-byte hash array:
```
[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
0, 1, 2, 3, 2, 2, 2, 0, 3, 1, 1, 3, 0, 3, 1, 3, 0, 1, 0, 1, 3, 2, 0, 2, 2, 3, 2, 2, 3, 3, 2, 3,
0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0,
4 ]
```
Note that "hash collisions" are rare with this hashing scheme, but are likely
to occur with large databases (millions of keys), and collision have been
generated as a proof of concept. Implementations should take care to properly
handle collisions by verifying keys and following bucket pointers (see the next
section).
An example hash collision (useful for testing; thanks to Github user
`dcposch`):
/mpomeiehc
/idgcmnmna
<!--
Generation code (javascript) for the above:
var sodium = require('sodium-universal')
var toBuffer = require('to-buffer')
var KEY = Buffer.alloc(16)
var OUT = Buffer.alloc(8)
sodium.crypto_shorthash(OUT, toBuffer('tree'), KEY)
console.log("tree: ", OUT)
console.log(hash('tree', true))
sodium.crypto_shorthash(OUT, toBuffer('willow'), KEY)
console.log("willow: ", OUT)
console.log(hash('willow', true))
sodium.crypto_shorthash(OUT, toBuffer('a'), KEY)
console.log("a: ", OUT)
console.log(hash('a', true))
Then, to manually "expand" arrays in python3:
hash_array = [0x72, 0x30, 0x34, 0x39, 0x35, 0xA8, 0x21, 0x44]
path = []
tmp = [(x & 0x3, (x >> 2) & 0x3, (x >> 4) & 0x3, (x >> 6) & 0x3) for x in hash_array]
[path.extend(e) for e in tmp]
path
-->
## Incremental Index Trie
[trie_index]: #trie_index
Each node stores a *prefix [trie](https://en.wikipedia.org/wiki/Trie)* that
can be used to look up other keys, or to list all keys with a given prefix.
This is stored under the `trie` field of the Entry protobuf message.
The trie effectively mirrors the path hash array. Each element in the `trie`
is called a "bucket". Each non-empty bucket points to the newest Entries which
have an identical path up to that specific prefix location; because the trie
has 4 "values" at each node, there can be pointers to up to 3 other values at a
given element in the trie array. Buckets can be empty if there are no nodes
with path hashes that differ for the first time the given bucket (aka, there
are no "branches" at this node in the trie). Only non-null elements will be
transmitted or stored on disk.
The data structure of the trie is a sparse array of pointers to other Entry
entries. Each pointer indicates a feed index and an entry index pointer, and is
associated with a 2-bit value; for the non-multi-writer case, the feed index is
always 0, so we consider only the entry index.
For a `trie` with `N` buckets, each may have zero or more pointers. Typically
there are a maximum of 3 pointers per bucket, corresponding to the 4 possible
values minus the current Entry's value, but in the event of hash collisions (in
the path array space), there may be multiple pointers in the same bucket
corresponding to the same value.
To lookup a key in the database, the recipe is to:
1. Calculate the path hash array for the key you are looking for.
2. Select the most-recent ("latest") Entry for the feed.
3. Compare path hash arrays. If the paths match exactly, compare keys; they
match, you have found the you were looking for! Check whether the `value` is
defined or not; if not, this Entry represents that the key was deleted from
the database.
4. If the paths match, but not the keys, look for a pointer in the last `trie`
array index, and iterate from step #3 with the new Entry.
5. If the paths don't entirely match, find the first index at which the two
arrays differ, and look up the corresponding element in this Entry's `trie`
array. If the element is empty, or doesn't have a pointer corresponding to
your 2-bit value, then your key does not exist in this hyperdb.
6. If the trie element is not empty, then follow that pointer to select the
next Entry. Recursively repeat this process from step #3; you will be
descending the `trie` in a search, and will either terminate in the Entry you
are looking for, or find that the key is not defined in this hyperdb.
Similarly, to write a key to the database:
1. Calculate the path hash array for the key, and start with an empty `trie` of
the same length; you will write to the `trie` array from the current index,
which starts at 0.
2. Select the most-recent ("latest") Entry for the feed.
3. Compare path hash arrays. If the paths match exactly, and the keys match, then
you are overwriting the current Entry, and can copy the "remainder" of it's
`trie` up to your current `trie` index.
4. If the paths match, but not the keys, you are adding a new key to an
existing hash bucket. Copy the `trie` and extend it to the full length. Add
a pointer to the Entry with the same hash at the final array index.
5. If the paths don't entirely match, find the first index at which the two
arrays differ. Copy all `trie` elements (empty or not) into the new `trie`
for indices between the "current index" and the "differing index".
6. Next look up the corresponding element in this Entry's `trie` array at the
differing index. If this element is empty, then you have found the most
similar Entry. Write a pointer to this node to the `trie` at
the differing index, and you are done (all remaining `trie` elements are
empty, and can be omitted).
7. If the differing tree element has a pointer (is not empty), then follow that
pointer to select the next `Entry`. Recursively repeat this process from step
#3.
To delete a value, follow the same procedure as adding a key, but write the
`Entry` without a `value` (in protobuf, this is distinct from having a `value`
field with zero bytes). Deletion nodes may persist in the database forever.
## Binary Trie Encoding
[trie_encoding]: #trie_encoding
The following scheme is used to encode trie data structures (sparse, indexed
arrays of pointers to entries) into a variable-length bytestring as the `trie`
field of an Entry protobuf message.
Consider a trie array with `N` buckets and `M` non-empty buckets (`0 <= M <=
N`). In the encoded field, there will be `M` concatenated bytestrings of the
form:
- trie index (varint), followed by...
- bucket bitfield (packed in a varint), encoding which of the 5 values (4
values if the index is not modulo 32) at this node of the trie have pointers,
followed by one or more...
- pointer sets, each referencing an entry at (feed index, entry index):
- feed index (varint, with a extra "more pointers at this value" low bit,
encoded as `feed_index << 1 | more_bit`)
- entry index (varint)
In the common case for a small/sparse hyperdb, there will a small number of
non-empty buckets (small `M`), a usually a single `(feed index, entry index)`
pointer for those non-empty buckets. For a very large/dense hyperdb (millions
of key/value pairs), there will be many non-empty buckets (`M` approaching
`N`), and buckets may have up to the full 4 pointer sets. Even with millions of
entries, hash collisions will be very rare; in those cases there will be
multiple pointers in the same pointer set.
Consider an entry with path hash:
```
[ 1, 1, 0, 0, 3, 1, 2, 3, 3, 1, 1, 1, 2, 2, 1, 1, 1, 0, 2, 3, 3, 0, 1, 2, 1, 1, 2, 3, 0, 0, 2, 1,
0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 3, 1, 0, 0, 2, 3, 0, 1, 3, 2, 0, 3, 2, 0, 1, 0, 3, 2, 0, 2, 1, 1,
4 ]
```
and trie:
```
[ , { val: 1, feed: 0, index: 1 } ]
```
In this case `N` is 64 (or you could count as 2 if you ignore trailing empty
entries) and `M` is 1. There will be a single bytestring chunk:
- index varint is `1` (second element in the trie array)
- bitfield is `0b0010` (varint 2): there is only a pointer set for value 1 (the second value)
- there is a single pointer in the pointer set, which is (`feed=0 << 1 | 0`,
`index=1`), or (varint 2, varint 1)
Combined, the `trie` bytestring will be:
```
[0x01, 0x02, 0x02, 0x02]
```
For a more complex example, consider the same path hash, but the trie:
```
[ , { val: 1, feed: 0, index: 1; val: 2, feed: 5, index: 3; val: 3, feed: 6, index: 98 }, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
{ val: 4, feed: 0, index, 23; val: 4, feed: 1, index: 17 ]
```
Now `M` is 2. The first bytestring chunk will be:
- index varint is `1` (second element in the trie array)
- bitfield is `0b01110` (varint 9): there are three pointer sets
- first pointer set is (`feed=0 << 1 | 0`, `index=1`) or (varint 2, varint 1)
- second pointer set is (`feed=5 << 1 | 0`, `index=3`) or (varint 10, varint 3)
- third pointer set is (`feed=6 << 1 | 0`, `index=98`) or (varint 12, varint 98)
the second bytestring chunk would be:
- index varint is `64` (65th element in the trie array; the terminating value)
- bitfield is `0b10000` (varint 1); there is a single pointer set... but it
contains a hash collision, so there are two pointers
- first pointer is (`feed=0 << 1 | 1`, `index=23`) or (varint 1, varint=23);
note the `more` bit is set high!
- second pointer is (`feed=1 << 1 | 0`, `index=17`) or (varint 2, varint 17);
note the `more` bit is low, as usual. In the extremely unlikely case of
multiple collisions there could have been more pointers with `more` high
preceding this one.
The overall bytestring would be:
```
[0x01, 0x09, 0x02, 0x01, 0x0A, 0x03, 0x0C, 0x62, 0x40, 0x10, 0x01, 0x17, 0x02, 0x11]
```
# Examples
[examples]: #examples
## Simple Put and Get
Starting with an empty hyperdb `db`, if we `db.put('/a/b', '24')` we expect to
see a single `Entry` and index 0:
```
{ key: 'a/b',
value: '24',
trie:
[ ] }
```
For reference, the path hash array for this key (index 0) is:
```
[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
0, 1, 2, 3, 2, 2, 2, 0, 3, 1, 1, 3, 0, 3, 1, 3, 0, 1, 0, 1, 3, 2, 0, 2, 2, 3, 2, 2, 3, 3, 2, 3,
4 ]
```
Note that the first 64 bytes in path match those of the `/a/b/c` example from
the [path hashing][path_hash] section, because the first two path components
are the same. Since this is the first entry, the entry index is 0.
Now we `db.put('/a/c', 'hello')` and expect a second Entry:
```
{ key: 'a/c',
value: 'hello',
trie:
[ , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
, , { element: 2, feed: 0, index: 0 } ] }
```
The path hash array for this key (index 1) is:
```
[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0,
4 ]
```
The first 32 characters of path are common with the first Entry (they share a
common prefix, `/a`).
`trie` is defined, but mostly sparse. The first 32 elements of common prefix
match the first Entry, and then two additional hash elements (`[0, 1]`) happen
to match as well; there is not a differing entry until index 34 (zero-indexed).
At this entry there is a reference pointing to the first Entry. An additional 29
trailing null entries have been trimmed in reduce metadata overhead.
Next we insert a third node with `db.put('/x/y', 'other')`, and get a third Entry:
```
{ key: 'x/y',
value: 'other',
trie:
[ , { val: 1, feed: 0, index: 1 } ],
```
The path hash array for this key (index 2) is:
```
[ 1, 1, 0, 0, 3, 1, 2, 3, 3, 1, 1, 1, 2, 2, 1, 1, 1, 0, 2, 3, 3, 0, 1, 2, 1, 1, 2, 3, 0, 0, 2, 1,
0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 3, 1, 0, 0, 2, 3, 0, 1, 3, 2, 0, 3, 2, 0, 1, 0, 3, 2, 0, 2, 1, 1,
4 ]
```
Consider the lookup-up process for `db.get('/a/b')` (which we expect to
successfully return `'24'`, as written in the first Entry). First we calculate
the path for the key `a/b`, which will be the same as the first Entry. Then we
take the "latest" Entry, with entry index 2. We compare the path hash arrays,
starting at the first element, and find the first difference at index 1 (`1 ==
1`, then `1 != 2`). We look at index 1 in the current Entry's `trie` and find a
pointer to entry index 1, so we fetch that Entry and recurse. Comparing path
hash arrays, we now get all the way to index 34 before there is a difference.
We again look in the `trie`, find a pointer to entry index 0, and fetch the
first Entry and recurse. Now the path elements match exactly; we have found the
Entry we are looking for, and it has an existent `value`, so we return the
`value`.
Consider a lookup for `db.get('/a/z')`; this key does not exist, so we expect
to return with "key not found". We calculate the path hash array for this key:
```
[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
1, 2, 3, 0, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 0, 1, 0, 3, 3, 2, 0, 3, 3, 1, 1, 0, 2, 1, 0, 1, 1, 2,
4 ]
```
Similar to the first lookup, we start with entry index 2 and follow the pointer to
entry index 1. This time, when we compare path hash arrays, the first differing
entry is at array index `32`. There is no `trie` entry at this index, which
tells us that the key does not exist in the database.
## Listing a Prefix
Continuing with the state of the database above, we call `db.list('/a')` to
list all keys with the prefix `/a`.
We generate a path hash array for the key `/a`, without the terminating symbol
(`4`):
```
[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2 ]
```
Using the same process as a `get()` lookup, we find the first Entry that
entirely matches this prefix, which will be entry index 1. If we had failed to
find any Entry with a complete prefix match, then we would return an empty list
of matching keys.
Starting with the first prefix-matching node, we save that key as a match
(unless the Entry is a deletion), then select all `trie` pointers with an index
higher than the prefix length, and recursively inspect all pointed-to Entries.
## Deleting a Key
Continuing with the state of the database above, we call `db.delete('/a/c')` to
remove that key from the database.
The process is almost entirely the same as inserting a new Entry at that key,
except that the `value` field is undefined. The new Entry (at entry index 3)
is:
```
{ key: 'a/c',
value: ,
trie: [ , { val: 1, feed: 0, index: 2 }, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
, , { val: 1, feed: 0, index: 0 } ] }
```
The path hash array for this Entry (key) is:
```
[ 1, 2, 0, 1, 2, 0, 2, 2, 3, 0, 1, 2, 1, 3, 0, 3, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 3, 2, 1, 1, 2,
0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0,
4 ]
```
# Drawbacks
[drawbacks]: #drawbacks
A backwards-incompatible change will have negative effects on the broader dat
ecosystem: clients will need to support both versions protocol for some time
(increasing maintenance burden), future clients may not inter-operate with old
archives, etc. These downsides can partially be avoided by careful roll-out.
For the specific use case of Dat archives, hyperdb will trivially increase
metadata size (and thus disk and network consumption) for archives with few
files.
# Overhead and Scaling
[overhead]: #overhead
The metadata overhead for a single database entry varies based on the size of
the database. In a "heavy" case, considering a two-path-segment key with an
entirely saturated `trie` and `uint32`-sized feed and entry index pointers, and
ignoring multi-writer fields:
- `trie`: 4 * 2 * 64 bytes = 512 bytes
- total: 512 bytes
In a "light" case, with few `trie` entries and single-byte varint feed and
entry index pointers:
- `trie`: 2 * 2 * 4 bytes = 16 bytes
- total: 16
For a database with most keys having N path segments, the cost of a `get()`
scales with the number of entries M as `O(log(M))` with best case 1 lookup and
worst case `4 * 32 * N = 128 * N` lookups (for a saturated `trie`).
The cost of a `put()` or `delete()` is proportional to the cost of a `get()`.
The cost of a `list()` is linear (`O(M)`) in the number of matching entries,
plus the cost of a single `get()`.
The total metadata overhead for a database with M entries scales with `O(M
* log(M))`.
# Privacy and Security Considerations
[privacy]: #privacy
The basic key/value semantics of hyperdb (as discussed in this DEP, not
considering multi-writer changes) are not known to introduce new privacy issues
when compared with, e.g., storing binary values at key-like paths using the
current ("legacy") hyperdrive system.
A malicious writer could cause trouble for readers, even readers who do not
trust the application-level contents of a feed. Implementations which may be
exposed to arbitrary feeds from unknown sources on the internet should take
care to the following scenarios: A malicious writer may be able to produce very
frequent key path hash collisions, which could degrade to linear performance. A
malicious writer could send broken trie structures that contain pointer loops,
duplicate pointers, or other invalid contents. A malicious writer could write
arbitrary data in value fields in an attempt to exploit de-serialization bugs.
A malicious writer could leverage non-printing unicode characters to create
database entries with user-indistinguishable names (keys).
# Rationale and alternatives
[alternatives]: #alternatives
A major motivator for hyperdb is to improve scaling performance with tens of
thousands through millions of files per directory in the existing hyperdrive
implementation. The current implementation requires the most recent node in a
directory to point to all other nodes in the directory. Even with pointer
compression, this requires on the order of `O(N^2)` bytes; the hyperdb
implementation scales with `O(N log(N))`.
The hyperdb specification (this document) is complicated by the inclusion of
new protobuf fields to support "multi-writer" features which are not described
here. The motivation to include these fields now to make only a single
backwards-incompatible schema change, and to make a second software-only change
in the future to enable support for these features. Schema and data format
changes are considered significantly more "expensive" for the community and
software ecosystem compared to software-only changes. Attempts have been made
in this specification to indicate the safe "single-writer-only" values to use
for these fields.
# Dat migration logistics
[migration]: #migration
Hyperdb is not backwards compatible with the existing hyperdrive metadata,
meaning dat clients may need to support both versions during a transition
period. This applies both to archives saved to disk (e,g., in SLEEP) and to
archives received and published to peers over the network.
No changes to the Dat network wire protocol itself are necessary, only changes
to content passed over the protocol. The Dat `content` feed, containing raw
file data, is not impacted by hyperdb, only the contents of the `metadata`
feed.
Upgrading a Dat (hyperdrive) archive to hyperdb will necessitate creating a new
feed from scratch, meaning new public/private key pairs, and that public key
URL links will need to change.
Further logistical details are left to the forthcoming Multi-Writer DEP.
# Unresolved questions
[unresolved]: #unresolved-questions
Need to think through deletion process with respect to listing a path prefix;
will previously deleted nodes be occluded, or potentially show up in list
results? Should be reviewed (by a non-author of this document) before accepted
as a Draft.
Can the deletion process (currently leaving "tombstone" entries in the `trie`
forever) be improved, such that these entries don't need to be iterated over?
mafintosh mentioned this might be in the works. Does this DEP need to "leave
room" for those changes, or should we call out the potential for future change?
Probably not, should only describe existing solutions. This can be resolved
after Draft.
There are implied "reasonable" limits on the size (in bytes) of both keys and
values, but they are not formally specified. Protobuf messages have a hard
specified limit of 2 GByte (due to 32-bit signed arithmetic), and most
implementations impose a (configurable) 64 MByte limit. Should this DEP impose
specific limits on key and value sizes? Would be good to decide before Draft
status.
Apart from leaving fields in the protobuf message specification, multi-writer
concerns are explicitly out of scope for this DEP.
# Changelog
[changelog]: #changelog
As of March 2018, Mathias Buus (@mafintosh) is leading development of a hyperdb
nodejs module on [github](https://github.com/mafintosh/hyperdb), which is the
basis for this DEP.
- 2017-12-06: Stephen Whitmore (@noffle) publishes `ARCHITECTURE.md` overview
in the [hyperdb github repo][arch_md]
- 2018-03-04: First draft for review
- 2018-03-15: Hyperdb v3.0.0 is released
- 2018-04-18: This DEP submitted for Draft review.
- 2018-05-06: Merged as Draft after WG approval.
[arch_md]: https://github.com/mafintosh/hyperdb/blob/master/ARCHITECTURE.md
|