Title: Design Considerations for Merkle-Tree Storage Systems
Author: bnewbold
Date: 2018-06-10
Tags: tech, dweb
Status: draft

## Features we Want

Four semantic properties we might want from a universal content-addressable
storage system:

**1. Deterministic file-level addressability, enabling file-level efficiency
on-wire and at rest.** If every distinct file can be identified by only a
single, reproducible name, then discovery, indexing, and de-duplication are
made easier. If the same file can end up with different names, then that file
might be transferred or stored redundantly by default; this creates pressure
for the application layer to support the concept of "many identifiers for the
same file", and requires additional coordination at scale.

**2. Simple chunk-level de-duplication between files and versions.** This means
that if you have two large files with only a few bytes *changed* between them,
you don't need both full copies, only data proportional to the difference. A
special case is when an existing file is copied and modified; in this case the
system can track the history and store only the change. This is distinct from adding two files
at the same time and detecting that there is a large degree of overlap.
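
A sketch of the simplest form of this, using fixed-size chunks (the chunk size
and the dict-as-chunk-store are illustrative assumptions): when a few bytes are
changed in place, only the chunks containing them hash differently, and every
other chunk is already in the store.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # hypothetical fixed chunk size

def store_file(data, chunk_store):
    """Store a file as a list of chunk hashes, writing only unseen chunks.

    Two versions that differ by a few in-place byte changes share every
    chunk except the ones actually touched.
    """
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)  # no-op if already present
        refs.append(digest)
    return refs
```

Fixed offsets are exactly what breaks down under insertions and deletions,
which is the gap #3 addresses.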

**3. Offset-independent, chunk-level de-duplication between files.** Stronger
than #2, this property keeps de-duplication efficient even when the difference
between two files is the insertion (or deletion) of a run of bytes, shifting
everything after it by some *offset*; detecting that two files are identical
except for an offset is harder than detecting identical bytes at the same
locations.
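
The usual technique here is content-defined chunking: boundaries are chosen by
a rolling hash of the bytes themselves rather than by fixed offsets, so an
insertion only disturbs chunks near the edit and later boundaries
re-synchronize. A toy sketch (the window, mask, size limits, and the trivial
rolling sum are made-up parameters; real systems use Rabin fingerprints or
similar):

```python
WINDOW = 48            # bytes of context the rolling hash looks at
MASK = (1 << 12) - 1   # cut when the low 12 bits are zero: ~4 KiB average
MIN_CHUNK = 1 << 10
MAX_CHUNK = 1 << 16

def cut_points(data):
    """Chunk boundaries that depend on local content, not absolute offset."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h += byte              # naive rolling sum over the last WINDOW bytes
        if i >= WINDOW:
            h -= data[i - WINDOW]
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append((start, i + 1))
            start = i + 1
    if start < len(data):
        chunks.append((start, len(data)))
    return chunks
```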

**4. File-level interoperability with legacy and future systems.** Can the
system be used as a transparent "layer" by other systems? Eg, can a thin proxy
be implemented on top of existing file-systems and blob stores? Can thin
file-system and blob-store gateways be layered on top of the storage system? A
common source of friction here is when generic, off-the-shelf full-file hashes
like SHA-1, SHA-256, or BLAKE2b are not used in a common manner.

This last one doesn't matter if you are planning on Total World Domination with
no need for future upgrades.

## Existing Implementations

git nails #1 (at the cost of not having an upgrade path for the hash function).
It contains implementation work-arounds for #2 and #3: an internal compression
format (packfiles) allows storing and transmitting only the deltas between two
versions of a file, instead of the full files themselves. This isn't baked into
the structure of the system though, and doesn't always work (in particular, it
seems to get skipped for large files). By using SHA-1, it gets very close to
#4, but git prepends a short header (the object type and length) to a file's
contents before hashing, so the git address of a blob does not match the plain
SHA-1 of the file.
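
Concretely, git's blob id is a SHA-1 over a short header (object type and
decimal length, then a NUL byte) followed by the content, so it never matches
the plain SHA-1 of the same bytes:

```python
import hashlib

def plain_sha1(data):
    """What a generic tool (e.g. sha1sum) computes for the file."""
    return hashlib.sha1(data).hexdigest()

def git_blob_sha1(data):
    """What `git hash-object` computes: SHA-1 of a 'blob <len>' header,
    a NUL byte, and then the content."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

data = b"hello\n"
print(plain_sha1(data))     # f572d396fae9206628714fb2ce00f72e94f2258f
print(git_blob_sha1(data))  # ce013625030ba8dba906f756967f9e9ca394464a
```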

The Dat protocol provides a weak version of #2, but no existing implementation
actually implements any form of de-duplication, even at the full-file level.
Eg, if you delete a file from a Dat archive and then re-add it later, the file
contents are duplicated in the content feed, even though the standard would
allow pointing back to the previous copy.

IPFS has a weak version of #1: the file digest is deterministic only if the
same software version and configuration are used, because the chunking
parameters and DAG layout feed into the resulting hash; the same file imported
with different settings ends up with a different address.

## Challenges in Implementing These Features

Features #1 and #3 seem very difficult to reconcile. A frequent trick to
compress deltas between files is to take history into account, but using
history makes the resulting hash (name) history-dependent. Robust,
deterministic, content-aware hashing is supposed to enable both features at the
same time, which is exciting, but seems to have been abandoned by all existing
implementations because it's too slow.
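
The hoped-for reconciliation looks roughly like: name a file by the merkle
root over its content-defined chunk hashes. If everyone pins down the chunking
parameters and the tree layout as part of the format, the name is
deterministic (#1) while the chunks still de-duplicate across offsets (#3). A
sketch of such a root computation (the binary fan-out and odd-node handling
are invented for illustration; real formats differ):

```python
import hashlib

def merkle_root(chunk_digests):
    """Fold a list of chunk digests into a single root digest, pairwise."""
    if not chunk_digests:
        return hashlib.sha256(b"").digest()
    level = list(chunk_digests)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```

The determinism only holds if the chunker and layout are fixed for everyone,
which is the coordination cost the paragraph above alludes to.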

## Tangled Hierarchies

git and other versioned storage systems are like catnip to programmers:
folks love to think about re-inventing "everything" on top of such a system. I
think this is because git supplies specific semantic features people love,
while being deeply entangled with files and file systems. Computer engineering
is All About Files, and git is both made out of files (look in .git; it's
simple files and directories all the way down!) and accommodating files.

Consider:

- on UNIX systems, a block storage device is a fixed size bytestream; a big
  file, if you will. File systems on top of this are like an archive file
  format (eg, tar, zip).
- disk partitioning schemes (like GPT) and volume managers (like LVM) are
  basically the same thing as file archive formats (like .tar).
- a hypercore feed (which Dat is built upon) is a single long append-only
  bytestream: a growing file, if you will, and hyperdrive is a file system (or
  file format) on top of that.

There's a tangled hierarchy here, in the same way that (at least on UNIX) one
can create any variation of:

- a file...
- in an archive file-format (like .zip)...
- stored in a file-system (like ext4 or ISO)...
- serialized into a binary file...
- on another file system (perhaps NTFS)...
- residing in a partition...
- on a block device.

If we had a super-duper merkle-tree mechanism for storing files, and a
consistent way of serializing it to a single file, we could write it directly
to our disk block devices, back up and synchronize file systems efficiently,
etc.