Title: Nice Properties for Universal File Systems
Author: bnewbold
Date: 2018-06-10
Tags: tech, dweb

<!-- Title: Design Considerations for Merkle-Tree Storage Systems -->

This post digs into some of the finer differences between content-addressable
file storage systems built on [Merkle trees][wp-merkle]: what fixity and
de-duplication properties do systems like git, IPFS, and Dat provide, and what
might we desire from future systems?

My interest in these systems is as infrastructure for the commons of cultural
and intellectual works: making it cheap and easy to publish content that
becomes easily accessible and woven into the web of reference and derivative
works. From my work at the Internet Archive collecting Open Access
publications and datasets, I have a particular interest in dereferenceable
links (and/or citations) that will continue to work in the future.

[wp-merkle]: https://en.wikipedia.org/wiki/Merkle_tree

Four properties we might want from a universal content-addressable storage
system:

**1. Deterministic file-level addressability**, for efficiency at rest and over
the wire. If every distinct file can be identified by only a single,
reproducible name, then discovery, indexing, and de-duplication are all made
easier. If the same file can end up with different names, then that file might
be transferred or stored separately by default; this creates pressure for the
application layer to support the concept of "many identifiers for the same
file", and requires additional coordination at scale.

**2. Simple chunk-level de-duplication between files and versions.** This means
that if you have two large files with only a few bytes *changed* between them,
you don't need to store both full copies, only data proportional to the
difference. A special case is when an existing file is copied and modified; in
this case the system can track the copy and the change as history. This is
distinct from adding two files at the same time and detecting that there is a
large degree of overlap.
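
A hypothetical illustration of this same-offset case: split every version into
fixed-size chunks, address each chunk by its hash, and store each distinct
chunk only once (the chunk size and hash below are arbitrary choices, not any
real system's parameters):

    import hashlib
    import os

    CHUNK_SIZE = 64 * 1024  # arbitrary fixed chunk size, for illustration

    def chunk_hashes(data: bytes) -> list[str]:
        """Split data into fixed-size chunks and hash each one."""
        return [
            hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)
        ]

    def store(versions: list[bytes]) -> dict[str, bytes]:
        """Store several versions, keeping one copy of each distinct chunk."""
        blobs: dict[str, bytes] = {}
        for data in versions:
            for i, digest in enumerate(chunk_hashes(data)):
                blobs.setdefault(digest, data[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE])
        return blobs

    # Two versions differing by one changed (not inserted) byte share almost
    # all of their chunks; the second version costs roughly one extra chunk.
    v1 = os.urandom(1_000_000)
    v2 = bytearray(v1)
    v2[500_000] ^= 0xFF
    assert len(store([v1, bytes(v2)])) == len(chunk_hashes(v1)) + 1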

**3. Offset-independent, chunk-level de-duplication between files.** Stronger
than #2, this method is efficient even if the difference between two files is
the insertion (or deletion) of an *offset* of bytes; detecting that two files
are identical except for an offset is harder than detecting identical bytes at
the same locations.
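
One way to get offset-independence is content-defined chunking with a rolling
hash, in the spirit of the Rabin and FastCDC schemes linked in the references
below. This is a simplified sketch using a FastCDC-style "gear" hash, not any
system's actual chunker:

    import hashlib
    import random

    # A 256-entry table of pseudo-random 64-bit values ("gear" hash table).
    random.seed(0)
    GEAR = [random.getrandbits(64) for _ in range(256)]

    AVG_CHUNK_BITS = 13                       # target average chunk ~8 KiB
    BOUNDARY_MASK = (1 << AVG_CHUNK_BITS) - 1

    def content_defined_chunks(data: bytes) -> list[bytes]:
        """Cut boundaries where a rolling hash of the most recent bytes hits
        a fixed pattern, so boundaries depend on content, not offsets."""
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
            if (h & BOUNDARY_MASK) == 0:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    # Inserting bytes near the front shifts every later offset, but only the
    # chunk around the insertion changes; later boundaries re-synchronize.
    original = random.randbytes(500_000)
    edited = original[:100] + b"inserted bytes" + original[100:]
    a = {hashlib.sha256(c).hexdigest() for c in content_defined_chunks(original)}
    b = {hashlib.sha256(c).hexdigest() for c in content_defined_chunks(edited)}
    assert len(a & b) > 0.9 * len(a)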

**4. File-level interoperability with legacy and future systems.** Can the
system be used as a transparent "layer" by other systems? Eg, can a thin proxy
be implemented on top of existing file systems and blob stores? Can thin
file-system and blob-store gateways be layered on top of the storage system? A
common source of friction here is when generic, off-the-shelf full-file hashes
like SHA-1, SHA-256, or BLAKE2b are not used in a common manner.

This last one doesn't matter if you are planning on Total World Domination with
no need for future upgrades.

## Existing Implementations

git nails #1 (at the cost of not having an upgrade path for the hash function).
It contains implementation work-arounds for #2 and #3: an internal compression
format allows storing and transmitting only the deltas between two versions of
a file, instead of the full files themselves. This isn't baked into the
structure of the system though, and doesn't always work (in particular, it
seems to get skipped for large files). By using SHA-1, git gets very close to
#4, but it prepends a short header (the object type and length) to a file's
contents before hashing, so the git address of a blob does not match the usual
SHA-1 of the file.
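
The divergence is easy to demonstrate: a git blob identifier is the SHA-1 of a
`blob <length>\0` header followed by the contents, not of the contents alone.

    import hashlib

    def plain_sha1(data: bytes) -> str:
        return hashlib.sha1(data).hexdigest()

    def git_blob_sha1(data: bytes) -> str:
        """git hashes 'blob <size>\\0' + contents, so blob IDs differ from
        the plain SHA-1 of the same bytes."""
        header = b"blob %d\x00" % len(data)
        return hashlib.sha1(header + data).hexdigest()

    data = b"hello world\n"
    print(plain_sha1(data))     # 22596363b3de40b06f981fb85d82312e8c0ed511
    print(git_blob_sha1(data))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad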

The Dat protocol provides a weak version of #2, but no existing implementation
actually implements any form of de-duplication, even at the full-file level.
Eg, if you delete a file from a Dat archive and then re-add it later, the file
contents are duplicated in the content feed, even though the standard would
allow pointing back to the previous copy.

IPFS has a weak version of #1: the file digest is deterministic only if the
same software version, chunking strategy, and configuration are used, so the
same file can end up with different addresses in practice.

## Challenges in Implementing These Features

Features #1 and #3 seem very difficult to reconcile. A frequent trick to
compress deltas between files is to take history into account, but using
history makes the resulting hash (name) history-dependent. Robust,
deterministic, content-aware hashing is supposed to enable both features at the
same time, which is exciting, but seems to have been abandoned by all existing
implementations because it's too slow.

Feature #4 is frequently lost due to protections against "second pre-image"
attacks on Merkle trees, which arise when an attacker can pass off an interior
node as a leaf (or vice versa) and produce a different input with the same
root. The usual defense is to prefix leaf and interior-node inputs differently
before hashing, which means the hash of a leaf no longer matches the plain
full-file digest.
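
A sketch of why this protection breaks file-level interoperability, using the
leaf/interior prefix convention popularized by RFC 6962-style trees (assumed
here for illustration; exact prefixes vary by system):

    import hashlib

    LEAF_PREFIX = b"\x00"      # domain separation for leaf (data) nodes
    INTERIOR_PREFIX = b"\x01"  # domain separation for interior nodes

    def leaf_hash(chunk: bytes) -> bytes:
        return hashlib.sha256(LEAF_PREFIX + chunk).digest()

    def interior_hash(left: bytes, right: bytes) -> bytes:
        return hashlib.sha256(INTERIOR_PREFIX + left + right).digest()

    def merkle_root(chunks: list[bytes]) -> bytes:
        """Fold chunk hashes pairwise into a root, with domain separation
        preventing an interior node from being passed off as a leaf."""
        level = [leaf_hash(c) for c in chunks]
        while len(level) > 1:
            nxt = [interior_hash(level[i], level[i + 1])
                   for i in range(0, len(level) - 1, 2)]
            if len(level) % 2:        # odd node is carried up unchanged
                nxt.append(level[-1])
            level = nxt
        return level[0]

    # Even a file small enough to fit in a single chunk no longer hashes to
    # its plain SHA-256, so tree addresses can't line up with off-the-shelf
    # full-file hashes.
    small_file = b"an entire small file, in a single chunk"
    assert merkle_root([small_file]) != hashlib.sha256(small_file).digest()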

## Tangled Hierarchies

git and other versioned storage systems are like catnip to programmers:
folks love to think about re-inventing "everything" on top of such a system. I
think this is because git supplies specific semantic features people love,
while being deeply entangled with files and file systems. Computer engineering
is All About Files, and git is both made out of files (look in .git; it's
simple files and directories all the way down!) and accommodating of files.

Consider:

- on UNIX systems, a block storage device is a fixed size bytestream; a big
  file, if you will. File systems on top of this are like an archive file
  format (eg, tar, zip).
- disk partitioning schemes (like GPT) and volume managers (like LVM) are
  basically the same thing as file archive formats (like .tar).
- a hypercore feed (which Dat is built upon) is a single long append-only
  bytestream: a growing file, if you will, and hyperdrive is a file system (or
  file format) on top of that.

There's a tangled hierarchy here, in the same way that (at least on UNIX), one
can create any variation of:

- a file...
- in an archive file-format (like .zip)...
- stored in a file-system (like ext4 or ISO)...
- serialized into a binary file...
- on another file system (perhaps NTFS)...
- residing in a partition...
- on a block device.

If we had a super-duper merkle-tree mechanism for storing files, and a
consistent way of serializing it to a single file, we could write it directly
to our disk block devices, back up and synchronize file systems efficiently,
etc.

## REFS

Best rabin description:

    https://restic.net/blog/2015-09-12/restic-foundation1-cdc

Slides claiming 6 GB/sec for FastCDC rabin-like chunking:

    https://www.usenix.org/sites/default/files/conference/protected-files/atc16_slides_xia.pdf

rsync wikipedia:

    "Some compression programs, such as gzip, provide a special "rsyncable"
    mode which allows these files to be efficiently rsynced, by ensuring that
    local changes in the uncompressed file yield only local changes in the
    compressed file."

## TODO

https://en.wikipedia.org/wiki/Casync
https://en.wikipedia.org/wiki/Rsync#Variations
https://en.wikipedia.org/wiki/Dar_(disk_archiver)
https://en.wikipedia.org/wiki/Comparison_of_backup_software

rolling hash:
https://github.com/dbaarda/rollsum-tests
https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/
https://restic.net/
https://github.com/restic/chunker
http://www.xmailserver.org/rabin_apps.pdf
https://github.com/YADL/yadl/wiki
https://en.wikipedia.org/wiki/Rolling_hash
https://www.usenix.org/sites/default/files/conference/protected-files/atc16_slides_xia.pdf
https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf

https://bentrask.com/?q=hash://sha256/d9adca0ff04cefd44430173ac5460ebb23f5cfbc7637a295e391e050ab537ae9