Title: Design Considerations for Merkle-Tree Storage Systems
Author: bnewbold
Date: 2018-06-10
Tags: tech, dweb
Status: draft

## Features we Want

Four semantic properties we might want from a universal content-addressable
storage system:

**1. Deterministic file-level addressability, enabling file-level efficiency
on-wire and at rest.** If every distinct file can be identified by only a
single, reproducible name, then discovery, indexing, and de-duplication are
made easier. If the same file can end up with different names, then that file
might be transferred or stored separately by default; this creates pressure
for the application layer to support the concept of "many identifiers for the
same file", and requires additional coordination at scale.

**2. Simple chunk-level de-duplication between files and versions.** This
means that if you have two large files with only a few bytes *changed* between
them, you don't need both full copies, only data proportional to the
difference. A special case is when an existing file is copied and modified; in
this case the system can track the history and the change. This is distinct
from adding two files at the same time and detecting that there is a large
degree of overlap.

**3. Offset-independent, chunk-level de-duplication between files.** Stronger
than #2, this method is efficient even if the difference between two files is
the insertion (or deletion) of an *offset* of bytes; detecting that two files
are identical except for an offset is harder than identifying identical bytes
at the same locations.

**4. File-level interoperability with legacy and future systems.** Can the
system be used as a transparent "layer" by other systems? Eg, can a thin proxy
be implemented on top of existing file systems and blob stores? Can thin
file-system and blob-store gateways be layered on top of the storage system? A
common source of friction here is when generic, off-the-shelf full-file hashes
like SHA-1, SHA-256, or BLAKE2b are not used in a common manner.

This last one doesn't matter if you are planning on Total World Domination
with no need for future upgrades.

## Existing Implementations

git nails #1 (at the cost of not having an upgrade path for the hash
function). It contains implementation work-arounds for #2 and #3: an internal
compression format allows storing and transmitting only the diffs between two
versions of a file, instead of the full files themselves. This isn't baked in
to the structure of the system though, and doesn't always work (in particular,
it seems to get skipped for large files). By using SHA-1, git gets very close
to #4, but it prepends the length of a file to the file's contents before
hashing, so the git address of a blob does not match the usual SHA-1 of the
file.
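To make that last point concrete, here is a minimal Python sketch (my own
illustration, not code from git itself; the helper names are invented)
contrasting the plain SHA-1 of a file's bytes with git's blob address, which
hashes a `blob <length>\0` header followed by the contents:

```python
import hashlib

def plain_sha1(data: bytes) -> str:
    # The "usual" full-file hash that sha1sum or another tool would compute
    return hashlib.sha1(data).hexdigest()

def git_blob_sha1(data: bytes) -> str:
    # git hashes a "blob <length>\0" header plus the contents, so a git
    # blob address never matches the plain SHA-1 of the same file
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

data = b"hello world\n"
print(plain_sha1(data))     # 22596363b3de40b06f981fb85d82312e8c0ed511
print(git_blob_sha1(data))  # 3b18e512dbb1c7e14800c91d7d8dbff270e08e01
```

One consequence is that a git object store can't directly answer "do you have
the file with this SHA-1?" for digests computed by other tools, which is
exactly the #4 friction described above.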
The Dat protocol provides a weak version of #2, but no existing implementation
actually implements any form of de-duplication, even at the full-file level.
Eg, if you delete a file from a Dat archive and then re-add it later, the file
contents are duplicated in the content feed, even though the standard would
allow pointing back to the previous copy.

IPFS has a weak version of #1: the file digest is deterministic only if the
same software version and configuration are used, because the address depends
on how the file was chunked into blocks, not just on the file's bytes.

## Challenges in Implementing These Features

Features #1 and #3 seem very difficult to reconcile. A frequent trick to
compress deltas between files is to take history into account, but using
history makes the resulting hash (name) history-dependent. Robust,
deterministic, content-aware hashing is supposed to enable both features at
the same time, which is exciting, but seems to have been abandoned by all
existing implementations because it's too slow (a toy sketch of this kind of
chunking appears at the end of this post).

## Tangled Hierarchies

git and other versioned storage systems are like catnip to programmers: folks
love to think about re-inventing "everything" on top of such a system. I think
this is because git supplies specific semantic features people love, while
being deeply entangled with files and file systems. Computer engineering is
All About Files, and git is both made out of files (look in .git; it's simple
files and directories all the way down!) and accommodating of files.

Consider:

- on UNIX systems, a block storage device is a fixed-size bytestream; a big
  file, if you will. File systems on top of this are like an archive file
  format (eg, tar, zip).
- disk partitioning schemes (like GPT) and volume managers (like LVM) are
  basically the same thing as file archive formats (like .tar)
- a hypercore feed (which Dat is built upon) is a single long append-only
  bytestream: a growing file, if you will, and hyperdrive is a file system (or
  file format) on top of that.

There's a tangled hierarchy here, in the same way that (at least on UNIX), one
can create any variation of:

- a file...
- in an archive file-format (like .zip)...
- stored in a file-system (like ext4 or ISO)...
- serialized into a binary file...
- on another file system (perhaps NTFS)...
- residing in a partition...
- on a block device.

If we had a super-duper merkle-tree mechanism for storing files, and a
consistent way of serializing it to a single file, we could write it directly
to our disk block devices, back up and synchronize file systems efficiently,
etc.
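As promised above, here is a toy Python sketch of offset-independent,
content-defined chunking, the mechanism behind feature #3. Everything here is
invented for illustration (the rolling hash, the mask, the test data); it is
not any real system's algorithm. Because chunk boundaries are chosen from a
small window of local content rather than absolute file offsets, inserting
bytes at the front of a file only disturbs the first chunk or so:

```python
import hashlib
import random

MASK = (1 << 11) - 1  # low 11 bits set: ~2 KiB average chunks (toy tuning)

def chunks(data: bytes):
    # Cut a boundary wherever the low bits of a rolling hash are all ones.
    # Old bytes shift out of the 32-bit state after ~32 steps, so boundaries
    # depend only on nearby content, not on position within the file.
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF
        if (h & MASK) == MASK:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

def addresses(data: bytes):
    # Chunk-level names: the hash of each chunk's bytes
    return {hashlib.sha256(c).hexdigest() for c in chunks(data)}

random.seed(0)
original = bytes(random.randrange(256) for _ in range(100_000))
shifted = b"some inserted bytes" + original  # same content at an offset

a, b = addresses(original), addresses(shifted)
print(f"{len(a & b)} of {len(a)} chunk addresses survive the insertion")
```

Real systems use stronger rolling hashes (Rabin fingerprints, buzhash) plus
minimum and maximum chunk sizes; the catch for feature #1 is that every one of
those parameters becomes part of the resulting addresses, so they must stay
fixed forever or determinism is lost.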