Title: Design Considerations for Merkle-Tree Storage Systems
Author: bnewbold
Date: 2018-06-10
Tags: tech, dweb
Status: draft

## Features we Want

Four semantic properties we might want from a universal
content-addressable storage system:

**1. Deterministic file-level addressability, enabling file-level efficiency
on-wire and at rest.** If every distinct file can be identified by only a
single, reproducible name, then discovery, indexing, and de-duplication are
made easier. If the same file can end up with different names, then that file
might be transferred or stored separately by default; this creates pressure for
the application layer to support the concept of "many identifiers for the same
file", and requires additional coordination at scale.

**2. Simple chunk-level de-duplication between files and versions.** This means
that if you have two large files with only a few bytes *changed* between them,
you don't need both full copies, only data proportional to the difference. A
special case is when an existing file is copied and modified; in this case the
system can track the history and store only the change. This is distinct from
adding two files at the same time and detecting that there is a large degree of
overlap.
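
A minimal sketch of the idea, assuming a plain dict as the chunk store and an
arbitrary fixed chunk size:

```python
import hashlib

CHUNK_SIZE = 4096  # arbitrary fixed size, for illustration

def store_file(data: bytes, store: dict) -> list:
    """Store a file as a list of chunk hashes (a manifest).

    Chunks are keyed by their own hash, so a second version that
    changes a few bytes in place re-stores only the touched chunks.
    """
    manifest = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # no-op if chunk already stored
        manifest.append(digest)
    return manifest

store = {}
v1 = store_file(b"a" * 100_000, store)
v2 = store_file(b"a" * 50_000 + b"B" + b"a" * 49_999, store)
# v1 and v2 share every chunk except the one containing the edited byte
```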

**3. Offset-independent, chunk-level de-duplication between files.** Stronger
than #2, this method is efficient even if the difference between two files is
the insertion (or deletion) of a run of bytes, which shifts the *offset* of
everything that follows; detecting that two files are identical except for an
offset is harder than identifying identical bytes at the same locations.
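
The standard approach here is content-defined chunking: cut chunk boundaries
wherever a rolling hash over the bytes matches a pattern, so boundaries depend
on content rather than position. A toy Gear-style chunker (mask and hash
chosen arbitrarily) might look like:

```python
import hashlib
import os

MASK = (1 << 12) - 1  # targets ~4 KiB average chunks; chosen arbitrarily

def chunks(data: bytes):
    """Yield content-defined chunks using a toy Gear-style rolling hash.

    A boundary is cut whenever the low bits of the rolling hash are all
    zero. Boundaries depend on nearby bytes, not absolute offsets, so
    chunk hashes re-align shortly after an insertion or deletion.
    """
    h, start = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF
        if (h & MASK) == 0:
            yield data[start:i + 1]
            h, start = 0, i + 1
    if start < len(data):
        yield data[start:]

a = os.urandom(1 << 16)
b = a[:1000] + b"XYZ" + a[1000:]  # insertion shifts every later offset
ca = {hashlib.sha256(c).digest() for c in chunks(a)}
cb = {hashlib.sha256(c).digest() for c in chunks(b)}
# most entries of ca and cb coincide: shifted chunks still de-duplicate
```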

**4. File-level Interoperability with legacy and future systems.** Can the
system be used as a transparent "layer" by other systems? Eg, can a thin proxy
be implemented on top of existing file-systems and blob stores? Can thin
file-system and blob-store gateways be layered on top of the storage system? A
common source of friction here is when generic, off-the-shelf full-file hashes
like SHA-1, SHA-256, or BLAKE2b are not used in a common manner.

This last one doesn't matter if you are planning on Total World Domination with
no need for future upgrades.

## Existing Implementations

git nails #1 (at the cost of not having an upgrade path for the hash function).
It contains implementation work-arounds for #2 and #3: an internal compression
format allows storing and transmitting only the diffs between two versions of a
file, instead of the full files themselves. This isn't baked into the structure
of the system though, and doesn't always work (in particular, it seems to get
skipped for large files). By using SHA-1, it gets very close to #4, but git
prepends a short header (the object type and length) to a file's contents
before hashing, so the git address of a blob does not match the usual SHA-1 of
the file.
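
The divergence is easy to reproduce: a blob's git id is the SHA-1 of a
`blob <length>\0` header plus the contents, not of the contents alone:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute git's address for a blob: SHA-1 over header + contents."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

data = b"hello\n"
print(git_blob_id(data))               # ce01362503..., matches `git hash-object`
print(hashlib.sha1(data).hexdigest())  # f572d396fa..., the plain file hash
```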

The Dat protocol provides a weak version of #2, but no existing implementation
actually implements any form of de-duplication, even at the full-file level.
Eg, if you delete a file from a Dat archive and then re-add it later, the file
contents are duplicated in the content feed, even though the standard would
allow pointing back to the previous copy.
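
To make that concrete, here is a toy append-only content feed (not the real
hypercore/hyperdrive structures or APIs) in which re-adding known bytes returns
a pointer to the earlier region instead of appending a second copy:

```python
import hashlib

class ContentFeed:
    """Toy append-only feed; files are (offset, length) regions of it."""

    def __init__(self):
        self.feed = bytearray()
        self.seen = {}  # content hash -> (offset, length)

    def add(self, data: bytes) -> tuple:
        key = hashlib.sha256(data).digest()
        if key in self.seen:
            return self.seen[key]  # point back; nothing re-appended
        region = (len(self.feed), len(data))
        self.feed.extend(data)
        self.seen[key] = region
        return region

feed = ContentFeed()
first = feed.add(b"same bytes")  # appended to the feed
again = feed.add(b"same bytes")  # delete/re-add: same region returned
assert first == again and len(feed.feed) == len(b"same bytes")
```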

IPFS has a weak version of #1: the file digest is deterministic if the same
software version and configuration is used, but different chunking settings or
DAG layouts can yield different addresses for the same file bytes.
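
An illustrative two-level hash (not IPFS's actual CID construction) shows how
configuration leaks into the address: chunk the same bytes two different ways
and you get two different roots:

```python
import hashlib

def toy_root(data: bytes, chunk_size: int) -> str:
    """Hash fixed-size chunks, then hash the concatenated chunk hashes.
    Purely illustrative; real IPFS builds a UnixFS DAG, not this."""
    leaves = [hashlib.sha256(data[i:i + chunk_size]).digest()
              for i in range(0, len(data), chunk_size)]
    return hashlib.sha256(b"".join(leaves)).hexdigest()

data = bytes(1024 * 1024)  # 1 MiB of zeros
assert toy_root(data, 262144) != toy_root(data, 131072)  # one file, two names
```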

## Challenges in Implementing These Features

Features #1 and #3 seem very difficult to reconcile. A frequent trick to
compress deltas between files is to take history into account, but using
history makes the resulting hash (name) history-dependent. Robust,
deterministic, content-aware hashing is supposed to enable both features at the
same time, which is exciting, but seems to have been abandoned by all existing
implementations because it's too slow.
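
To see why the idea is attractive anyway, the earlier sketches compose: a
merkle-style root over content-defined chunks yields a deterministic name (#1)
whose pieces still de-duplicate across offset shifts (#3). A sketch,
parameterized on a chunker like the toy `chunks()` above:

```python
import hashlib
from typing import Callable, Iterable

def content_aware_address(
        data: bytes, chunker: Callable[[bytes], Iterable[bytes]]) -> str:
    """Merkle-style root over content-defined chunks.

    With a chunker whose boundaries depend only on the bytes (like the
    toy chunks() sketched earlier), the name is deterministic (#1) while
    shifted but unchanged regions still share chunk hashes (#3). The
    cost is a rolling hash over every byte: the slowness noted above.
    """
    leaves = [hashlib.sha256(c).digest() for c in chunker(data)]
    return hashlib.sha256(b"".join(leaves)).hexdigest()
```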

## Tangled Hierarchies

git and other versioned storage systems are like catnip to programmers:
folks love to think about re-inventing "everything" on top of such a system. I
think this is because git supplies specific semantic features people love,
while being deeply entangled with files and file systems. Computer engineering
is All About Files, and git is both made out of files (look in .git; it's
simple files and directories all the way down!) and accommodating of files.

Consider:

- on UNIX systems, a block storage device is a fixed-size bytestream; a big
  file, if you will. File systems on top of this are like an archive file
  format (eg, tar, zip).
- disk partitioning schemes (like GPT) and volume managers (like LVM) are
  basically the same thing as file archive formats (like .tar).
- a hypercore feed (which Dat is built upon) is a single long append-only
  bytestream: a growing file, if you will, and hyperdrive is a file system (or
  file format) on top of that.

There's a tangled hierarchy here, in the same way that (at least on UNIX), one
can create any variation of:

- a file...
- in an archive file-format (like .zip)...
- stored in a file-system (like ext4 or ISO)...
- serialized into a binary file...
- on another file system (perhaps NTFS)...
- residing in a partition...
- on a block device.

If we had a super-duper merkle-tree mechanism for storing files, and a
consistent way of serializing it to a single file, we could write it directly
to our disk block devices, back up and synchronize file systems efficiently,
etc.