From 8843802779baad62d4732668930a4ef446b70f43 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Sun, 14 Apr 2019 18:17:05 -0700 Subject: update merkle design --- posts/merkle-design.md | 72 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 62 insertions(+), 10 deletions(-) diff --git a/posts/merkle-design.md b/posts/merkle-design.md index dc1e271..e21da1a 100644 --- a/posts/merkle-design.md +++ b/posts/merkle-design.md @@ -1,19 +1,32 @@ -Title: Design Considerations for Merkle-Tree Storage Systems +Title: Nice Properties for Universal File Systems Author: bnewbold Date: 2018-06-10 Tags: tech, dweb -Status: draft -## Features we Want + -Four semantic properties we might want from a universal -content-addressiblestorage system: +This post digs in to some of the finer differences between content-addressible +file storage systems built on [Merkle trees][wp-merkle]: what fixity and +de-duplication properties do systems like git, IPFS, and Dat provide, and what +might we desire from future systems? -**1. Deterministic file-level addressability, enabling file-level efficiency -on-wire and at rest.** If every distinct file can be identified by only a -single, reproducible name, then discovery, indexing, and de-duplicaiton is made -easier. If the same file can end up with different names, then that file might -be transfered or stored separately by default; this creates pressure for the +My interest in these systems is as infrastructure for the commons of cultural +and intellectual works: making it cheap and easy to publish content that +becomes easily accessible and woven in to the web of reference and derivative +works. From my work at the Internet Archive collecting Open Access +publications and datasets, I have a particular interest in dereferencable links +(and/or citations) that will work in the future, + +[wp-merkle]: https://en.wikipedia.org/wiki/Merkle_tree + +Four properties we might want from a universal content-addressible storage +system: + +**1. Deterministic file-level addressability**, for efficiency at rest and over +the wire. If every distinct file can be identified by only a single, +reproducible name, then discovery, indexing, and de-duplicaiton is made easier. +If the same file can end up with different names, then that file might be +transfered or stored separately by default; this creates pressure for the application layer to support the concept of "many identifiers for the same file", and requires additional coordination at scale. @@ -70,6 +83,9 @@ deterministic, content-aware hashing is supposed enable both features at the same time, which is exciting, but seems to have been abandoned by all existing implementations because it's too slow. +Feature #4 is frequently lost due to protection against "second pre-image" +attacks against Merkle trees, which arise when XXX + ## Tangled Hierarchies git and other versioned storage systems are like catnip to programmers: @@ -105,3 +121,39 @@ If we had a super-duper merkle-tree mechanism for storing files, and a consistent way of serializing it to a single file, we write it directly to our disk block devices, backup and synchronize file systems efficiently, etc. +## REFS + +Best rabin description: + + https://restic.net/blog/2015-09-12/restic-foundation1-cdc + +Slides claiming 6 GB/sec for FastCDC rabin-like chunking: + + https://www.usenix.org/sites/default/files/conference/protected-files/atc16_slides_xia.pdf + +rsync wikipedia: + + "Some compression programs, such as gzip, provide a special "rsyncable" + mode which allows these files to be efficiently rsynced, by ensuring that + local changes in the uncompressed file yield only local changes in the + compressed file." + +## TODO + +https://en.wikipedia.org/wiki/Casync +https://en.wikipedia.org/wiki/Rsync#Variations +https://en.wikipedia.org/wiki/Dar_(disk_archiver) +https://en.wikipedia.org/wiki/Comparison_of_backup_software + +rolling hash: +https://github.com/dbaarda/rollsum-tests +https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/ +https://restic.net/ +https://github.com/restic/chunker +http://www.xmailserver.org/rabin_apps.pdf +https://github.com/YADL/yadl/wiki +https://en.wikipedia.org/wiki/Rolling_hash +https://www.usenix.org/sites/default/files/conference/protected-files/atc16_slides_xia.pdf +https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf + +https://bentrask.com/?q=hash://sha256/d9adca0ff04cefd44430173ac5460ebb23f5cfbc7637a295e391e050ab537ae9 -- cgit v1.2.3