From 8843802779baad62d4732668930a4ef446b70f43 Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@archive.org>
Date: Sun, 14 Apr 2019 18:17:05 -0700
Subject: update merkle design

---
 posts/merkle-design.md | 72 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 62 insertions(+), 10 deletions(-)

diff --git a/posts/merkle-design.md b/posts/merkle-design.md
index dc1e271..e21da1a 100644
--- a/posts/merkle-design.md
+++ b/posts/merkle-design.md
@@ -1,19 +1,32 @@
-Title: Design Considerations for Merkle-Tree Storage Systems
+Title: Nice Properties for Universal File Systems
 Author: bnewbold
 Date: 2018-06-10
 Tags: tech, dweb
-Status: draft
 
-## Features we Want
+<!-- Title: Design Considerations for Merkle-Tree Storage Systems -->
 
-Four semantic properties we might want from a universal
-content-addressiblestorage system:
+This post digs in to some of the finer differences between content-addressible
+file storage systems built on [Merkle trees][wp-merkle]: what fixity and
+de-duplication properties do systems like git, IPFS, and Dat provide, and what
+might we desire from future systems?
 
-**1. Deterministic file-level addressability, enabling file-level efficiency
-on-wire and at rest.** If every distinct file can be identified by only a
-single, reproducible name, then discovery, indexing, and de-duplicaiton is made
-easier.  If the same file can end up with different names, then that file might
-be transfered or stored separately by default; this creates pressure for the
+My interest in these systems is as infrastructure for the commons of cultural
+and intellectual works: making it cheap and easy to publish content that
+becomes easily accessible and woven in to the web of reference and derivative
+works. From my work at the Internet Archive collecting Open Access
+publications and datasets, I have a particular interest in dereferencable links
+(and/or citations) that will work in the future,
+
+[wp-merkle]: https://en.wikipedia.org/wiki/Merkle_tree
+
+Four properties we might want from a universal content-addressible storage
+system:
+
+**1. Deterministic file-level addressability**, for efficiency at rest and over
+the wire.  If every distinct file can be identified by only a single,
+reproducible name, then discovery, indexing, and de-duplicaiton is made easier.
+If the same file can end up with different names, then that file might be
+transfered or stored separately by default; this creates pressure for the
 application layer to support the concept of "many identifiers for the same
 file", and requires additional coordination at scale.
 
@@ -70,6 +83,9 @@ deterministic, content-aware hashing is supposed enable both features at the
 same time, which is exciting, but seems to have been abandoned by all existing
 implementations because it's too slow.
 
+Feature #4 is frequently lost due to protection against "second pre-image"
+attacks against Merkle trees, which arise when XXX
+
 ## Tangled Hierarchies
 
 git and other versioned storage systems are like catnip to programmers:
@@ -105,3 +121,39 @@ If we had a super-duper merkle-tree mechanism for storing files, and a
 consistent way of serializing it to a single file, we write it directly to our
 disk block devices, backup and synchronize file systems efficiently, etc.
 
+## REFS
+
+Best rabin description:
+
+    https://restic.net/blog/2015-09-12/restic-foundation1-cdc
+
+Slides claiming 6 GB/sec for FastCDC rabin-like chunking:
+
+    https://www.usenix.org/sites/default/files/conference/protected-files/atc16_slides_xia.pdf
+
+rsync wikipedia:
+
+    "Some compression programs, such as gzip, provide a special "rsyncable"
+    mode which allows these files to be efficiently rsynced, by ensuring that
+    local changes in the uncompressed file yield only local changes in the
+    compressed file."
+
+## TODO
+
+https://en.wikipedia.org/wiki/Casync
+https://en.wikipedia.org/wiki/Rsync#Variations
+https://en.wikipedia.org/wiki/Dar_(disk_archiver)
+https://en.wikipedia.org/wiki/Comparison_of_backup_software
+
+rolling hash:
+https://github.com/dbaarda/rollsum-tests
+https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/
+https://restic.net/
+https://github.com/restic/chunker
+http://www.xmailserver.org/rabin_apps.pdf
+https://github.com/YADL/yadl/wiki
+https://en.wikipedia.org/wiki/Rolling_hash
+https://www.usenix.org/sites/default/files/conference/protected-files/atc16_slides_xia.pdf
+https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf
+
+https://bentrask.com/?q=hash://sha256/d9adca0ff04cefd44430173ac5460ebb23f5cfbc7637a295e391e050ab537ae9
-- 
cgit v1.2.3