aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@archive.org>2019-04-14 18:17:05 -0700
committerBryan Newbold <bnewbold@archive.org>2019-04-14 18:17:05 -0700
commit8843802779baad62d4732668930a4ef446b70f43 (patch)
tree95f81f471b99ace2a51c42ab10388da623003a34
parentcee4a5393132cc1bfb4733281fd5c46392a9586f (diff)
downloadbnewnet-8843802779baad62d4732668930a4ef446b70f43.tar.gz
bnewnet-8843802779baad62d4732668930a4ef446b70f43.zip
update merkle design
-rw-r--r--posts/merkle-design.md72
1 files changed, 62 insertions, 10 deletions
diff --git a/posts/merkle-design.md b/posts/merkle-design.md
index dc1e271..e21da1a 100644
--- a/posts/merkle-design.md
+++ b/posts/merkle-design.md
@@ -1,19 +1,32 @@
-Title: Design Considerations for Merkle-Tree Storage Systems
+Title: Nice Properties for Universal File Systems
Author: bnewbold
Date: 2018-06-10
Tags: tech, dweb
-Status: draft
-## Features we Want
+<!-- Title: Design Considerations for Merkle-Tree Storage Systems -->
-Four semantic properties we might want from a universal
-content-addressiblestorage system:
+This post digs in to some of the finer differences between content-addressible
+file storage systems built on [Merkle trees][wp-merkle]: what fixity and
+de-duplication properties do systems like git, IPFS, and Dat provide, and what
+might we desire from future systems?
-**1. Deterministic file-level addressability, enabling file-level efficiency
-on-wire and at rest.** If every distinct file can be identified by only a
-single, reproducible name, then discovery, indexing, and de-duplicaiton is made
-easier. If the same file can end up with different names, then that file might
-be transfered or stored separately by default; this creates pressure for the
+My interest in these systems is as infrastructure for the commons of cultural
+and intellectual works: making it cheap and easy to publish content that
+becomes easily accessible and woven in to the web of reference and derivative
+works. From my work at the Internet Archive collecting Open Access
+publications and datasets, I have a particular interest in dereferencable links
+(and/or citations) that will work in the future,
+
+[wp-merkle]: https://en.wikipedia.org/wiki/Merkle_tree
+
+Four properties we might want from a universal content-addressible storage
+system:
+
+**1. Deterministic file-level addressability**, for efficiency at rest and over
+the wire. If every distinct file can be identified by only a single,
+reproducible name, then discovery, indexing, and de-duplicaiton is made easier.
+If the same file can end up with different names, then that file might be
+transfered or stored separately by default; this creates pressure for the
application layer to support the concept of "many identifiers for the same
file", and requires additional coordination at scale.
@@ -70,6 +83,9 @@ deterministic, content-aware hashing is supposed enable both features at the
same time, which is exciting, but seems to have been abandoned by all existing
implementations because it's too slow.
+Feature #4 is frequently lost due to protection against "second pre-image"
+attacks against Merkle trees, which arise when XXX
+
## Tangled Hierarchies
git and other versioned storage systems are like catnip to programmers:
@@ -105,3 +121,39 @@ If we had a super-duper merkle-tree mechanism for storing files, and a
consistent way of serializing it to a single file, we write it directly to our
disk block devices, backup and synchronize file systems efficiently, etc.
+## REFS
+
+Best rabin description:
+
+ https://restic.net/blog/2015-09-12/restic-foundation1-cdc
+
+Slides claiming 6 GB/sec for FastCDC rabin-like chunking:
+
+ https://www.usenix.org/sites/default/files/conference/protected-files/atc16_slides_xia.pdf
+
+rsync wikipedia:
+
+ "Some compression programs, such as gzip, provide a special "rsyncable"
+ mode which allows these files to be efficiently rsynced, by ensuring that
+ local changes in the uncompressed file yield only local changes in the
+ compressed file."
+
+## TODO
+
+https://en.wikipedia.org/wiki/Casync
+https://en.wikipedia.org/wiki/Rsync#Variations
+https://en.wikipedia.org/wiki/Dar_(disk_archiver)
+https://en.wikipedia.org/wiki/Comparison_of_backup_software
+
+rolling hash:
+https://github.com/dbaarda/rollsum-tests
+https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/
+https://restic.net/
+https://github.com/restic/chunker
+http://www.xmailserver.org/rabin_apps.pdf
+https://github.com/YADL/yadl/wiki
+https://en.wikipedia.org/wiki/Rolling_hash
+https://www.usenix.org/sites/default/files/conference/protected-files/atc16_slides_xia.pdf
+https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf
+
+https://bentrask.com/?q=hash://sha256/d9adca0ff04cefd44430173ac5460ebb23f5cfbc7637a295e391e050ab537ae9