Title: Nice Properties for Universal File Systems
Author: bnewbold
Date: 2018-06-10
Tags: tech, dweb
Status: draft

This post digs into some of the finer differences between content-addressable file storage systems built on [Merkle trees][wp-merkle]: what fixity and de-duplication properties do systems like git, IPFS, and Dat provide, and what might we desire from future systems?

My interest in these systems is as infrastructure for the commons of cultural and intellectual works: making it cheap and easy to publish content that becomes easily accessible and woven into the web of reference and derivative works. From my work at the Internet Archive collecting Open Access publications and datasets, I have a particular interest in dereferenceable links (and/or citations) that will continue to work in the future.

[wp-merkle]: https://en.wikipedia.org/wiki/Merkle_tree

Four properties we might want from a universal content-addressable storage system:

**1. Deterministic file-level addressability**, for efficiency at rest and over the wire. If every distinct file can be identified by only a single, reproducible name, then discovery, indexing, and de-duplication are made easier. If the same file can end up with different names, then that file might be transferred or stored separately by default; this creates pressure for the application layer to support the concept of "many identifiers for the same file", and requires additional coordination at scale.

**2. Simple chunk-level de-duplication between files and versions.** This means that if you have two large files with only a few bytes *changed* between them, you don't need both full copies, only data proportional to the difference. A special case is when an existing file is copied and modified; in this case the system can track the history and the change. This is distinct from adding two files at the same time and detecting that there is a large degree of overlap.

**3. Offset-independent, chunk-level de-duplication between files.** Stronger than #2, this method is efficient even if the difference between two files is one of inserting (or deleting) a run of bytes, shifting the *offset* of everything that follows; the challenge of detecting that two files are identical except for an offset is harder than that of identifying identical bytes at the same locations.

**4. File-level interoperability with legacy and future systems.** Can the system be used as a transparent "layer" by other systems? Eg, can a thin proxy be implemented on top of existing file systems and blob stores? Can thin file-system and blob-store gateways be layered on top of the storage system? A common source of friction here is when generic, off-the-shelf full-file hashes like SHA-1, SHA-256, or BLAKE2b are not used in a common manner.

This last one doesn't matter if you are planning on Total World Domination with no need for future upgrades.

## Existing Implementations

git nails #1 (at the cost of not having an upgrade path for the hash function). It contains implementation work-arounds for #2 and #3: an internal compression format allows storing and transmitting only the diffs between two versions of a file, instead of the full files themselves. This isn't baked in to the structure of the system though, and doesn't always work (in particular, it seems to get skipped for large files). By using SHA-1, it gets very close to #4, but decided to prepend the length of a file to the file's contents themselves before hashing, so the git address of a blob does not match the usual SHA-1 of the file.
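To make that last point concrete, here is a minimal sketch (Python, standard library only) of how git derives a blob's object ID: it hashes a `blob <length>\0` header followed by the file's bytes, so the result never matches the plain SHA-1 of the same content.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Object ID git assigns to a file's contents (a "blob")."""
    # git hashes a "blob <size>\0" header plus the raw bytes, so the
    # result differs from the plain SHA-1 of the file itself.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def plain_sha1(content: bytes) -> str:
    """What most other tools would compute for the same file."""
    return hashlib.sha1(content).hexdigest()

if __name__ == "__main__":
    data = b"hello world\n"
    print("git blob ID:", git_blob_id(data))
    print("plain SHA-1:", plain_sha1(data))
```

The `git_blob_id` value is what `git hash-object` reports for the same file; any system wanting file-level interoperability with git has to either adopt this convention or maintain a mapping between the two names.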
The Dat protocol provides a weak version of #2, but no existing implementation actually implements any form of de-duplication, even at the full-file level. Eg, if you delete a file from a Dat archive and then re-add it later, the file contents are duplicated in the content feed, even though the standard would allow pointing back to the previous copy.

IPFS has a weak version of #1: the file digest is deterministic only if the same software version and configuration (in particular, the chunking parameters) are used.

## Challenges in Implementing These Features

Features #1 and #3 seem very difficult to reconcile. A frequent trick to compress deltas between files is to take history into account, but using history makes the resulting hash (name) history-dependent. Robust, deterministic, content-aware hashing is supposed to enable both features at the same time, which is exciting, but seems to have been abandoned by all existing implementations because it's too slow (a rough sketch of the idea appears after the references below).

Feature #4 is frequently lost due to protection against "second pre-image" attacks against Merkle trees, which arise when XXX

## Tangled Hierarchies

git and other versioned storage systems are like catnip to programmers: folks love to think about re-inventing "everything" on top of such a system. I think this is because git supplies specific semantic features people love, while being deeply entangled with files and file systems. Computer engineering is All About Files, and git is both made out of files (look in .git; it's simple files and directories all the way down!) and accommodating of files.

Consider:

- on UNIX systems, a block storage device is a fixed-size bytestream; a big file, if you will. File systems on top of this are like an archive file format (eg, tar, zip).
- disk partitioning schemes (like GPT) and volume managers (like LVM) are basically the same thing as file archive formats (like .tar)
- a hypercore feed (which Dat is built upon) is a single long append-only bytestream: a growing file, if you will, and hyperdrive is a file system (or file format) on top of that.

There's a tangled hierarchy here, in the same way that (at least on UNIX) one can create any variation of:

- a file...
- in an archive file-format (like .zip)...
- stored in a file-system (like ext4 or ISO)...
- serialized into a binary file...
- on another file system (perhaps NTFS)...
- residing in a partition...
- on a block device.

If we had a super-duper merkle-tree mechanism for storing files, and a consistent way of serializing it to a single file, we could write it directly to our disk block devices, back up and synchronize file systems efficiently, etc.

## REFS

Best rabin description: https://restic.net/blog/2015-09-12/restic-foundation1-cdc

Slides claiming 6 GB/sec for FastCDC rabin-like chunking: https://www.usenix.org/sites/default/files/conference/protected-files/atc16_slides_xia.pdf

rsync wikipedia: "Some compression programs, such as gzip, provide a special "rsyncable" mode which allows these files to be efficiently rsynced, by ensuring that local changes in the uncompressed file yield only local changes in the compressed file."
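As a rough illustration of the content-defined chunking idea referenced above, here is a minimal sketch in Python. The window size, polynomial rolling hash, and target chunk size are arbitrary illustrative choices, not the parameters used by Rabin fingerprinting, restic's chunker, or FastCDC; it also skips the minimum/maximum chunk sizes real chunkers enforce. The point is only that cut decisions depend on local content rather than absolute position, which is what makes de-duplication offset-independent (#3).

```python
import hashlib

WINDOW = 48           # bytes in the rolling-hash window (illustrative)
BASE = 257            # rolling hash base (illustrative)
MOD = 1 << 32
MASK = (1 << 12) - 1  # cut when the low 12 bits are all ones: ~4 KiB average chunks

def chunks(data: bytes):
    """Split data at content-defined boundaries (no min/max chunk size)."""
    out = pow(BASE, WINDOW, MOD)  # weight of the byte leaving the window
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * out) % MOD
        # The cut decision depends only on the last WINDOW bytes, so inserting
        # bytes early in a file does not shift every later boundary.
        if (h & MASK) == MASK:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

def chunk_ids(data: bytes):
    """Content-address each chunk; identical chunks share a name across files."""
    return {hashlib.sha256(c).hexdigest() for c in chunks(data)}

if __name__ == "__main__":
    import random
    random.seed(0)
    a = bytes(random.getrandbits(8) for _ in range(200_000))
    b = a[:1_000] + b"a few inserted bytes" + a[1_000:]  # same content, shifted
    ids_a, ids_b = chunk_ids(a), chunk_ids(b)
    print(f"{len(ids_a & ids_b)} of {len(ids_a)} chunks shared despite the insertion")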
## TODO

https://en.wikipedia.org/wiki/Casync

https://en.wikipedia.org/wiki/Rsync#Variations

https://en.wikipedia.org/wiki/Dar_(disk_archiver)

https://en.wikipedia.org/wiki/Comparison_of_backup_software

rolling hash: https://github.com/dbaarda/rollsum-tests

https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/

https://restic.net/

https://github.com/restic/chunker

http://www.xmailserver.org/rabin_apps.pdf

https://github.com/YADL/yadl/wiki

https://en.wikipedia.org/wiki/Rolling_hash

https://www.usenix.org/sites/default/files/conference/protected-files/atc16_slides_xia.pdf

https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf

https://bentrask.com/?q=hash://sha256/d9adca0ff04cefd44430173ac5460ebb23f5cfbc7637a295e391e050ab537ae9