From f122772b79da57fedfdb1f4a947d8303122c20ba Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Sat, 17 Mar 2018 21:16:58 -0700
Subject: hyperdrive hashes first draft

---
 proposals/0000-hyperdrive-hashes.md | 178 ++++++++++++++++++++++++++++++++++++
 1 file changed, 178 insertions(+)
 create mode 100644 proposals/0000-hyperdrive-hashes.md

diff --git a/proposals/0000-hyperdrive-hashes.md b/proposals/0000-hyperdrive-hashes.md
new file mode 100644
index 0000000..2455ed9
--- /dev/null
+++ b/proposals/0000-hyperdrive-hashes.md
@@ -0,0 +1,178 @@

Title: **DEP-0000: Hyperdrive File Hashes**

Short Name: `0000-hyperdrive-hashes`

Type: Standard

Status: Undefined (as of YYYY-MM-DD)

Github PR: (add HTTPS link here after PR is opened)

Authors: [Bryan Newbold](https://github.com/bnewbold)


# Summary
[summary]: #summary

Full-file hashes are optionally included in hyperdrive metadata to complement
the existing cryptographic-strength hashing of sub-file chunks. Multiple
popular hash algorithms can be included at the same time.


# Motivation
[motivation]: #motivation

Naming, discovering, and cataloging data "by content" (that is, by a
fixed-size hash of the data) is a powerful pattern for building robust
distributed systems. Dat is one among several such systems. Unfortunately,
interoperability between such systems, or layering them on top of each other,
is difficult because each tends to adopt its own hashing norms and formats.
Design variances can include hash algorithm selection, hash configuration,
salting, data chunking, and intermediate Merkle tree data formats.

As a concrete example, the `sha1sum` command-line tool, the bittorrent P2P
protocol, and the git version control software all use the SHA-1 algorithm to
hash file contents. However, one cannot use the simple `sha1sum` hash of a
given file to check whether that file is the same as one referenced in a
bittorrent `.torrent` file or in git metadata, because each system calculates
the hash in a different way. Bittorrent combines all files in the torrent into
a single stream, splits that stream into fixed-size chunks, and hashes each
chunk separately; the chunk boundaries usually do not correspond to individual
files. git prepends a fixed header containing the size of the file (in bytes)
before hashing and storing the file as a "blob". This makes comparison or
interoperability between these systems impossible without either a universal
cross-hash table (infeasible to build in the general case) or the full file
contents on hand to compare or re-hash in all three formats.
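To make the divergence concrete, here is a minimal Python sketch (purely
illustrative, not part of this proposal; the function names are invented for
the example) contrasting a plain `sha1sum`-style digest with git's "blob"
digest of the same bytes:

```python
import hashlib

def plain_sha1(data: bytes) -> str:
    # Hash the raw file contents, as `sha1sum` does
    return hashlib.sha1(data).hexdigest()

def git_blob_sha1(data: bytes) -> str:
    # git hashes "blob <size>\0" followed by the contents
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

content = b"hello world\n"
print(plain_sha1(content))     # matches `sha1sum <file>`
print(git_blob_sha1(content))  # matches `git hash-object <file>`
```

Both digests are SHA-1, but they do not match for the same file, so neither
system can look up content by the other's hash.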
The design decisions behind these hash variants are usually well-founded,
motivated by security concerns (such as pre-image attacks), efficiency, and
implementation constraints.

By adding simple full-file hashes as optional complementary metadata in our
distributed data systems, we can enable interoperability and significant
efficiency gains.

For example, a large collection of files could be stored on disk in a simple
format, indexed by a popular hash format. Gateway clients to several P2P
networks could make the same files accessible by storing the (relatively
small) metadata separately for each network, while accessing the file contents
from the shared store by a common hash.

In the case of Dat, a particular efficiency this would enable is fast
de-duplication of file storage between multiple Dat archives at the full-file
level, instead of at the chunk level (which is sensitive to changes in the
chunking algorithm).


# Usage Documentation
[usage-documentation]: #usage-documentation

Implementations would include hashes as file-level metadata along with the
existing "stat" fields.

Existing API methods would include options to control generation of hashes
(and which types) when creating a new drive or adding files.


# Reference Documentation
[reference-documentation]: #reference-documentation

Hashes would be stored as additional fields in hyperdrive's existing `Stat`
protobuf message, with the following structure:

```protobuf
message Stat {
  // New in this proposal: one entry per full-file hash
  message ExtraHash {
    required uint32 type = 1;
    required bytes value = 2;
  }
  // Existing fields, unchanged
  required uint32 mode = 1;
  optional uint32 uid = 2;
  optional uint32 gid = 3;
  optional uint64 size = 4;
  optional uint64 blocks = 5;
  optional uint64 offset = 6;
  optional uint64 byteOffset = 7;
  optional uint64 mtime = 8;
  optional uint64 ctime = 9;
  // New in this proposal
  repeated ExtraHash hashes = 10;
}
```

`type` is a number identifying the hash algorithm, and `value` is the
bytestring of the hash digest itself. The length of the hash digest (in bytes)
is available from protobuf metadata for `value`. This scheme, and the `type`
value table, is intended to be interoperable with the [multihash][multihash]
scheme from the IPFS community.

A subset of the multihash hash digest table:

```
  md5          0x00D5
  sha1         0x0011
  sha2-256     0x0012
  sha2-512     0x0013
  blake2b-256  0xB220
```

Multiple hashes would be calculated in parallel with the existing
chunking/hashing process, in a streaming fashion. The final hashes would be
computed when chunking is complete, and included in the `Stat` metadata.
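As an informal sketch of this single-pass approach (in Python for brevity;
the function name, chunk size, and choice of digests are assumptions for
illustration, and a real implementation would feed from hyperdrive's existing
chunking stream rather than reading the file directly):

```python
import hashlib

# Multihash type codes, from the table above
SHA1 = 0x0011
BLAKE2B_256 = 0xB220

def full_file_hashes(path, chunk_size=65536):
    """Compute several full-file digests in a single streaming pass."""
    hashers = {
        SHA1: hashlib.sha1(),
        BLAKE2B_256: hashlib.blake2b(digest_size=32),  # blake2b-256
    }
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            # Every chunk feeds every hash state, so the file is read once
            for h in hashers.values():
                h.update(chunk)
    # Finalized (type, value) pairs, as they would appear in Stat.hashes
    return [(t, h.digest()) for t, h in hashers.items()]
```

Because all digests share one read stream, I/O cost stays constant as hash
algorithms are added; only the CPU cost grows, which is the trade-off
discussed under Drawbacks below.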
For 2018, the recommended default full-file hash functions to include are
`sha1` (for popularity and interoperability) and `blake2b-256` (already used
in other parts of the Dat protocol stack).

[multihash]: https://multiformats.io/multihash/


# Drawbacks
[drawbacks]: #drawbacks

The metadata storage overhead (on a per-file basis) should be minimal, but the
additional computational resources needed to hash a large file several times
are non-trivial on machines with a single core (or few cores), even when the
hashes are computed in a parallel, streaming fashion.


# Security and Privacy Concerns
[privacy]: #privacy

Optional fields may leak additional bits of user-specific configuration
metadata, analogous to the "[evercookie][]" and "[panopticlick][]" browser
fingerprinting issues.

[evercookie]: https://en.wikipedia.org/wiki/Evercookie
[panopticlick]: https://panopticlick.eff.org/


# Rationale and alternatives
[alternatives]: #alternatives

Users wanting this metadata could instead maintain a manifest file (mapping
paths to hashes) inside the Dat archive itself. The Dat client could support
this with a special mode or flag. One downside of this approach is that, for
large archives, the manifest file would need to be updated and duplicated for
every new or modified file.


# Unresolved questions
[unresolved]: #unresolved-questions

What does the user-facing API look like, specifically?

Should we allow non-standard hashes, like the git "hash", or higher-level
references like (single-file) bittorrent magnet links or IPFS file references?

Modifying a small part of a large file requires re-hashing the entire file,
which is slow. Should we skip including the updated hashes in this case? This
is currently mitigated by the fact that we duplicate the entire file when
recording changes or additions.


# Changelog
[changelog]: #changelog

- 2018-03-17: First draft for comment.