From 3f1b3fc4d2989eb2c7e1c09184dcf56f8c2e4c15 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Wed, 14 Mar 2018 22:15:21 -0700 Subject: start documenting hash collision procedure --- proposals/0000-hyperdb.md | 62 ++++++++++++++++++++++++++++------------------- 1 file changed, 37 insertions(+), 25 deletions(-) diff --git a/proposals/0000-hyperdb.md b/proposals/0000-hyperdb.md index 94e9da6..6de1114 100644 --- a/proposals/0000-hyperdb.md +++ b/proposals/0000-hyperdb.md @@ -182,7 +182,11 @@ TODO(mafintosh): should `feed` actually be the current hypercore publickey? Every key path has component-wise fixed-size hash representation that is used by the trie. This is written to the `path` field each Node protobuf -message. +message. The concatenation of all path hashes yields a hash for the entire key. +Note that analagously to a hash map data structure, there can be more than one +key (string) with the same key hash in the same database with no problems: the +hash points to a bucket of keys, which can be iterated over linearly to find +the correct value. The path hash is represented by an array of bytes. Elements are 2-bit encoded (values 0, 1, 2, 3), except for an optional terminating element which has value @@ -194,9 +198,9 @@ element hashes plus a terminator). The hash algorithm used is `SipHash-2-4`, with an 8-byte output and 16-byte key; the input is the UTF-8 encoded path string segment, without any `/` separators or terminating null bytes. An implementation of this hash -algorithm is included in the libsodium library, under the function -`crypto_shorthash()`. A 16-byte "secret" key is required; for this use case we -use all zeros. +algorithm is included in the libsodium library in the form of the +`crypto_shorthash()` function. A 16-byte "secret" key is required; for this use +case we use all zeros. When converting the 8-bytehash to an array of 2-bit bytes, the ordering is proceed byte-by-byte, and for each take the two lowest-value bits (aka, `hash & @@ -229,6 +233,12 @@ As another example, the key `/a/b/c` converts into the 97-byte hash array: 0, 1, 1, 0, 1, 2, 3, 2, 2, 2, 0, 0, 3, 1, 2, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 2, 3, 2, 3, 0, 1, 0, 4 ] +Note that "hash collisions" are rare with this hashing scheme, but are likely +to occur with large databases (millions of keys), and collision have been +generated as a proof of concept. Implementions should take care to properly +handle collisions by verifying keys and following bucket pointers (see the next +section). +