2 files changed, 23 insertions, 14 deletions
@@ -48,6 +48,7 @@ Some notable algorithms to consider are:
- simhash (aka, Charikar hashing)
- weighted minhash
+- shingle methods (with dozens of feature hashes per document)
However, the papers describing these algorithms have a few parameters (eg, hash
bit length, token hash algorithm), and leave several important steps up to the
@@ -110,34 +111,39 @@ Proposed extraction and tokenization:
- specifically, no zero-width or non-printing unicode modifiers
- numbers (unless part of an alphanumeric string, eg an acronym) should not be
  included
-- TODO: instead, strip all numeric characters?
+- TODO: instead, just strip all numeric characters?
- OPTIONALLY, a language-specific stop-list appropriate for search-engine
  indexing may be used (see the tokenization sketch after this list).
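+
+As a concrete illustration, here is a minimal tokenization sketch in Python.
+The function and parameter names are illustrative only, and filtering on the
+`Cf` (format) unicode category is an assumption about how "non-printing
+modifiers" would be stripped:
+
+```python
+import unicodedata
+
+def tokenize(text, stoplist=None):
+    """Split on whitespace, strip non-printing code points, and drop
+    purely numeric tokens (alphanumeric strings like acronyms are kept)."""
+    tokens = []
+    for raw in text.split():
+        # drop zero-width and other non-printing characters
+        # (unicode category Cf: format characters, eg zero-width joiner)
+        tok = "".join(c for c in raw if unicodedata.category(c) != "Cf")
+        if not tok or tok.isdigit():
+            continue
+        # OPTIONAL language-specific stop-list, as described above
+        if stoplist is not None and tok.lower() in stoplist:
+            continue
+        tokens.append(tok)
+    return tokens
+```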
Proposed algorithm and parameters:
- `simhash` algorithm
-- TODO: fast token hashing algorithm?
-- TODO: 64x signed 8-bit buckets during calculation
-- 64bit length final form
-- base32 (non-capitalization-sensitive) when serialized as a string
-- TODO: little-endian? does base32 specify this?
+- 64-bit `jenkins` hash algorithm for tokens
+- 64 signed buckets (64-bit depth) during calculation, one per output bit
+- 64-bit final form
+- base32 (case-insensitive) when serialized as a string
-Recommended lookup parameters:
+Note: endianness does not need to be specified, as the 64-bit feature hashes
+are defined as 8-byte octet strings, not 64-bit integers (a sketch of the
+calculation and serialization follows).
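+
+A minimal sketch of the fingerprint calculation and serialization under these
+parameters (Python; the 64-bit Jenkins token hash is assumed to be supplied
+by the caller, since the standard library does not provide one, and the byte
+order in `to_base32` is an illustrative choice, per the note above):
+
+```python
+import base64
+
+def simhash64(token_hashes):
+    """Compute a 64-bit simhash from an iterable of 64-bit token hashes.
+    One signed bucket per output bit: each token votes +1/-1 per bit
+    position; bits with a positive final bucket are set to 1."""
+    buckets = [0] * 64
+    for h in token_hashes:
+        for i in range(64):
+            buckets[i] += 1 if (h >> i) & 1 else -1
+    fp = 0
+    for i, b in enumerate(buckets):
+        if b > 0:
+            fp |= 1 << i
+    return fp
+
+def to_base32(fp):
+    # serialize as an 8-byte octet string, then case-insensitive base32
+    return base64.b32encode(fp.to_bytes(8, "big")).decode("ascii")
+```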
-- TODO: how many columns in rotation?
-- k=3 (number of bits to filter on)
-- TODO: Hamming distance
-- TODO: 0.XYZ considered a loose match, 0.XYZ considered a close match
+Recommended default lookup parameters (these are all optional):
+- k=3 ("distance", or number of differing bits to allow when filtering)
+- Hamming distance as the comparison metric (chosen for speed)
+- similarity of 0.90 considered a loose match, 0.98 a close match
+- create 16 lookup tables keyed on 28 bits each, by first choosing 1 of 4
+  16-bit blocks, then subdividing the remaining 48 bits into 4 12-bit blocks
+  and choosing one of them, as described in Manku et al., section 3.1.1 (see
+  the sketch after this list)
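+
+A sketch of the table construction and candidate filtering (Python; the names
+are illustrative, and defining similarity as 1 minus the normalized Hamming
+distance is an assumption):
+
+```python
+def lookup_keys(fp):
+    """Yield 16 (table_id, 28-bit key) pairs for a 64-bit fingerprint,
+    per Manku et al. 3.1.1: choose 1 of 4 16-bit blocks, then 1 of 4
+    12-bit blocks from the remaining 48 bits."""
+    blocks = [(fp >> (48 - 16 * i)) & 0xFFFF for i in range(4)]
+    for i in range(4):
+        rest = 0
+        for j in range(4):
+            if j != i:
+                rest = (rest << 16) | blocks[j]  # the remaining 48 bits
+        for m in range(4):
+            sub = (rest >> (36 - 12 * m)) & 0xFFF
+            yield (4 * i + m, (blocks[i] << 12) | sub)
+
+def similarity(a, b):
+    # 1 - normalized Hamming distance over the 64-bit fingerprints
+    return 1.0 - bin(a ^ b).count("1") / 64.0
+```
+
+At index time a fingerprint is inserted into all 16 tables under its 16 keys;
+at query time the same keys are probed and candidates are verified by Hamming
+distance. With k=3, any pair of fingerprints within distance 3 is guaranteed
+to share at least one of these keys.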
## Existing Libraries and Implementations
These don't necessarily implement the schemes above, but they do implement the
same underlying algorithms:
-- [seomoz/simhash-cpp](https://github.com/seomoz/simhash-cpp) (C++): simhash,
- used in production for web de-dupe
-- [simhash](https://github.com/JanX2/simhash) (C++): Google's simhash
+- [seomoz/simhash-cpp](https://github.com/seomoz/simhash-cpp) (C++) and
+  [seomoz/simhash-py](https://github.com/seomoz/simhash-py) (Python, wrapping
+ the C++ library): simhash, used in production for web de-duplication
+- [simhash](https://github.com/JanX2/simhash) (C++): mirror of Google's simhash
- [datasketch](https://ekzhu.github.io/datasketch/) (Python): minhash, Jaccard
similarity, minhash-based indexing with Jaccard threshold, Top-K, or
@@ -166,6 +172,8 @@ Other resources:
Proceedings of the thirty-fourth annual ACM symposium on Theory of computing,
ACM Press, 2002"
+> The original simhash paper
"Manku, Jain, Sarma: Detecting Near-Duplicates for Web Crawling". in
Proceedings of the 16th international conference on World Wide Web, ACM Press,
@@ -1,2 +1,3 @@
- small set of example files, extracted text, tokenized, token-hashes, and
complete hashes (in this repo)
+- package seomoz's simhash-cpp for Debian