 README.md                                            |  38 ++--
 TODO                                                 |   1 +
 papers/buttler_structural_similarity.pdf             | Bin 0 -> 343333 bytes
 papers/charikar_simhash.pdf                          | Bin 0 -> 158779 bytes
 papers/citeseerx_simhash_shingle.pdf                 | Bin 0 -> 495992 bytes
 papers/defense_minhash_simhash.pdf                   | Bin 0 -> 422080 bytes
 papers/google_simhash.pdf                            | Bin 0 -> 157615 bytes
 papers/redundancy_elimination_large_collection.pdf   | Bin 0 -> 183552 bytes
 papers/scalable_document_fingerprinting_abstract.pdf | Bin 0 -> 167253 bytes
 updates_20200525.md                                  |  25 +
 10 files changed, 50 insertions(+), 14 deletions(-)
@@ -48,6 +48,7 @@ Some notable algorithms to consider are:
 - simhash (aka, Charikar hashing)
 - minhash
 - weighted minhash
+- shingle methods (with dozens of feature hashes per document)
 
 However, the papers describing these algorithms have a few parameters (eg,
 hash bit length, token hash algorithm), and leave several important steps up to the
@@ -110,34 +111,39 @@ Proposed extraction and tokenization:
 - specifically, no zero-width or non-printing unicode modifiers
 - numbers (unless part of an alphanumeric string, eg an acronym) should not
   be included
-- TODO: instead, strip all numeric characters?
+- TODO: instead, just strip all numeric characters?
 - OPTIONALLY, a language-specific stop-list appropriate for search-engine
   indexing may be used.
 
 Proposed algorithm and parameters:
 
 - `simhash` algorithm
-- TODO: fast token hashing algorithm?
-- TODO: 64x signed 8-bit buckets during calculation
-- 64bit length final form
-- base32 (non-capitalization-sensitive) when serialized as a string
-- TODO: little-endian? does base32 specify this?
+- 64-bit `jenkins` hash algorithm for tokens
+- 64x signed buckets (of 64-bit depth) during calculation
+- 64-bit final form
+- base32 when serialized as a string (non-capitalization-sensitive)
 
-Recommended lookup parameters:
+Note: endianness does not need to be specified, as the 64-bit feature hashes are
+defined as 8-byte octet strings, not 64-bit integers.
 
-- TODO: how many columns in rotation?
-- k=3 (number of bits to filter on)
-- TODO: Hamming distance
-- TODO: 0.XYZ considered a loose match, 0.XYZ considered a close match
+Recommended default lookup parameters (these are all optional):
+
+- k=3 ("distance", or number of differing bits to allow when filtering)
+- Hamming distance as a similarity metric (for speed)
+- 0.90 considered a loose match, 0.98 considered a close match
+- create 16 tables of 28 bits for lookup, by first choosing 1 out of 4 16-bit
+  blocks, then subdividing the remaining 48 bits into 4 12-bit blocks and
+  choosing one of them, as described in Manku et al. 3.1.1.
 
 ## Existing Libraries and Implementations
 
 These don't necessarily implement the schemes above, but they do implement the
 basic algorithms.
 
-- [seomoz/simhash-cpp](https://github.com/seomoz/simhash-cpp) (C++): simhash,
-  used in production for web de-dupe
-- [simhash](https://github.com/JanX2/simhash) (C++): Google's simhash
+- [seomoz/simhash-cpp](https://github.com/seomoz/simhash-cpp) (C++) and
+  [seomoz/simhash-py](https://github.com/seomoz/simhash-py) (Python, wrapping
+  the C++ library): simhash, used in production for web de-duplication
+- [simhash](https://github.com/JanX2/simhash) (C++): mirror of Google's simhash
   implementation (old?)
 - [datasketch](https://ekzhu.github.io/datasketch/) (Python): minhash, Jaccard
   similarity, minhash-based indexing with Jaccard threshold, Top-K, or
@@ -154,6 +160,8 @@ Other resources:
   plugin for a general-purpose search engine
 - [bbalet/stopwords](https://github.com/bbalet/stopwords) (Golang): for a
   dozen+ languages. also does HTML stripping
+- [soundcloud/cosine-lsh-jo](https://github.com/soundcloud/cosine-lsh-jo)
+  (Spark): nearest-neighbor clustering
 
 ## References
 
@@ -166,6 +174,8 @@ Other resources:
   Proceedings of the thirty-fourth annual ACM symposium on Theory of
   computing, ACM Press, 2002"
 
+> The original simhash paper
+
 "Manku, Jain, Sarma: Detecting Near-Duplicates for Web Crawling".
   in Proceedings of the 16th international conference on World Wide Web, ACM
   Press, 2007
diff --git a/TODO b/TODO
@@ -1,2 +1,3 @@
 - small set of example files, extracted text, tokenized, token-hashes, and
   complete hashes (in this repo)
+- package moz's simhash-cpp for debian
diff --git a/papers/buttler_structural_similarity.pdf b/papers/buttler_structural_similarity.pdf
new file mode 100644
index 0000000..0e5cbee
--- /dev/null
+++ b/papers/buttler_structural_similarity.pdf
Binary files differ
diff --git a/papers/charikar_simhash.pdf b/papers/charikar_simhash.pdf
new file mode 100644
index 0000000..ffe9863
--- /dev/null
+++ b/papers/charikar_simhash.pdf
Binary files differ
diff --git a/papers/citeseerx_simhash_shingle.pdf b/papers/citeseerx_simhash_shingle.pdf
new file mode 100644
index 0000000..0e76b98
--- /dev/null
+++ b/papers/citeseerx_simhash_shingle.pdf
Binary files differ
diff --git a/papers/defense_minhash_simhash.pdf b/papers/defense_minhash_simhash.pdf
new file mode 100644
index 0000000..e127a3b
--- /dev/null
+++ b/papers/defense_minhash_simhash.pdf
Binary files differ
diff --git a/papers/google_simhash.pdf b/papers/google_simhash.pdf
new file mode 100644
index 0000000..a9320ac
--- /dev/null
+++ b/papers/google_simhash.pdf
Binary files differ
diff --git a/papers/redundancy_elimination_large_collection.pdf b/papers/redundancy_elimination_large_collection.pdf
new file mode 100644
index 0000000..03525a8
--- /dev/null
+++ b/papers/redundancy_elimination_large_collection.pdf
Binary files differ
diff --git a/papers/scalable_document_fingerprinting_abstract.pdf b/papers/scalable_document_fingerprinting_abstract.pdf
new file mode 100644
index 0000000..9048b5c
--- /dev/null
+++ b/papers/scalable_document_fingerprinting_abstract.pdf
Binary files differ
diff --git a/updates_20200525.md b/updates_20200525.md
new file mode 100644
index 0000000..6b8082b
--- /dev/null
+++ b/updates_20200525.md
@@ -0,0 +1,25 @@
+
+There seem to be a bunch of new resources since I first researched this topic in 2017.
+
+Chapter 3: Finding Similar Items
+of Mining Massive Datasets textbook
+http://www.mmds.org/
+http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
+
+1 + 1 = 1 or Record Deduplication with Python
+https://www.youtube.com/watch?v=4O87RdBgRJ4&feature=youtu.be
+
+Entity Resolution: Introduction
+https://www2.cs.duke.edu/courses/spring17/compsci590.1/lectures/14-er-intro.pdf
+
+Document Deduplication with Locality Sensitive Hashing
+https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html
+
+"How to speedup MinHash LSH indexing for hundreds of millions of MinHashes?"
+https://github.com/ekzhu/datasketch/issues/41
+
+"MinHash LSH for document clustering"
+https://github.com/ekzhu/datasketch/issues/120
+
+"Easy computation of all duplicates?"
+https://github.com/ekzhu/datasketch/issues/76
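Stepping outside the diff for a moment: the simhash scheme the README changes above propose (64-bit token hashes accumulated into 64 signed buckets, collapsed to a 64-bit fingerprint) can be sketched roughly as below. This is an illustrative sketch, not the repo's implementation: it substitutes Python's stdlib `blake2b` for the proposed jenkins token hash, and a crude lowercase-alphabetic tokenization for the proposed extraction rules.

```python
import hashlib
import re

def token_hash(token: str) -> int:
    # 64-bit token hash; a stand-in for the proposed 64-bit jenkins hash,
    # which is not in the Python standard library
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def simhash(text: str) -> int:
    # 64 signed accumulators, one per output bit, as proposed above
    buckets = [0] * 64
    # crude lowercase alphabetic tokenization, only for illustration
    for tok in re.findall(r"[a-z]+", text.lower()):
        h = token_hash(tok)
        for i in range(64):
            buckets[i] += 1 if (h >> i) & 1 else -1
    # collapse: a positive bucket sets the corresponding fingerprint bit
    fp = 0
    for i in range(64):
        if buckets[i] > 0:
            fp |= 1 << i
    return fp
```

The property that makes this useful for de-duplication is that near-identical token streams mostly push the same buckets in the same direction, so their fingerprints differ in only a few bit positions.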
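The lookup parameters proposed above (Hamming distance, the 0.90/0.98 thresholds, and the 16 tables of 28-bit keys per Manku et al. 3.1.1) can be sketched as follows; the `table_keys` key packing is one possible layout, not a definitive one. By pigeonhole, two fingerprints within Hamming distance 3 leave at least one 16-bit block and, within the remaining 48 bits, at least one 12-bit block untouched, so they agree exactly on the key of at least one table.

```python
def hamming(a: int, b: int) -> int:
    # number of differing bits between two 64-bit fingerprints
    return bin(a ^ b).count("1")

def similarity(a: int, b: int) -> float:
    # fraction of agreeing bits out of 64; 0.90 = loose, 0.98 = close match
    return 1.0 - hamming(a, b) / 64.0

def table_keys(fp: int):
    # Yield 16 (table_id, 28-bit key) pairs: choose 1 of 4 16-bit blocks,
    # then 1 of 4 12-bit blocks from the remaining 48 bits
    blocks = [(fp >> (16 * i)) & 0xFFFF for i in range(4)]
    for i in range(4):
        # concatenate the other three 16-bit blocks into 48 bits
        rest = 0
        shift = 0
        for j in range(4):
            if j != i:
                rest |= blocks[j] << shift
                shift += 16
        for k in range(4):
            sub = (rest >> (12 * k)) & 0xFFF
            yield (4 * i + k, (blocks[i] << 12) | sub)
```

A candidate set is then built by exact key lookup in each of the 16 tables, with the cheap `hamming` check applied only to those candidates.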
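The base32 string form and the octet-string/endianness note above can be sketched like this; the big-endian byte order here is an illustrative choice standing in for whatever the "8-byte octet string" definition pins down.

```python
import base64

def fingerprint_to_str(fp: int) -> str:
    # serialize the 64-bit fingerprint as an 8-byte octet string, base32,
    # lowercased (the comparison is non-capitalization-sensitive)
    raw = fp.to_bytes(8, "big")  # byte order fixed by the octet-string definition
    return base64.b32encode(raw).decode("ascii").rstrip("=").lower()

def str_to_fingerprint(s: str) -> int:
    # 13 base32 chars carry 64 bits; re-pad to a multiple of 8 chars
    raw = base64.b32decode(s.upper() + "=" * 3)
    return int.from_bytes(raw, "big")
```

Dropping the `===` padding keeps the serialized form at a fixed 13 characters, which is convenient as a database key.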