-rw-r--r--  README.md                                            |  38
-rw-r--r--  TODO                                                 |   1
-rw-r--r--  papers/buttler_structural_similarity.pdf             | bin 0 -> 343333 bytes
-rw-r--r--  papers/charikar_simhash.pdf                          | bin 0 -> 158779 bytes
-rw-r--r--  papers/citeseerx_simhash_shingle.pdf                 | bin 0 -> 495992 bytes
-rw-r--r--  papers/defense_minhash_simhash.pdf                   | bin 0 -> 422080 bytes
-rw-r--r--  papers/google_simhash.pdf                            | bin 0 -> 157615 bytes
-rw-r--r--  papers/redundancy_elimination_large_collection.pdf   | bin 0 -> 183552 bytes
-rw-r--r--  papers/scalable_document_fingerprinting_abstract.pdf | bin 0 -> 167253 bytes
-rw-r--r--  updates_20200525.md                                  |  25
10 files changed, 50 insertions(+), 14 deletions(-)
diff --git a/README.md b/README.md
index 54782be..ba909a7 100644
--- a/README.md
+++ b/README.md
@@ -48,6 +48,7 @@ Some notable algorithms to consider are:
- simhash (aka, Charikar hashing)
- minhash
- weighted minhash
+- shingle methods (with dozens of feature hashes per document)
However, the papers describing these algorithms have a few parameters (eg, hash
bit length, token hash algorithm), and leave several important steps up to the
@@ -110,34 +111,39 @@ Proposed extraction and tokenization:
- specifically, no zero-width or non-printing unicode modifiers
- numbers (unless part of an alphanumeric string, eg an acronym) should not be
included
-- TODO: instead, strip all numeric characters?
+- TODO: instead, just strip all numeric characters?
- OPTIONALLY, a language-specific stop-list appropriate for search-engine
indexing may be used.
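
A minimal sketch of the tokenization rules above (the function name `tokenize`, the `\w+` token pattern, and the Unicode `Cf`-category check for non-printing modifiers are my assumptions, not part of the proposal):

```python
import re
import unicodedata

def tokenize(text, stoplist=frozenset()):
    # Drop zero-width / non-printing modifiers (Unicode category "Cf", Format)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # Lowercase alphanumeric tokens; keep alphanumeric strings (eg, acronyms)
    # but drop purely numeric tokens
    tokens = [t for t in re.findall(r"\w+", text.lower()) if not t.isdigit()]
    # OPTIONALLY filter against a language-specific stop-list
    return [t for t in tokens if t not in stoplist]
```

Note that removing a zero-width joiner fuses the characters around it into one token, which is usually the desired behavior for obfuscated text.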
Proposed algorithm and parameters:
- `simhash` algorithm
-- TODO: fast token hashing algorithm?
-- TODO: 64x signed 8-bit buckets during calculation
-- 64bit length final form
-- base32 (non-capitalization-sensitive) when serialized as a string
-- TODO: little-endian? does base32 specify this?
+- 64-bit `jenkins` hash algorithm for tokens
+- 64x signed buckets (of 64-bit depth) during calculation
+- 64-bit final form
+- base32 when serialized as a string (non-capitalization-sensitive)
-Recommended lookup parameters:
+Note: endianness does not need to be specified, as the 64-bit feature hashes are
+defined as 8-byte octet strings, not 64-bit integers.
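
A minimal sketch of the bucket-summing calculation described above. The `jenkins` hash is not in the Python standard library, so a truncated `blake2b` stands in for the token hash here; the function names and the bit ordering are my assumptions:

```python
import base64
import hashlib

def token_hash(token):
    # Stand-in 64-bit token hash (the proposal specifies `jenkins`)
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def simhash(tokens):
    buckets = [0] * 64  # 64 signed buckets, one per output bit
    for tok in tokens:
        h = token_hash(tok)
        for i in range(64):
            # Bit set -> increment the bucket; bit clear -> decrement it
            buckets[i] += 1 if (h >> i) & 1 else -1
    # 64-bit final form: output bit i is 1 iff bucket i summed positive
    return sum(1 << i for i in range(64) if buckets[i] > 0)

def serialize(fp):
    # base32 over the 8-byte octet string; lowercased, since the encoding
    # is non-capitalization-sensitive on decode
    return base64.b32encode(fp.to_bytes(8, "big")).decode("ascii").lower()
```

Because identical token multisets produce identical bucket sums, the fingerprint is fully deterministic for a given tokenization.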
-- TODO: how many columns in rotation?
-- k=3 (number of bits to filter on)
-- TODO: Hamming distance
-- TODO: 0.XYZ considered a loose match, 0.XYZ considered a close match
+Recommended default lookup parameters (these are all optional):
+
+- k=3 ("distance", or number of differing bits to allow when filtering)
+- Hamming distance as a similarity metric (for speed)
+- 0.90 considered a loose match, 0.98 considered a close match
+- create 16 tables keyed on 28 bits for lookup, by first choosing 1 out of 4
+  16-bit blocks, then subdividing the remaining 48 bits into 4 12-bit blocks
+  and choosing one of them, as described in Manku et al. 3.1.1
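
The lookup parameters above can be sketched as follows (function names and the exact block ordering are my assumptions; the 16-key construction follows the Manku et al. 3.1.1 scheme described in the list):

```python
def hamming(a, b):
    # Number of differing bits between two 64-bit fingerprints
    return bin(a ^ b).count("1")

def similarity(a, b):
    # Fraction of matching bits; per the proposal, >= 0.90 is a loose
    # match and >= 0.98 a close match
    return 1.0 - hamming(a, b) / 64.0

def table_keys(fp):
    # 16 lookup keys of 28 bits each: pick one of the four 16-bit blocks,
    # then one of the four 12-bit blocks of the remaining 48 bits, and
    # concatenate them. If two fingerprints differ in at most k=3 bits,
    # at least one of the 16 keys is guaranteed to collide.
    blocks16 = [(fp >> (48 - 16 * i)) & 0xFFFF for i in range(4)]
    keys = []
    for i in range(4):
        rest = 0  # the remaining 48 bits, in order
        for j in range(4):
            if j != i:
                rest = (rest << 16) | blocks16[j]
        for k in range(4):
            block12 = (rest >> (36 - 12 * k)) & 0xFFF
            keys.append((blocks16[i] << 12) | block12)
    return keys
```

The collision guarantee follows from the pigeonhole principle: with 3 differing bits, one of the four 16-bit blocks must be unchanged, and among the remaining 48 bits one of the four 12-bit blocks must be unchanged as well.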
## Existing Libraries and Implementations
These don't necessarily implement the schemes above, but they do implement the
basic algorithms.
-- [seomoz/simhash-cpp](https://github.com/seomoz/simhash-cpp) (C++): simhash,
- used in production for web de-dupe
-- [simhash](https://github.com/JanX2/simhash) (C++): Google's simhash
+- [seomoz/simhash-cpp](https://github.com/seomoz/simhash-cpp) (C++) and
+ [seomoz/simhash-py](https://github.com/seomoz/simhash-py) (python, wrapping
+ the C++ library): simhash, used in production for web de-duplication
+- [simhash](https://github.com/JanX2/simhash) (C++): mirror of Google's simhash
implementation (old?)
- [datasketch](https://ekzhu.github.io/datasketch/) (Python): minhash, Jaccard
similarity, minhash-based indexing with Jaccard threshold, Top-K, or
@@ -154,6 +160,8 @@ Other resources:
plugin for a general-purpose search engine
- [bbalet/stopwords](https://github.com/bbalet/stopwords) (Golang): for a
dozen+ languages. also does HTML stripping
+- [soundcloud/cosine-lsh-jo](https://github.com/soundcloud/cosine-lsh-jo)
+ (Spark): nearest-neighbor clustering
## References
@@ -166,6 +174,8 @@ Other resources:
Proceedings of the thirty-fourth annual ACM symposium on Theory of computing,
ACM Press, 2002"
+> The original simhash paper
+
"Manku, Jain, Sarma: Detecting Near-Duplicates for Web Crawling". in
Proceedings of the 16th international conference on World Wide Web, ACM Press,
2007
diff --git a/TODO b/TODO
index c5cf9b5..28059f8 100644
--- a/TODO
+++ b/TODO
@@ -1,2 +1,3 @@
- small set of example files, extracted text, tokenized, token-hashes, and
complete hashes (in this repo)
+- package moz's simhash-cpp for debian
diff --git a/papers/buttler_structural_similarity.pdf b/papers/buttler_structural_similarity.pdf
new file mode 100644
index 0000000..0e5cbee
--- /dev/null
+++ b/papers/buttler_structural_similarity.pdf
Binary files differ
diff --git a/papers/charikar_simhash.pdf b/papers/charikar_simhash.pdf
new file mode 100644
index 0000000..ffe9863
--- /dev/null
+++ b/papers/charikar_simhash.pdf
Binary files differ
diff --git a/papers/citeseerx_simhash_shingle.pdf b/papers/citeseerx_simhash_shingle.pdf
new file mode 100644
index 0000000..0e76b98
--- /dev/null
+++ b/papers/citeseerx_simhash_shingle.pdf
Binary files differ
diff --git a/papers/defense_minhash_simhash.pdf b/papers/defense_minhash_simhash.pdf
new file mode 100644
index 0000000..e127a3b
--- /dev/null
+++ b/papers/defense_minhash_simhash.pdf
Binary files differ
diff --git a/papers/google_simhash.pdf b/papers/google_simhash.pdf
new file mode 100644
index 0000000..a9320ac
--- /dev/null
+++ b/papers/google_simhash.pdf
Binary files differ
diff --git a/papers/redundancy_elimination_large_collection.pdf b/papers/redundancy_elimination_large_collection.pdf
new file mode 100644
index 0000000..03525a8
--- /dev/null
+++ b/papers/redundancy_elimination_large_collection.pdf
Binary files differ
diff --git a/papers/scalable_document_fingerprinting_abstract.pdf b/papers/scalable_document_fingerprinting_abstract.pdf
new file mode 100644
index 0000000..9048b5c
--- /dev/null
+++ b/papers/scalable_document_fingerprinting_abstract.pdf
Binary files differ
diff --git a/updates_20200525.md b/updates_20200525.md
new file mode 100644
index 0000000..6b8082b
--- /dev/null
+++ b/updates_20200525.md
@@ -0,0 +1,25 @@
+
+There seem to be a bunch of new resources since I first researched this topic in 2017.
+
+Chapter 3: Finding Similar Items
+of Mining Massive Datasets textbook
+http://www.mmds.org/
+http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
+
+1 + 1 = 1 or Record Deduplication with Python
+https://www.youtube.com/watch?v=4O87RdBgRJ4&feature=youtu.be
+
+Entity Resolution: Introduction
+https://www2.cs.duke.edu/courses/spring17/compsci590.1/lectures/14-er-intro.pdf
+
+Document Deduplication with Locality Sensitive Hashing
+https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html
+
+"How to speedup MinHash LSH indexing for hundreds of millions of MinHashes?"
+https://github.com/ekzhu/datasketch/issues/41
+
+"MinHash LSH for document clustering"
+https://github.com/ekzhu/datasketch/issues/120
+
+"Easy computation of all duplicates?"
+https://github.com/ekzhu/datasketch/issues/76