author | Bryan Newbold <bnewbold@archive.org> | 2022-09-07 18:47:56 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2022-09-07 18:47:56 -0700 |
commit | 09d783c9d8557649cbe4fca91145d0236b8f6090 | |
tree | c96db080174162cea5552aa42b9de238d3e412e6 | |
parent | 880dcee30da19c7802f1c1a40448d06f886024d7 | |
-rw-r--r-- | README.md | 2 |
-rw-r--r-- | updates_20200525.md | 25 |
2 files changed, 27 insertions, 0 deletions
@@ -160,6 +160,8 @@ Other resources:
   plugin for a general-purpose search engine
 - [bbalet/stopwords](https://github.com/bbalet/stopwords)
   (Golang): for a dozen+ languages. also does HTML stripping
+- [soundcloud/cosine-lsh-jo](https://github.com/soundcloud/cosine-lsh-jo)
+  (Spark): nearest-neighbor clustering
 
 ## References
 
diff --git a/updates_20200525.md b/updates_20200525.md
new file mode 100644
index 0000000..6b8082b
--- /dev/null
+++ b/updates_20200525.md
@@ -0,0 +1,25 @@
+
+There seem to be a bunch of new resources since I first researched this topic in 2017.
+
+Chapter 3: Finding Similar Items
+of Mining Massive Datasets textbook
+http://www.mmds.org/
+http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
+
+1 + 1 = 1 or Record Deduplication with Python
+https://www.youtube.com/watch?v=4O87RdBgRJ4&feature=youtu.be
+
+Entity Resolution: Introduction
+https://www2.cs.duke.edu/courses/spring17/compsci590.1/lectures/14-er-intro.pdf
+
+Document Deduplication with Locality Sensitive Hashing
+https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html
+
+"How to speedup MinHash LSH indexing for hundreds of millions of MinHashes?"
+https://github.com/ekzhu/datasketch/issues/41
+
+"MinHash LSH for document clustering"
+https://github.com/ekzhu/datasketch/issues/120
+
+"Easy computation of all duplicates?"
+https://github.com/ekzhu/datasketch/issues/76