add updated links/resources (from 2020)

author: Bryan Newbold <bnewbold@archive.org> 2022-09-07 18:47:56 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2022-09-07 18:47:56 -0700
commit: 09d783c9d8557649cbe4fca91145d0236b8f6090 (patch)
tree: c96db080174162cea5552aa42b9de238d3e412e6
parent: 880dcee30da19c7802f1c1a40448d06f886024d7 (diff)
download: lsh-interop-master.tar.gz
lsh-interop-master.zip
2 files changed, 27 insertions, 0 deletions
diff --git a/README.md b/README.md
index 7f9f2d5..ba909a7 100644
--- a/README.md
+++ b/README.md
@@ -160,6 +160,8 @@ Other resources:
   plugin for a general-purpose search engine
 - [bbalet/stopwords](https://github.com/bbalet/stopwords) (Golang): for a
   dozen+ languages. also does HTML stripping
+- [soundcloud/cosine-lsh-jo](https://github.com/soundcloud/cosine-lsh-jo)
+  (Spark): nearest-neighbor clustering
 
 ## References
 
diff --git a/updates_20200525.md b/updates_20200525.md
new file mode 100644
index 0000000..6b8082b
--- /dev/null
+++ b/updates_20200525.md
@@ -0,0 +1,25 @@
+
+There seem to be a bunch of new resources since I first researched this topic in 2017.
+
+Chapter 3: Finding Similar Items
+of Mining Massive Datasets textbook
+http://www.mmds.org/
+http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
+
+1 + 1 = 1 or Record Deduplication with Python
+https://www.youtube.com/watch?v=4O87RdBgRJ4&feature=youtu.be
+
+Entity Resolution: Introduction
+https://www2.cs.duke.edu/courses/spring17/compsci590.1/lectures/14-er-intro.pdf
+
+Document Deduplication with Locality Sensitive Hashing
+https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html
+
+"How to speedup MinHash LSH indexing for hundreds of millions of MinHashes?"
+https://github.com/ekzhu/datasketch/issues/41
+
+"MinHash LSH for document clustering"
+https://github.com/ekzhu/datasketch/issues/120
+
+"Easy computation of all duplicates?"
+https://github.com/ekzhu/datasketch/issues/76