updates_20200525.md


A bunch of new resources seem to have appeared since I first researched this topic in 2017.

Chapter 3: Finding Similar Items, from the Mining of Massive Datasets textbook
http://www.mmds.org/
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

1 + 1 = 1 or Record Deduplication with Python
https://www.youtube.com/watch?v=4O87RdBgRJ4&feature=youtu.be

Entity Resolution: Introduction
https://www2.cs.duke.edu/courses/spring17/compsci590.1/lectures/14-er-intro.pdf

Document Deduplication with Locality Sensitive Hashing
https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html

"How to speedup MinHash LSH indexing for hundreds of millions of MinHashes?"
https://github.com/ekzhu/datasketch/issues/41

"MinHash LSH for document clustering"
https://github.com/ekzhu/datasketch/issues/120

"Easy computation of all duplicates?"
https://github.com/ekzhu/datasketch/issues/76
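
Most of the links above revolve around MinHash signatures plus LSH banding, so here is a rough pure-Python reminder of the idea. It is only a sketch under my own assumptions: the function names (`shingles`, `minhash_signature`, `estimate_jaccard`, `candidate_pairs`) are made up for illustration and are not the datasketch API, salted `blake2b` hashes stand in for random permutations, and the 64 hashes / 16 bands settings are arbitrary.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (hypothetical helper)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value per seeded hash function.

    Each seed is used as a blake2b salt; this is an assumption standing in
    for the random permutations described in the MMDS chapter.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """LSH banding: keys sharing at least one identical band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for key, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(key)
    pairs = set()
    for keys in buckets.values():
        pairs.update(combinations(sorted(keys), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bank",
    "c": "completely unrelated sentence about hashing large collections of records",
}
signatures = {key: minhash_signature(shingles(text)) for key, text in docs.items()}
pairs = candidate_pairs(signatures)
```

Candidate pairs still need to be checked (with `estimate_jaccard` or an exact comparison) before treating them as duplicates, since banding only narrows down which pairs are worth comparing.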