There seem to be a bunch of new resources since I first researched this topic in 2017:

Chapter 3, "Finding Similar Items", of the Mining of Massive Datasets textbook
http://www.mmds.org/
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

1 + 1 = 1 or Record Deduplication with Python
https://www.youtube.com/watch?v=4O87RdBgRJ4&feature=youtu.be

Entity Resolution: Introduction
https://www2.cs.duke.edu/courses/spring17/compsci590.1/lectures/14-er-intro.pdf

Document Deduplication with Locality Sensitive Hashing
https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html

"How to speedup MinHash LSH indexing for hundreds of millions of MinHashes?"
https://github.com/ekzhu/datasketch/issues/41

"MinHash LSH for document clustering"
https://github.com/ekzhu/datasketch/issues/120

"Easy computation of all duplicates?"
https://github.com/ekzhu/datasketch/issues/76