# fuzzycat (wip) Fuzzy matching publications for [fatcat](https://fatcat.wiki). * [fuzzycat](https://pypi.org/project/fuzzycat/) Note: This is currently work-in-progress. # Example Run Run any clustering algorithm. ``` $ time python -m fuzzycat cluster -t tsandcrawler < data/sample10m.json | \ zstd -c9 > sample_cluster.json.zst 2020-11-18 00:19:48.194 DEBUG __main__ - run_cluster: {"key_fail": 0, "key_ok": 9999938, "key_empty": 62, "key_denylist": 0, "num_clusters": 9040789} real 75m23.045s user 95m14.455s sys 3m39.121s ``` Run verification. ``` $ time zstdcat -T0 sample_cluster.json.zst | python -m fuzzycat verify > sample_verify.txt real 7m56.713s user 8m50.703s sys 0m29.262s ``` Example results over 10M docs: ```json { "miss.appendix": 176, "miss.arxiv_version": 25, "miss.blacklisted": 12082, "miss.blacklisted_fragment": 5, "miss.book_chapter": 46733, "miss.component": 1567, "miss.contrib_intersection_empty": 47691, "miss.dataset_doi": 30806, "miss.num_diff": 1, "miss.release_type": 157718, "miss.short_title": 16263, "miss.subtitle": 6013, "miss.title_filename": 57, "miss.year": 148755, "ok.arxiv_version": 93, "ok.dummy": 88294, "ok.preprint_published": 110, "ok.slug_title_author_match": 15818, "ok.title_author_match": 93240, "skip.container_name_blacklist": 20, "skip.publisher_blacklist": 456, "skip.too_large": 7430, "skip.unique": 8808462, "total": 9481815 } ``` # Use cases * [ ] take a release entity database dump as JSON lines and cluster releases (according to various algorithms) * [ ] take cluster information and run a verification step (misc algorithms) * [ ] create a dataset that contains grouping of releases under works * [ ] command line tools to generate cache keys, e.g. to match reference strings to release titles (this needs some transparent setup, e.g. filling of a cache before ops) # Usage Release clusters start with release entities json lines. ```shell $ cat data/sample.json | python -m fuzzycat cluster -t title > out.json ``` Clustering 1M records (single core) takes about 64s (15K docs/s). ```shell $ head -1 out.json { "k": "裏表紙", "v": [ ... ] } ``` Using GNU parallel to make it faster. ``` $ cat data/sample.json | parallel -j 8 --pipe --roundrobin python -m fuzzycat.main cluster -t title ``` Interestingly, the parallel variants detects fewer clusters (because data is split and clusters are searched within each batch). TODO(miku): sort out sharding bug. ## QA ### 10M release dataset Notes on cadd28a version clustering (nysiis) and verification. * 10M docs * 9040789 groups * 665447 verification pairs ``` 176 Miss.APPENDIX 25 Miss.ARXIV_VERSION 12082 Miss.BLACKLISTED 5 Miss.BLACKLISTED_FRAGMENT 46733 Miss.BOOK_CHAPTER 1567 Miss.COMPONENT 47691 Miss.CONTRIB_INTERSECTION_EMPTY 30806 Miss.DATASET_DOI 1 Miss.NUM_DIFF 157718 Miss.RELEASE_TYPE 16263 Miss.SHORT_TITLE 6013 Miss.SUBTITLE 57 Miss.TITLE_FILENAME 148755 Miss.YEAR 93 OK.ARXIV_VERSION 88294 OK.DUMMY 110 OK.PREPRINT_PUBLISHED 15818 OK.SLUG_TITLE_AUTHOR_MATCH 93240 OK.TITLE_AUTHOR_MATCH ``` Cases * common title, "Books by Our Readers", https://fatcat.wiki/release/4uv5jsy5vnhdvnxvzmucqlksvq, https://fatcat.wiki/release/4uv5jsy5vnhdvnxvzmucqlksvq * common title, "The Future of Imprisonment" * same title "IEEE Transactions on Wireless Communications", same publisher, different year * same, except DOI, but maybe the same item, after all? https://fatcat.wiki/release/kxgsbh66v5bwhobcaiuh4i7dwy, https://fatcat.wiki/release/thl7o44z3jgk3njdypixwrdbve Possible improvements: * when title and authors match, check the year, and maybe the doi prefix; doi with the same prefix may not be duplicates