## Performance

For development, we worked on a `release_export_expanded.json` dump (113G zstd, 700G plain; 154,203,375 lines) and with the [fatcat API](https://api.fatcat.wiki/).

### Clustering

Clustering derives sets of similar documents from a [fatcat database release dump](https://archive.org/details/fatcat_snapshots_and_exports?&sort=-publicdate).

An example clustering run:

```
$ python -m fuzzycat cluster -t tsandcrawler < data/re.json | zstd -c -T0 > cluster.json.zst
```

Clustering works as a three-step process (a condensed sketch follows the list):

1. key extraction for each document (choose an algorithm)
2. sorting by keys (via [GNU sort](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html))
3. grouping by key and writing out ([itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby))
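To make the flow concrete, here is a minimal sketch of the three steps in plain Python. The `key` function is a hypothetical stand-in (a normalized title) and assumes keys contain no tabs or newlines; the real key algorithms, such as `tsandcrawler`, are more involved.

```python
# Sketch of the cluster pipeline: key extraction, external sort, groupby.
import fileinput
import itertools
import json
import operator
import subprocess
import tempfile

def key(doc):
    # Hypothetical stand-in: a normalized title; real algorithms do more.
    return doc.get("title", "").strip().lower()

# 1. Key extraction: write (key, original line) pairs, tab-separated.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tf:
    for line in fileinput.input():
        doc = json.loads(line)
        tf.write("%s\t%s\n" % (key(doc), line.strip()))

# 2. Sort by key via GNU sort, which handles data larger than memory.
subprocess.run(["sort", "-k", "1,1", "-o", tf.name, tf.name], check=True)

# 3. Group by key and write out one cluster per line.
with open(tf.name) as f:
    rows = (line.rstrip("\n").split("\t", 1) for line in f)
    for k, group in itertools.groupby(rows, key=operator.itemgetter(0)):
        if not k:  # empty keys are not comparable
            continue
        docs = [json.loads(blob) for _, blob in group]
        if len(docs) > 1:
            print(json.dumps({"k": k, "v": docs}, ensure_ascii=False))
```

Delegating the sort to GNU sort keeps memory usage flat; key extraction and grouping are pure streaming operations.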
Note: For long-running processes, this all-or-nothing approach is impractical; e.g. running clustering on the joint references and fatcat dataset (2B records) takes 24h+. Ideas:

* [ ] make (sorted) key extraction a fast standalone thing

> `cat data.jsonl | fuzzycat-key --algo X > data.key.tsv`

Here, `data.key.tsv` would carry (id, key, blob) or the like. Make this run at line speed (maybe w/ Rust). We need to carry the blob, as we do not want to restrict options.

## Verification

Run verification (a pairwise *double-check* of match candidates in a cluster):

```
$ time zstdcat -T0 sample_cluster.json.zst | python -m fuzzycat verify > sample_verify.txt

real    7m56.713s
user    8m50.703s
sys     0m29.262s
```

This is a one-pass operation. For processing 150M docs, we very much depend on the documents being on disk in a file, which is why we keep the complete document in the clustering result.

Example results:

```
3450874  Status.EXACT      Reason.TITLE_AUTHOR_MATCH
2619990  Status.STRONG     Reason.SLUG_TITLE_AUTHOR_MATCH
2487633  Status.DIFFERENT  Reason.YEAR
2434532  Status.EXACT      Reason.WORK_ID
2085006  Status.DIFFERENT  Reason.CONTRIB_INTERSECTION_EMPTY
1397420  Status.DIFFERENT  Reason.SHARED_DOI_PREFIX
1355852  Status.DIFFERENT  Reason.RELEASE_TYPE
1290162  Status.AMBIGUOUS  Reason.DUMMY
1145511  Status.DIFFERENT  Reason.BOOK_CHAPTER
1009657  Status.DIFFERENT  Reason.DATASET_DOI
 996503  Status.STRONG     Reason.PMID_DOI_PAIR
 868951  Status.EXACT      Reason.DATACITE_VERSION
 796216  Status.STRONG     Reason.DATACITE_RELATED_ID
 704154  Status.STRONG     Reason.FIGSHARE_VERSION
 534963  Status.STRONG     Reason.VERSIONED_DOI
 343310  Status.STRONG     Reason.TOKENIZED_AUTHORS
 334974  Status.STRONG     Reason.JACCARD_AUTHORS
 293835  Status.STRONG     Reason.PREPRINT_PUBLISHED
 269366  Status.DIFFERENT  Reason.COMPONENT
 263626  Status.DIFFERENT  Reason.SUBTITLE
 224021  Status.AMBIGUOUS  Reason.SHORT_TITLE
 152990  Status.DIFFERENT  Reason.PAGE_COUNT
 133811  Status.AMBIGUOUS  Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW
 122600  Status.AMBIGUOUS  Reason.CUSTOM_PREFIX_10_7916
  79664  Status.STRONG     Reason.CUSTOM_IEEE_ARXIV
  46649  Status.DIFFERENT  Reason.CUSTOM_PREFIX_10_14288
  39797  Status.DIFFERENT  Reason.JSTOR_ID
  38598  Status.STRONG     Reason.CUSTOM_BSI_UNDATED
  18907  Status.STRONG     Reason.CUSTOM_BSI_SUBDOC
  15465  Status.EXACT      Reason.DOI
  13393  Status.DIFFERENT  Reason.CUSTOM_IOP_MA_PATTERN
  10378  Status.DIFFERENT  Reason.CONTAINER
   3081  Status.AMBIGUOUS  Reason.BLACKLISTED
   2504  Status.AMBIGUOUS  Reason.BLACKLISTED_FRAGMENT
   1273  Status.AMBIGUOUS  Reason.APPENDIX
   1063  Status.DIFFERENT  Reason.TITLE_FILENAME
    104  Status.DIFFERENT  Reason.NUM_DIFF
      4  Status.STRONG     Reason.ARXIV_VERSION
```

## A full run

Single-threaded, about 42h:

```
$ time zstdcat -T0 release_export_expanded.json.zst | \
    TMPDIR=/bigger/tmp python -m fuzzycat cluster --tmpdir /bigger/tmp -t tsandcrawler | \
    zstd -c9 > cluster_tsandcrawler.json.zst

{
  "key_fail": 0,
  "key_ok": 154202433,
  "key_empty": 942,
  "key_denylist": 0,
  "num_clusters": 124321361
}

real    2559m7.880s
user    2605m41.347s
sys     118m38.141s
```

So 29,881,072 docs (about 20%) end up in the potentially duplicated set (154,202,433 keys minus 124,321,361 clusters).

Verification (about 15h without parallelization):

```
$ time zstdcat -T0 cluster_tsandcrawler.json.zst | python -m fuzzycat verify | \
    zstd -c9 > cluster_tsandcrawler_verified_3c7378.tsv.zst
...

real    927m28.631s
user    939m32.761s
sys     36m47.602s
```

----

# Misc

## Usage

Release clustering starts with release entity JSON lines:

```shell
$ cat data/sample.json | python -m fuzzycat cluster -t title > out.json
```

Clustering 1M records takes about 64s on a single core (around 15K docs/s).

```shell
$ head -1 out.json
{
  "k": "裏表紙",
  "v": [ ... ]
}
```

Use GNU parallel to speed this up:

```
$ cat data/sample.json | parallel -j 8 --pipe --roundrobin python -m fuzzycat.main cluster -t title
```

Interestingly, the parallel variant detects fewer clusters, because the data is split into batches and clusters are only searched within each batch. TODO(miku): sort out this sharding bug.

# Notes on Refs

* techniques from fuzzycat were ported in part to [skate](https://github.com/miku/skate) - to go from the refs and release datasets to a number of clusters, relating references to releases
* we need to verify here as well, but not the references against each other, only the refs against the release

# Notes on Performance

While running bulk (1B+) clustering and verification, fuzzycat got slow, even with GNU parallel. The citation graph project therefore contains a reimplementation of `fuzzycat.verify` and related functions in Go, which in this case is an order of magnitude faster. See: [skate](https://git.archive.org/martin/cgraph/-/tree/master/skate).
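For reference, the operation being ported is conceptually small: a pairwise double-check within each cluster, as produced by the clustering step above. A minimal Python sketch, where `verify_pair` is a hypothetical stand-in for fuzzycat's actual rule set:

```python
# Sketch of pairwise verification over clusters ({"k": ..., "v": [...]}).
import itertools
import json
import sys

def verify_pair(a, b):
    # Hypothetical rules, echoing a few of the Status/Reason pairs above;
    # the real implementation checks many more signals.
    doi_a = (a.get("ext_ids") or {}).get("doi")
    doi_b = (b.get("ext_ids") or {}).get("doi")
    if doi_a and doi_a == doi_b:
        return "Status.EXACT", "Reason.DOI"
    if a.get("release_year") != b.get("release_year"):
        return "Status.DIFFERENT", "Reason.YEAR"
    return "Status.AMBIGUOUS", "Reason.DUMMY"

for line in sys.stdin:
    cluster = json.loads(line)
    # Double-check every candidate pair within a cluster.
    for a, b in itertools.combinations(cluster["v"], 2):
        status, reason = verify_pair(a, b)
        print("\t".join((a.get("ident", ""), b.get("ident", ""), status, reason)))
```

Each cluster is verified independently, so this loop parallelizes naturally over input lines.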