diff options
author | Martin Czygan <martin.czygan@gmail.com> | 2021-07-09 13:26:35 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-07-09 13:26:35 +0200 |
commit | 002764b5b1f8f27bd8ae42d33b2a6f42a2a4b7a1 (patch) | |
tree | 38b7aa860202812f951ee5dc86da3a79200258ff /notes/old_pipeline.md | |
parent | f9ef1c989b4f85c81ac5f24b08f0d636636e7a4b (diff) | |
parent | e05f4c4973fc3573d3707d4d90779fad094ced6f (diff) | |
download | fuzzycat-002764b5b1f8f27bd8ae42d33b2a6f42a2a4b7a1.tar.gz fuzzycat-002764b5b1f8f27bd8ae42d33b2a6f42a2a4b7a1.zip |
Merge branch 'master' of git.archive.org:webgroup/fuzzycat
* 'master' of git.archive.org:webgroup/fuzzycat:
simplify README for general audience; move some content to notes
sandcrawler slugify: lower-case greek ambiguity (OCR)
DOI clean/normalize helper; and use in verification etc
verify: page count parsing and comparison improvements
Diffstat (limited to 'notes/old_pipeline.md')
-rw-r--r-- | notes/old_pipeline.md | 177 |
1 files changed, 177 insertions, 0 deletions
diff --git a/notes/old_pipeline.md b/notes/old_pipeline.md new file mode 100644 index 0000000..2f84d66 --- /dev/null +++ b/notes/old_pipeline.md @@ -0,0 +1,177 @@ + +## Performance + +For development, we worked on a `release_export_expanded.json` dump (113G/700G zstd/plain, 154,203,375 lines) and with the [fatcat API](https://api.fatcat.wiki/). + + +### Clustering + +Clustering derives sets of similar documents from a [fatcat database release +dump](https://archive.org/details/fatcat_snapshots_and_exports?&sort=-publicdate). + + +Example running clustering: + +``` +$ python -m fuzzycat cluster -t tsandcrawler < data/re.json | zstd -c -T0 > cluster.json.zst +``` + +Clustering works in a three step process: + +1. key extraction for each document (choose algorithm) +2. sorting by keys (via [GNU sort](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html)) +3. group by key and write out ([itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby)) + +Note: For long running processes, this all-or-nothing approach is impractical; +e.g. running clustering on the joint references and fatcat dataset (2B records) +takes 24h+. + +Ideas: + +* [ ] make (sorted) key extraction a fast standalone thing + +> `cat data.jsonl | fuzzycat-key --algo X > data.key.tsv` + +Where `data.key` group (id, key, blob) or the like. Make this line speed (maybe +w/ rust). Need to carry the blob, as we do not want to restrict options. + + +## Verification + +Run verification (pairwise *double-check* of match candidates in a cluster). + +``` +$ time zstdcat -T0 sample_cluster.json.zst | python -m fuzzycat verify > sample_verify.txt + +real 7m56.713s +user 8m50.703s +sys 0m29.262s +``` + +This is a one-pass operation. For processing 150M docs, we very much depend on +the documents being on disk in a file (we keep the complete document in the +clustering result). + +Example results: + +``` +3450874 Status.EXACT Reason.TITLE_AUTHOR_MATCH +2619990 Status.STRONG Reason.SLUG_TITLE_AUTHOR_MATCH +2487633 Status.DIFFERENT Reason.YEAR +2434532 Status.EXACT Reason.WORK_ID +2085006 Status.DIFFERENT Reason.CONTRIB_INTERSECTION_EMPTY +1397420 Status.DIFFERENT Reason.SHARED_DOI_PREFIX +1355852 Status.DIFFERENT Reason.RELEASE_TYPE +1290162 Status.AMBIGUOUS Reason.DUMMY +1145511 Status.DIFFERENT Reason.BOOK_CHAPTER +1009657 Status.DIFFERENT Reason.DATASET_DOI + 996503 Status.STRONG Reason.PMID_DOI_PAIR + 868951 Status.EXACT Reason.DATACITE_VERSION + 796216 Status.STRONG Reason.DATACITE_RELATED_ID + 704154 Status.STRONG Reason.FIGSHARE_VERSION + 534963 Status.STRONG Reason.VERSIONED_DOI + 343310 Status.STRONG Reason.TOKENIZED_AUTHORS + 334974 Status.STRONG Reason.JACCARD_AUTHORS + 293835 Status.STRONG Reason.PREPRINT_PUBLISHED + 269366 Status.DIFFERENT Reason.COMPONENT + 263626 Status.DIFFERENT Reason.SUBTITLE + 224021 Status.AMBIGUOUS Reason.SHORT_TITLE + 152990 Status.DIFFERENT Reason.PAGE_COUNT + 133811 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_5860_CHOICE_REVIEW + 122600 Status.AMBIGUOUS Reason.CUSTOM_PREFIX_10_7916 + 79664 Status.STRONG Reason.CUSTOM_IEEE_ARXIV + 46649 Status.DIFFERENT Reason.CUSTOM_PREFIX_10_14288 + 39797 Status.DIFFERENT Reason.JSTOR_ID + 38598 Status.STRONG Reason.CUSTOM_BSI_UNDATED + 18907 Status.STRONG Reason.CUSTOM_BSI_SUBDOC + 15465 Status.EXACT Reason.DOI + 13393 Status.DIFFERENT Reason.CUSTOM_IOP_MA_PATTERN + 10378 Status.DIFFERENT Reason.CONTAINER + 3081 Status.AMBIGUOUS Reason.BLACKLISTED + 2504 Status.AMBIGUOUS Reason.BLACKLISTED_FRAGMENT + 1273 Status.AMBIGUOUS Reason.APPENDIX + 1063 Status.DIFFERENT Reason.TITLE_FILENAME + 104 Status.DIFFERENT Reason.NUM_DIFF + 4 Status.STRONG Reason.ARXIV_VERSION +``` + +## A full run + +Single threaded, 42h. + +``` +$ time zstdcat -T0 release_export_expanded.json.zst | \ + TMPDIR=/bigger/tmp python -m fuzzycat cluster --tmpdir /bigger/tmp -t tsandcrawler | \ + zstd -c9 > cluster_tsandcrawler.json.zst +{ + "key_fail": 0, + "key_ok": 154202433, + "key_empty": 942, + "key_denylist": 0, + "num_clusters": 124321361 +} + +real 2559m7.880s +user 2605m41.347s +sys 118m38.141s +``` + +So, 29881072 (about 20%) docs in the potentially duplicated set. Verification (about 15h w/o parallel): + +``` +$ time zstdcat -T0 cluster_tsandcrawler.json.zst | python -m fuzzycat verify | \ + zstd -c9 > cluster_tsandcrawler_verified_3c7378.tsv.zst + +... + +real 927m28.631s +user 939m32.761s +sys 36m47.602s +``` + +---- + +# Misc + + +## Usage + +Release clusters start with release entities json lines. + +```shell +$ cat data/sample.json | python -m fuzzycat cluster -t title > out.json +``` + +Clustering 1M records (single core) takes about 64s (15K docs/s). + +```shell +$ head -1 out.json +{ + "k": "裏表紙", + "v": [ + ... + ] +} +``` + +Using GNU parallel to make it faster. + +``` +$ cat data/sample.json | parallel -j 8 --pipe --roundrobin python -m fuzzycat.main cluster -t title +``` + +Interestingly, the parallel variants detects fewer clusters (because data is +split and clusters are searched within each batch). TODO(miku): sort out sharding bug. + +# Notes on Refs + +* technique from fuzzycat ported in parts to [skate](https://github.com/miku/skate) - to go from refs and release dataset to a number of clusters, relating references to releases +* need to verify, but not the references against each other, only refs againt the release + +# Notes on Performance + +While running bulk (1B+) clustering and verification, even with parallel, +fuzzycat got slow. The citation graph project therefore contains a +reimplementation of `fuzzycat.verify` and related functions in Go, which in +this case is an order of magnitude faster. See: +[skate](https://git.archive.org/martin/cgraph/-/tree/master/skate). |