diff options
author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:34:02 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:34:02 -0800 |
commit | c32154f2875a7fb9aac727013e1475cdd811e180 (patch) | |
tree | f0e061498a101fa824995fb6ec9f91e7e44257e1 /notes/bulk_edits/2020-10-08_chocula.md | |
parent | c5ea2dba358624f4c14da0a1a988ae14d0edfd59 (diff) | |
download | fatcat-c32154f2875a7fb9aac727013e1475cdd811e180.tar.gz fatcat-c32154f2875a7fb9aac727013e1475cdd811e180.zip |
move notes/bulk_edits/ to extra/bulk_edits/
Diffstat (limited to 'notes/bulk_edits/2020-10-08_chocula.md')
-rw-r--r-- | notes/bulk_edits/2020-10-08_chocula.md | 44 |
1 files changed, 0 insertions, 44 deletions
diff --git a/notes/bulk_edits/2020-10-08_chocula.md b/notes/bulk_edits/2020-10-08_chocula.md deleted file mode 100644 index d60b6842..00000000 --- a/notes/bulk_edits/2020-10-08_chocula.md +++ /dev/null @@ -1,44 +0,0 @@ - -Another update of journal metadata. In this case due to expanding "Keepers" -coverage to PKP PLN, Hathitrust, Scholar's Portal, and Carniniana. - -Using `journal-metadata-bot` and `chocula.2020-10-08.json` export. - -## QA Testing - - shuf -n1000 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates - - Counter({'total': 1000, 'exists': 640, 'exists-skip-update': 532, 'update': 348, 'exists-not-found': 108, 'insert': 12, 'skip': 0}) - -Expecting roughly a 1/3 update rate. Most of these seem to be true updates (eg, -adding kbart metadata). A smaller fraction are just updating DOAJ timestamp or -not updating any metadata at all. - - head -n500 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates - - Counter({'total': 500, 'exists': 372, 'exists-skip-update': 328, 'update': 121, 'exists-not-found': 44, 'insert': 7, 'skip': 0}) - - head -n500 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates - - Counter({'total': 500, 'exists': 481, 'exists-skip-update': 430, 'exists-not-found': 44, 'update': 19, 'exists-by-issnl': 7, 'skip': 0, 'insert': 0}) - -Made some changes in `27fe31d5ffcac700c30b2b10d56685ef0fa4f3a8` which seem to -have removed the spurious null updates, while retaining DOAJ date-only updates. - -Also as a small nit notice that occasionally `kbart` metadata gets added with -no year spans. This seems to be common with cariniana. Presumably this happens -when there is no year span info available, only volumes. Seems like a valuable -thing to include as a flag anyways. - -## Prod Import - -Start small: - - head -n100 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates - - => Counter({'total': 100, 'exists': 69, 'exists-skip-update': 68, 'update': 30, 'insert': 1, 'exists-by-issnl': 1, 'skip': 0}) - -Full batch: - - time cat /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates - - => Counter({'total': 167092, 'exists': 110594, 'exists-skip-update': 109852, 'update': 55274, 'insert': 1224, 'exists-by-issnl': 742, 'skip': 0}) - - real 10m45.714s - user 4m51.680s - sys 0m12.236s |