diff options
author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-30 17:38:25 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-30 17:38:25 -0800 |
commit | 3cd594f951728165556849bdff1c35ce1d7daab1 (patch) | |
tree | 36b2a48b01ff59e76a515b40c1540a44773d47c6 /extra | |
parent | 3a8226064af01bd59575bb44de0301eefe36a235 (diff) | |
download | fatcat-3cd594f951728165556849bdff1c35ce1d7daab1.tar.gz fatcat-3cd594f951728165556849bdff1c35ce1d7daab1.zip |
chocula update notes
Diffstat (limited to 'extra')
-rw-r--r-- | extra/bulk_edits/2021-11-30_chocula.md | 61 |
1 files changed, 61 insertions, 0 deletions
diff --git a/extra/bulk_edits/2021-11-30_chocula.md b/extra/bulk_edits/2021-11-30_chocula.md new file mode 100644 index 00000000..299ce077 --- /dev/null +++ b/extra/bulk_edits/2021-11-30_chocula.md @@ -0,0 +1,61 @@ + +Periodic update of journal metadata using Chocula. This is just after doing +container merges, after some schema updates (`issne` and `issnp` as top-level +fields), and prior to doing broad re-indexing. + +Using `journal-metadata-bot` and `chocula.2021-11-30.json` export. + + +## QA Testing + + git log | head -n1 + # commit 1177dafb9b185c7b749ff95ded1a0720792fbb5e + + export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...] + + head -n50 /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.sample.json | ./fatcat_import.py chocula --do-updates - + # Counter({'total': 50, 'exists': 28, 'exists-skip-update': 27, 'update': 22, 'exists-not-found': 1, 'skip': 0, 'insert': 0}) + +Run exact same to ensure no new updates: + + # Counter({'total': 50, 'exists': 50, 'exists-skip-update': 49, 'exists-not-found': 1, 'skip': 0, 'insert': 0, 'update': 0}) + +Try larger batch: + + tail -n200 /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.sample.json | ./fatcat_import.py chocula --do-updates - + + # ValueError: Invalid value for `issnp`, length must be greater than or equal to `9` + +Patched this issue. Try the full run: + + cat /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.sample.json | ./fatcat_import.py chocula --do-updates - + # Counter({'total': 1000, 'exists': 683, 'exists-skip-update': 665, 'update': 308, 'exists-not-found': 16, 'insert': 9, 'exists-by-issnl': 2, 'skip': 0}) + +Ok, expecting about 1/3 of containers to be updated in full prod run. + +## Prod Run + + git log | head -n1 + # commit b1efd59c2cad275d126a1bde67c11430d71878db + + export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...] + + head -n100 /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.json | ./fatcat_import.py chocula --do-updates - + # Counter({'total': 100, 'exists': 53, 'exists-skip-update': 53, 'update': 47, 'skip': 0, 'insert': 0}) + +Diffs look good; lots of preservation coverage updates. Do full run: + + time cat /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.json | ./fatcat_import.py chocula --do-updates - + + # Invalid value for `issnp`, must be a follow pattern or equal to `/\d{4}-\d{3}[0-9X]/ + +Edited the relevant entity by hand, then worked around some other ISSN +formatting issues with a commit. Then re-started the import. Stats will not be +accurate because of the restarts. + + Counter({'total': 180608, 'exists': 131755, 'exists-skip-update': 131055, 'update': 47505, 'insert': 1348, 'exists-by-issnl': 700, 'skip': 0}) + + real 12m24.781s + user 5m8.039s + sys 0m15.353s + |