summaryrefslogtreecommitdiffstats
path: root/extra/bulk_edits
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2021-11-30 17:38:25 -0800
committerBryan Newbold <bnewbold@robocracy.org>2021-11-30 17:38:25 -0800
commit3cd594f951728165556849bdff1c35ce1d7daab1 (patch)
tree36b2a48b01ff59e76a515b40c1540a44773d47c6 /extra/bulk_edits
parent3a8226064af01bd59575bb44de0301eefe36a235 (diff)
downloadfatcat-3cd594f951728165556849bdff1c35ce1d7daab1.tar.gz
fatcat-3cd594f951728165556849bdff1c35ce1d7daab1.zip
chocula update notes
Diffstat (limited to 'extra/bulk_edits')
-rw-r--r--extra/bulk_edits/2021-11-30_chocula.md61
1 files changed, 61 insertions, 0 deletions
diff --git a/extra/bulk_edits/2021-11-30_chocula.md b/extra/bulk_edits/2021-11-30_chocula.md
new file mode 100644
index 00000000..299ce077
--- /dev/null
+++ b/extra/bulk_edits/2021-11-30_chocula.md
@@ -0,0 +1,61 @@
+
+Periodic update of journal metadata using Chocula. This is just after doing
+container merges, after some schema updates (`issne` and `issnp` as top-level
+fields), and prior to doing broad re-indexing.
+
+Using `journal-metadata-bot` and `chocula.2021-11-30.json` export.
+
+
+## QA Testing
+
+ git log | head -n1
+ # commit 1177dafb9b185c7b749ff95ded1a0720792fbb5e
+
+ export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
+
+ head -n50 /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.sample.json | ./fatcat_import.py chocula --do-updates -
+ # Counter({'total': 50, 'exists': 28, 'exists-skip-update': 27, 'update': 22, 'exists-not-found': 1, 'skip': 0, 'insert': 0})
+
+Run exact same to ensure no new updates:
+
+ # Counter({'total': 50, 'exists': 50, 'exists-skip-update': 49, 'exists-not-found': 1, 'skip': 0, 'insert': 0, 'update': 0})
+
+Try larger batch:
+
+ tail -n200 /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.sample.json | ./fatcat_import.py chocula --do-updates -
+
+ # ValueError: Invalid value for `issnp`, length must be greater than or equal to `9`
+
+Patched this issue. Try the full run:
+
+ cat /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.sample.json | ./fatcat_import.py chocula --do-updates -
+ # Counter({'total': 1000, 'exists': 683, 'exists-skip-update': 665, 'update': 308, 'exists-not-found': 16, 'insert': 9, 'exists-by-issnl': 2, 'skip': 0})
+
+Ok, expecting about 1/3 of containers to be updated in full prod run.
+
+## Prod Run
+
+ git log | head -n1
+ # commit b1efd59c2cad275d126a1bde67c11430d71878db
+
+ export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
+
+ head -n100 /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.json | ./fatcat_import.py chocula --do-updates -
+ # Counter({'total': 100, 'exists': 53, 'exists-skip-update': 53, 'update': 47, 'skip': 0, 'insert': 0})
+
+Diffs look good; lots of preservation coverage updates. Do full run:
+
+ time cat /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.json | ./fatcat_import.py chocula --do-updates -
+
+ # Invalid value for `issnp`, must be a follow pattern or equal to `/\d{4}-\d{3}[0-9X]/
+
+Edited the relevant entity by hand, then worked around some other ISSN
+formatting issues with a commit. Then re-started the import. Stats will not be
+accurate because of the restarts.
+
+ Counter({'total': 180608, 'exists': 131755, 'exists-skip-update': 131055, 'update': 47505, 'insert': 1348, 'exists-by-issnl': 700, 'skip': 0})
+
+ real 12m24.781s
+ user 5m8.039s
+ sys 0m15.353s
+