1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
|
Periodic update of journal metadata using Chocula. This is just after doing
container merges, after some schema updates (`issne` and `issnp` as top-level
fields), and prior to doing broad re-indexing.
Using `journal-metadata-bot` and `chocula.2021-11-30.json` export.
## QA Testing
git log | head -n1
# commit 1177dafb9b185c7b749ff95ded1a0720792fbb5e
export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
head -n50 /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.sample.json | ./fatcat_import.py chocula --do-updates -
# Counter({'total': 50, 'exists': 28, 'exists-skip-update': 27, 'update': 22, 'exists-not-found': 1, 'skip': 0, 'insert': 0})
Run exact same to ensure no new updates:
# Counter({'total': 50, 'exists': 50, 'exists-skip-update': 49, 'exists-not-found': 1, 'skip': 0, 'insert': 0, 'update': 0})
Try larger batch:
tail -n200 /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.sample.json | ./fatcat_import.py chocula --do-updates -
# ValueError: Invalid value for `issnp`, length must be greater than or equal to `9`
Patched this issue. Try the full run:
cat /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.sample.json | ./fatcat_import.py chocula --do-updates -
# Counter({'total': 1000, 'exists': 683, 'exists-skip-update': 665, 'update': 308, 'exists-not-found': 16, 'insert': 9, 'exists-by-issnl': 2, 'skip': 0})
Ok, expecting about 1/3 of containers to be updated in full prod run.
## Prod Run
git log | head -n1
# commit b1efd59c2cad275d126a1bde67c11430d71878db
export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
head -n100 /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.json | ./fatcat_import.py chocula --do-updates -
# Counter({'total': 100, 'exists': 53, 'exists-skip-update': 53, 'update': 47, 'skip': 0, 'insert': 0})
Diffs look good; lots of preservation coverage updates. Do full run:
time cat /srv/fatcat/datasets/chocula_fatcat_export.2021-11-30.json | ./fatcat_import.py chocula --do-updates -
# Invalid value for `issnp`, must be a follow pattern or equal to `/\d{4}-\d{3}[0-9X]/
Edited the relevant entity by hand, then worked around some other ISSN
formatting issues with a commit. Then re-started the import. Stats will not be
accurate because of the restarts.
Counter({'total': 180608, 'exists': 131755, 'exists-skip-update': 131055, 'update': 47505, 'insert': 1348, 'exists-by-issnl': 700, 'skip': 0})
real 12m24.781s
user 5m8.039s
sys 0m15.353s
|