aboutsummaryrefslogtreecommitdiffstats
path: root/notes/bulk_edits/2020-10-08_chocula.md
blob: d60b6842de5b60f53c77b26cbddac344eb7f639d (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44

Another update of journal metadata. In this case due to expanding "Keepers"
coverage to PKP PLN, Hathitrust, Scholar's Portal, and Carniniana.

Using `journal-metadata-bot` and `chocula.2020-10-08.json` export.

## QA Testing

    shuf -n1000 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    Counter({'total': 1000, 'exists': 640, 'exists-skip-update': 532, 'update': 348, 'exists-not-found': 108, 'insert': 12, 'skip': 0})

Expecting roughly a 1/3 update rate. Most of these seem to be true updates (eg,
adding kbart metadata). A smaller fraction are just updating DOAJ timestamp or
not updating any metadata at all.

    head -n500 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    Counter({'total': 500, 'exists': 372, 'exists-skip-update': 328, 'update': 121, 'exists-not-found': 44, 'insert': 7, 'skip': 0})

    head -n500 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    Counter({'total': 500, 'exists': 481, 'exists-skip-update': 430, 'exists-not-found': 44, 'update': 19, 'exists-by-issnl': 7, 'skip': 0, 'insert': 0})

Made some changes in `27fe31d5ffcac700c30b2b10d56685ef0fa4f3a8` which seem to
have removed the spurious null updates, while retaining DOAJ date-only updates.

Also as a small nit notice that occasionally `kbart` metadata gets added with
no year spans. This seems to be common with cariniana. Presumably this happens
when there is no year span info available, only volumes. Seems like a valuable
thing to include as a flag anyways.

## Prod Import

Start small:

    head -n100 /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    => Counter({'total': 100, 'exists': 69, 'exists-skip-update': 68, 'update': 30, 'insert': 1, 'exists-by-issnl': 1, 'skip': 0})

Full batch:

    time cat /srv/fatcat/datasets/chocula.2020-10-08.json | ./fatcat_import.py chocula --do-updates -
    => Counter({'total': 167092, 'exists': 110594, 'exists-skip-update': 109852, 'update': 55274, 'insert': 1224, 'exists-by-issnl': 742, 'skip': 0})

    real    10m45.714s
    user    4m51.680s
    sys     0m12.236s