
Periodic import of chocula metadata updates.

In particular, expecting a bunch of `publisher_type` updates.

Explicitly not doing DOAJ-only updates this time around. That is, if a
container would have been updated anyway, the new DOAJ 'extra' metadata will
pass through, but an entity won't be updated for that reason alone. This is to
reduce churn based only on the `as-of` key. Should probably change the behavior
next time around.

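A minimal sketch of the policy described above, assuming a simplified model where each container's `extra` is a plain dict. The names `TRIGGER_KEYS`, `PASSTHROUGH_KEYS`, and `plan_update` are hypothetical and only illustrate the idea; this is not the importer's actual code.

```python
from typing import Optional

# Hypothetical names, for illustration only (not the importer's API).
TRIGGER_KEYS = ("kbart", "ia")   # a difference here forces an update
PASSTHROUGH_KEYS = ("doaj",)     # copied along, but never forces an update


def plan_update(new_extra: dict, existing_extra: dict) -> Optional[dict]:
    """Return the merged 'extra' dict if an update is warranted, else None."""
    do_update = any(
        new_extra.get(k) and new_extra[k] != existing_extra.get(k)
        for k in TRIGGER_KEYS
    )
    if not do_update:
        return None
    merged = dict(existing_extra)
    for k in TRIGGER_KEYS + PASSTHROUGH_KEYS:
        if new_extra.get(k):
            merged[k] = new_extra[k]
    return merged


existing = {"doaj": {"as-of": "2021-01-01"}, "ia": {"sim_count": 2}}

# Only the DOAJ 'as-of' changed: no update, so no churn.
doaj_only = {"doaj": {"as-of": "2022-07-01"}, "ia": {"sim_count": 2}}
assert plan_update(doaj_only, existing) is None

# 'ia' changed too: update happens and the new DOAJ metadata rides along.
ia_changed = {"doaj": {"as-of": "2022-07-01"}, "ia": {"sim_count": 3}}
assert plan_update(ia_changed, existing)["doaj"] == {"as-of": "2022-07-01"}
```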
## Prod Import

    date
    # Sat Jul 30 01:18:41 UTC 2022

    git log -n1
    # 5ecf72cbb488a9a50eb869ea55b4c2bfc1440731

    diff --git a/python/fatcat_tools/importers/chocula.py b/python/fatcat_tools/importers/chocula.py
    index 38802bcb..762c44dd 100644
    --- a/python/fatcat_tools/importers/chocula.py
    +++ b/python/fatcat_tools/importers/chocula.py
    @@ -139,7 +139,7 @@ class ChoculaImporter(EntityImporter):
             if ce.extra.get("publisher_type") and not ce.extra.get("publisher_type"):
                 # many older containers were missing this metadata
                 do_update = True
    -        for k in ("kbart", "ia", "doaj"):
    +        for k in ("kbart", "ia"):
                 # always update these fields if not equal (chocula override)
                 if ce.extra.get(k) and ce.extra[k] != existing.extra.get(k):
                     do_update = True

    export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
    shuf -n100 /srv/fatcat/datasets/chocula_fatcat_export.2022-07-30.json | ./fatcat_import.py chocula --do-updates -
    # Counter({'total': 100, 'exists': 98, 'exists-skip-update': 98, 'update': 2, 'skip': 0, 'insert': 0})

    shuf -n1000 /srv/fatcat/datasets/chocula_fatcat_export.2022-07-30.json | ./fatcat_import.py chocula --do-updates -
    # Counter({'total': 1000, 'exists': 986, 'exists-skip-update': 986, 'update': 12, 'insert': 2, 'skip': 0})

Huh, not seeing any `publisher_type` changes; I was expecting more of those.

    time cat /srv/fatcat/datasets/chocula_fatcat_export.2022-07-30.json | ./fatcat_import.py chocula --do-updates -
    # Counter({'total': 188506, 'exists': 185808, 'exists-skip-update': 185806, 'update': 2495, 'insert': 203, 'exists-by-issnl': 2, 'skip': 0})

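As a quick sanity check on the full-run counter above, the sketch below confirms that the per-outcome counts add up to `total` and computes the share of records that were updated. Which keys form a partition of outcomes is an assumption inferred from the numbers themselves (`exists` appears to overlap with `exists-skip-update` and `exists-by-issnl`), not something taken from the importer's documentation.

```python
from collections import Counter

# Counts copied from the full run above.
counts = Counter({
    "total": 188506, "exists": 185808, "exists-skip-update": 185806,
    "update": 2495, "insert": 203, "exists-by-issnl": 2, "skip": 0,
})

# Assumed partition of outcomes (inferred from the numbers, see note above).
outcomes = ("exists-skip-update", "update", "insert", "exists-by-issnl", "skip")
accounted = sum(counts[k] for k in outcomes)
update_share = counts["update"] / counts["total"]

print(accounted == counts["total"])
print(f"{update_share:.2%}")
```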
Looking through the changelog, some did go through with `publisher_type`
updates. Whew!