notes/bulk_edits/2020-12-14_doaj.md

## Earlier QA Testing (November 2020)

    export FATCAT_API_AUTH_TOKEN=... (FATCAT_AUTH_WORKER_DOAJ)

    # small test:
    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | head | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -

    # full run
    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -

    before: 519.17G
    after:  542.08G

    5.45M 6:29:17 [ 233 /s]

    12x of:
    Counter({'total': 455504, 'insert': 394437, 'exists': 60615, 'skip': 452, 'skip-title': 452, 'update': 0})

    total:  ~5,466,048
    insert: ~4,733,244 
    exists:   ~727,380
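The approximate totals above are just the 12 per-worker counters summed; a quick sketch of that arithmetic (each worker reported roughly the same counts, so the representative counter is repeated 12x here):

```python
from collections import Counter

# Representative per-worker import stats (counts copied from the log above);
# the real run had 12 workers with slightly varying counts.
worker_counts = [Counter({'total': 455504, 'insert': 394437, 'exists': 60615,
                          'skip': 452, 'skip-title': 452, 'update': 0})] * 12

# Counter addition merges per-key; sum() with an empty Counter start value
# aggregates all workers into one combined tally.
combined = sum(worker_counts, Counter())
print(combined['total'], combined['insert'], combined['exists'])
# 5466048 4733244 727380
```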

Initial imports (before crash) were like:

    Counter({'total': 9339, 'insert': 9330, 'skip': 9, 'skip-title': 9, 'update': 0, 'exists': 0})

Seems like there is a bug: existing releases are not being found by DOI?
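The check the importer should be making is roughly this (a sketch, not the actual importer code; `lookup_doi` stands in for the fatcat release-by-DOI lookup):

```python
def should_insert(record, lookup_doi):
    """Decide insert vs. exists: if the record carries a DOI and a release
    with that (normalized) DOI already exists, it should count as 'exists'
    rather than being inserted as a duplicate."""
    doi = (record.get('doi') or '').strip().lower()
    if doi and lookup_doi(doi) is not None:
        return False  # existing release found by DOI -> 'exists'
    return True       # no DOI, or no match -> candidate for insert

# Toy stand-in for the lookup: a dict of known DOIs (hypothetical values)
existing = {'10.1234/abc': 'release_xyz'}
print(should_insert({'doi': '10.1234/ABC'}, existing.get))  # False: found by DOI
print(should_insert({'doi': '10.9999/new'}, existing.get))  # True
```

If the initial-import counters really showed `'exists': 0` on data that is mostly DOI-bearing, this normalization-plus-lookup step is the place to look.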

## Prod Container Metadata Update (chocula)

Generic update of container metadata using chocula pipeline. Need to run this
before DOAJ import to ensure we have all the containers already updated.

Also updating the ISSN-L index at the same time. Using a 2020-11-19 metadata
snapshot, which was generated on 2020-12-07; more recent snapshots had small
upstream format changes, so it wasn't trivial to run with a newer one.
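The `--issn-map-file` input used throughout is the issn.org ISSN-to-ISSN-L table; a minimal loader sketch, assuming the distributed format of tab-separated `ISSN<TAB>ISSN-L` pairs with a header line:

```python
def load_issn_map(lines):
    """Parse an ISSN-to-ISSN-L mapping into a dict.
    Format assumed: one 'ISSN<TAB>ISSN-L' pair per line, header line
    starting with 'ISSN', blank lines ignored."""
    issn_map = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('ISSN'):  # skip blanks and header
            continue
        issn, issnl = line.split('\t')
        issn_map[issn] = issnl
    return issn_map

# Tiny in-memory example (hypothetical ISSNs)
sample = ["ISSN\tISSN-L", "2222-1111\t1111-0000", "1111-0000\t1111-0000"]
m = load_issn_map(sample)
print(m['2222-1111'])  # 1111-0000
```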

    # git rev: 9f67c82ce8952bbe9a7a07b732830363c7865485

    # from laptop, then unzip on prod machine
    scp chocula_fatcat_export.2020-11-19.json.gz fatcat-prod1-vm:/srv/fatcat/datasets/

    # check ISSN-L symlink
    # ISSN-to-ISSN-L.txt -> 20201119.ISSN-to-ISSN-L.txt

    export FATCAT_AUTH_WORKER_JOURNAL_METADATA=...
    head -n200 /srv/fatcat/datasets/chocula_fatcat_export.2020-11-19.json | ./fatcat_import.py chocula -
    Counter({'total': 200, 'exists': 200, 'exists-by-issnl': 6, 'skip': 0, 'insert': 0, 'update': 0})

    head -n200 /srv/fatcat/datasets/chocula_fatcat_export.2020-11-19.json | ./fatcat_import.py chocula - --do-updates
    Counter({'total': 200, 'exists': 157, 'exists-skip-update': 151, 'update': 43, 'exists-by-issnl': 6, 'skip': 0, 'insert': 0})

Some of these are very minor updates, so starting with creation only (no
`--do-updates`).

    time ./fatcat_import.py chocula /srv/fatcat/datasets/chocula_fatcat_export.2020-11-19.json
    Counter({'total': 168165, 'exists': 167497, 'exists-by-issnl': 2371, 'insert': 668, 'skip': 0, 'update': 0})

    real    5m37.081s
    user    3m1.648s
    sys     0m9.488s

TODO: tweak chocula import script to not update on `extra.state` metadata.
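That TODO could look something like the following sketch: treat two container entities as equal if they differ only in `extra.state` (entity shape and field names assumed here, not taken from the actual importer):

```python
def needs_update(existing, incoming):
    """Decide whether a chocula record warrants a container update,
    ignoring changes that only touch `extra.state`."""
    def strip_state(entity):
        e = dict(entity)
        extra = dict(e.get('extra') or {})
        extra.pop('state', None)  # drop the noisy field before comparing
        e['extra'] = extra
        return e
    return strip_state(existing) != strip_state(incoming)

old = {'name': 'Some Journal', 'extra': {'state': 'active'}}
new = {'name': 'Some Journal', 'extra': {'state': 'ceased'}}
print(needs_update(old, new))  # False: only extra.state differs
```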


## Release Metadata Bulk Import

This is the first production bulk import of DOAJ metadata!

    # git rev: 9f67c82ce8952bbe9a7a07b732830363c7865485
    # DB before: Size:  678.15G

    # ensure fatcatd is updated to have support for DOAJ identifier

    # create new bot user
    ./target/release/fatcat-auth create-editor --admin --bot doaj-bot
    => mir5imb3v5ctxcaqnbstvmri2a

    ./target/release/fatcat-auth create-token mir5imb3v5ctxcaqnbstvmri2a
    => ...

    # download dataset
    wget https://archive.org/download/doaj_data_2020-11-13/doaj_article_data_2020-11-13.sample_10k.json.gz
    wget https://archive.org/download/doaj_data_2020-11-13/doaj_article_data_2020-11-13_all.json.gz

    export FATCAT_AUTH_WORKER_DOAJ=...

    # start small
    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13.sample_10k.json.gz | head -n100 | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
    => Counter({'total': 100, 'exists': 70, 'insert': 30, 'skip': 0, 'update': 0})

That is about the expected fraction without a DOI. However, 6 out of 10
randomly checked inserted releases seem to be dupes, which feels too high. So,
pausing this import until basic fuzzy matching is ready from Martin's fuzzycat
work, and will check against elasticsearch before import. Plan: shuffle the
entire file, import in a single thread, and simply skip importing on any fuzzy
match (no merge/update attempt). Expecting about 500k new releases after such
filtering.

    # full run (TODO)
    zcat /srv/fatcat/datasets/doaj_article_data_2020-11-13_all.json.gz | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
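The planned skip-on-fuzzy-match filter amounts to this shape (a sketch; `fuzzy_match` stands in for the fuzzycat/elasticsearch check, which did not exist yet at the time of these notes):

```python
def filter_new_releases(records, fuzzy_match):
    """Keep a record only when the fuzzy matcher finds no existing release;
    matched records are skipped outright, never merged or updated."""
    kept, skipped = [], 0
    for rec in records:
        if fuzzy_match(rec):
            skipped += 1
        else:
            kept.append(rec)
    return kept, skipped

# Toy matcher: case-insensitive exact-title lookup (real matching would be
# fuzzier, via elasticsearch)
known_titles = {'an existing article'}
records = [{'title': 'An Existing Article'}, {'title': 'A Genuinely New One'}]
kept, skipped = filter_new_releases(
    records, lambda r: r['title'].lower() in known_titles)
print(len(kept), skipped)  # 1 1
```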