path: root/notes/bulk_edits/2021-05-28_doaj.md

Note: running on 2021-05-28 Pacific time, which is already 2021-05-29 UTC.

First downloaded bulk metadata from doaj.org and uploaded it to an archive.org
item as a snapshot.

## Journal Metadata Import

Before doing the article import, want to ensure all journals exist.

Use the chocula pipeline. To keep things simple and conservative, don't update
any existing containers; only create those that don't already exist.

Run the usual chocula source update, copy to data dir, update sources.toml,
etc. Didn't bother with updating container counts or homepage status. Then
export container schema:

    python -m chocula export_fatcat | gzip > chocula_fatcat_export.2021-05-28.json.gz

Upload to fatcat prod machine, unzip, and then import to prod:

    export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
    ./fatcat_import.py chocula /srv/fatcat/datasets/chocula_fatcat_export.2021-05-28.json
    => Counter({'total': 175837, 'exists': 170358, 'insert': 5479, 'exists-by-issnl': 5232, 'skip': 0, 'update': 0})

That is a healthy batch of new records!

## Article Import

Transform all the articles into a single JSON file:

    cat doaj_article_data_*/article_batch*.json | jq .[] -c | pv -l | gzip > doaj_article_data_2021-05-25_all.json.gz
    => 6.1M 0:18:45 [5.42k/s]

    zcat doaj_article_data_2021-05-25_all.json.gz | shuf -n10000 | gzip > doaj_article_data_2021-05-25_sample_10k.json.gz

Also uploaded this `_all` file to the archive.org item.
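
For reference, a minimal Python sketch of the same flattening step as the
`jq .[] -c` pipeline above. It assumes each `article_batch*.json` file holds a
single JSON array of article records, which is what `.[]` implies; the function
name `flatten_batches` is made up for illustration and is not part of fatcat.

```python
import glob
import gzip
import json


def flatten_batches(pattern, out_path):
    """Write every element of each JSON-array batch file as one compact line.

    Returns the number of records written.
    """
    count = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        # sorted() mirrors the ordering a shell glob would produce
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as f:
                batch = json.load(f)  # assumed: one JSON array per file
            for record in batch:
                line = json.dumps(record, separators=(",", ":"), ensure_ascii=False)
                out.write(line + "\n")
                count += 1
    return count
```

Unlike the streaming shell pipeline, this loads one whole batch file into
memory at a time, so it only makes sense if individual batch files are small.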

Ready to import! Start with sample:

    export FATCAT_AUTH_WORKER_DOAJ=...
    zcat /srv/fatcat/tasks/202105_doaj/doaj_article_data_2021-05-25_sample_10k.json.gz | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
    => Counter({'total': 10000, 'exists': 8743, 'exists-fuzzy': 1044, 'insert': 197, 'skip': 14, 'skip-title': 14, 'skip-doaj-id-mismatch': 2, 'update': 0})
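
A quick sanity check on the sample counts, as a sketch. It assumes that
`skip-title` is a sub-reason already included in `skip`, while
`skip-doaj-id-mismatch` is tallied separately. That reading is an inference
from the numbers adding up, not something the importer output states.

```python
# Counts copied from the sample run above
sample = {
    "total": 10000,
    "exists": 8743,
    "exists-fuzzy": 1044,
    "insert": 197,
    "skip": 14,
    "skip-title": 14,
    "skip-doaj-id-mismatch": 2,
    "update": 0,
}


def accounted_for(counts):
    """Sum the outcome buckets, treating skip-title as part of skip (assumption)."""
    keys = ["exists", "exists-fuzzy", "insert", "skip", "skip-doaj-id-mismatch", "update"]
    return sum(counts[k] for k in keys)


# Every input record lands in exactly one bucket under this reading
assert accounted_for(sample) == sample["total"]
```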

Then the full import, in parallel, shuffled (because we shuffled last time):

    zcat /srv/fatcat/tasks/202105_doaj/doaj_article_data_2021-05-25_all.json.gz | shuf | pv -l | parallel -j12 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
    => Counter({'total': 512351, 'exists': 449996, 'exists-fuzzy': 50858, 'insert': 10826, 'skip': 551, 'skip-title': 551, 'skip-doaj-id-mismatch': 120, 'update': 0})
    => the counts above are from one of the 12 workers; extrapolating (12 x 10,826), about 129,912 new release entities. 2.1% insert rate
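
The extrapolation can be reproduced as below. This is only a sketch: it
assumes the counter shown came from a single `parallel` worker and that
`--round-robin` spread the input roughly evenly across all 12, which is
consistent with 12 x 512,351 being close to the 6.1M lines counted earlier.

```python
WORKERS = 12  # from `parallel -j12`

# Counts copied from the single-worker output above
worker_counts = {
    "total": 512351,
    "exists": 449996,
    "exists-fuzzy": 50858,
    "insert": 10826,
    "skip": 551,
    "skip-title": 551,
    "skip-doaj-id-mismatch": 120,
    "update": 0,
}


def extrapolate(counts, workers):
    """Scale one worker's counts to the whole run (assumes an even split)."""
    return {key: value * workers for key, value in counts.items()}


estimated = extrapolate(worker_counts, WORKERS)
insert_rate = worker_counts["insert"] / worker_counts["total"]

print(estimated["insert"])    # estimated new release entities
print(f"{insert_rate:.1%}")   # share of records that were inserted
```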

NOTE: the import emitted a large number of warnings like:

    UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=5d2ebb760ad24ce68ec8079bc82c8d78 ident=dvtl2xpn4nespfrj6gad6mrk44
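
To follow up on these later, the DOAJ id and release ident can be pulled out
of captured log output. A minimal sketch, assuming the warnings were saved
somewhere as text and all follow the exact format of the line above; the
function name `parse_warnings` is made up for illustration.

```python
import re

# Matches the warning format shown above; the id patterns are guesses
# based on that single example (hex DOAJ id, lowercase base32-style ident).
WARNING_RE = re.compile(
    r"unexpected DOAJ ext_id match after lookup failed "
    r"doaj=(?P<doaj>[0-9a-f]+) ident=(?P<ident>[a-z0-9]+)"
)


def parse_warnings(lines):
    """Return (doaj_id, release_ident) pairs for every matching log line."""
    pairs = []
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            pairs.append((match.group("doaj"), match.group("ident")))
    return pairs
```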

## Extra Article Imports

Manually disabled fuzzy matching with patch:

    diff --git a/python/fatcat_import.py b/python/fatcat_import.py
    index 1dcfec2..cb787cb 100755
    --- a/python/fatcat_import.py
    +++ b/python/fatcat_import.py
    @@ -260,6 +260,7 @@ def run_doaj_article(args):
             args.issn_map_file,
             edit_batch_size=args.batch_size,
             do_updates=args.do_updates,
    +        do_fuzzy_match=False,
         )
         if args.kafka_mode:
             KafkaJsonPusher(

Extracted a specific subset of articles (lines matching ISSN 1665-1596):

    zcat doaj_article_data_2021-05-25_all.json.gz | rg 1665-1596 | pv -l > doaj_article_data_2021-05-25_voces.json
    => 154  0:02:05 [1.22 /s]
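
Note that `rg` matches the ISSN as a substring anywhere in the line. A stricter
variant, sketched below, parses each record and keeps it only if some string
value equals the ISSN exactly. It makes no assumption about which field holds
the ISSN; the helper names are made up for illustration.

```python
import json


def contains_value(node, target):
    """Recursively check whether any value in a parsed JSON tree equals target."""
    if isinstance(node, dict):
        return any(contains_value(value, target) for value in node.values())
    if isinstance(node, list):
        return any(contains_value(item, target) for item in node)
    return node == target


def filter_by_issn(lines, issn):
    """Yield the JSON lines whose record contains issn as an exact value."""
    for line in lines:
        if contains_value(json.loads(line), issn):
            yield line
```

This is slower than `rg` since every line is parsed, but it avoids picking up
records that merely mention the same digits inside a longer string.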

Then imported that subset (with the patch above applied):

    cat /srv/fatcat/tasks/202105_doaj/doaj_article_data_2021-05-25_voces.json | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -