aboutsummaryrefslogtreecommitdiffstats
path: root/extra/bulk_edits/2022-07-19_doaj.md
blob: d25f2dda3b8f2a120437d4f18cd5c95eb83db9d3 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78

Doing a batch import of DOAJ articles. Will need to do another one of these
soon after setting up daily (OAI-PMH feed) ingest.

## Prep

    wget https://doaj.org/csv
    wget https://doaj.org/public-data-dump/journal
    wget https://doaj.org/public-data-dump/article

    mv csv journalcsv__doaj_20220719_2135_utf8.csv
    mv journal doaj_journal_data_2022-07-19.tar.gz
    mv article doaj_article_data_2022-07-19.tar.gz

    ia upload doaj_data_2022-07-19 -m collection:ia_biblio_metadata ../logo_cropped.jpg journalcsv__doaj_20220719_2135_utf8.csv doaj_journal_data_2022-07-19.tar.gz doaj_article_data_2022-07-19.tar.gz

    tar xvf doaj_journal_data_2022-07-19.tar.gz
    cat doaj_journal_data_*/journal_batch_*.json | jq .[] -c | pv -l | gzip > doaj_journal_data_2022-07-19_all.json.gz

    tar xvf doaj_article_data_2022-07-19.tar.gz
    cat doaj_article_data_*/article_batch*.json | jq .[] -c | pv -l | gzip > doaj_article_data_2022-07-19_all.json.gz

    ia upload doaj_data_2022-07-19 doaj_journal_data_2022-07-19_all.json.gz doaj_article_data_2022-07-19_all.json.gz

On fatcat machine:

    cd /srv/fatcat/datasets
    wget https://archive.org/download/doaj_data_2022-07-19/doaj_article_data_2022-07-19_all.json.gz

## Prod Article Import

    git rev: 582495f66e5e08b6e257360097807711e53008d4
    (includes DOAJ container-id required patch)

    date: Tue Jul 19 22:46:42 UTC 2022

    `doaj_id:*`: 1,335,195 hits

Start with sample:

    zcat /srv/fatcat/datasets/doaj_article_data_2022-07-19_all.json.gz | shuf -n1000 > /srv/fatcat/datasets/doaj_article_data_2022-07-19_sample.json

    export FATCAT_AUTH_WORKER_DOAJ=[...]
    cat /srv/fatcat/datasets/doaj_article_data_2022-07-19_sample.json | pv -l | ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
    # Counter({'total': 1000, 'exists': 895, 'exists-fuzzy': 93, 'insert': 9, 'skip': 3, 'skip-no-container': 3, 'update': 0})

Pretty few imports.

Full ingest:

    export FATCAT_AUTH_WORKER_DOAJ=[...]
    zcat /srv/fatcat/datasets/doaj_article_data_2022-07-19_all.json.gz | pv -l | parallel -j6 --round-robin --pipe ./fatcat_import.py doaj-article --issn-map-file /srv/fatcat/datasets/ISSN-to-ISSN-L.txt -
    # Counter({'total': 1282908, 'exists': 1145439, 'exists-fuzzy': 117120, 'insert': 16357, 'skip': 3831, 'skip-no-container': 2641, 'skip-title': 1190, 'skip-doaj-id-mismatch': 161, 'update': 0})

Times 6x, around 100k releases added.

Got a bunch of:

    /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=fcdb7a7a9729403d8d99a21f6970dd1d ident=wesvmjwihvblzayfmrvvgr4ulm
    warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=1455dfe24583480883dbbb293a4bc0c6 ident=lfw57esesjbotms3grvvods5dq
    warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=88fa65a33c8e484091fc76f4cda59c25 ident=22abqt5qe5e7ngjd5fkyvzyc4q
    warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=eb7b03dc3dc340cea36891a68a50cce7 ident=ljedohlfyzdkxebgpcswjtd77q
    warnings.warn(warn_str)
    /1/srv/fatcat/src/python/fatcat_tools/importers/doaj_article.py:233: UserWarning: unexpected DOAJ ext_id match after lookup failed doaj=519617147ce248ea88d45ab098342153 ident=a63bqkttrbhyxavfr7li2w2xf4

Should investigate!

Also, noticed that DOAJ importer is hitting `api.fatcat.wiki`, not the public
API endpoint. Guessing this is via fuzzycat.

1,434,266 results for `doaj_id:*`.

Then did a follow-up sandcrawler ingest, see notes in that repository. Note
that newer ingest can crawl doaj.org, bypassing the sandcrawler SQL load, but
the direct crawling is probably still faster.