## protocols.io
Tested that single ingest is working, and they fixed PDF format on their end
recently.
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa container --name protocols.io
=> Expecting 8448 release objects in search queries
=> Counter({'estimate': 8448, 'kafka': 8448, 'ingest_request': 8448, 'elasticsearch_release': 8448})
## backfill follow-ups
- re-ingest all degruyter (doi_prefix:10.1515)
89942 doi:10.1515\/* is_oa:true
36350 doi:10.1515\/* in_ia:false is_oa:true
40034 publisher:Gruyter is_oa:true in_ia:false
=> update:
135926 doi:10.1515\/* is_oa:true
50544 doi:10.1515\/* in_ia:false is_oa:true
54880 publisher:Gruyter is_oa:true in_ia:false
- re-ingest all frontiersin
36093 publisher:frontiers is_oa:true in_ia:false
=> update
22444 publisher:frontiers is_oa:true in_ia:false
22029 doi_prefix:10.3389 is_oa:true in_ia:false
select status, count(*) from ingest_file_result where base_url like 'https://doi.org/10.3389/%' group by status order by count(*) desc;
status | count
-------------------------------------+-------
success | 34721
no-pdf-link | 18157
terminal-bad-status | 6799
cdx-error | 1805
wayback-error | 333
no-capture | 301
[...]
select * from ingest_file_result where base_url like 'https://doi.org/10.17723/aarc%' and status = 'no-pdf-link' order by updated desc limit 100;
- re-ingest all mdpi
43114 publisher:mdpi is_oa:true in_ia:false
=> update
8548 publisher:mdpi is_oa:true in_ia:false
select status, count(*) from ingest_file_result where base_url like 'https://doi.org/10.3390/%' group by status order by count(*) desc;
status | count
-------------------------------------+--------
success | 108971
cdx-error | 6655
wrong-mimetype | 3359
terminal-bad-status | 1299
wayback-error | 151
spn2-cdx-lookup-failure | 87
=> added hack for gzip content-encoding coming through pdf fetch
=> will re-ingest all after pushing fix
- re-ingest all ahajournals.org
132000 doi:10.1161\/*
6606 doi:10.1161\/* in_ia:false is_oa:true
81349 publisher:"American Heart Association"
5986 publisher:"American Heart Association" is_oa:true in_ia:false
=> update
1337 publisher:"American Heart Association" is_oa:true in_ia:false
status | count
-------------------------------------+-------
success | 1480
cdx-error | 1176
spn2-cdx-lookup-failure | 514
no-pdf-link | 85
wayback-error | 25
spn2-error:job-failed | 18
=> will re-run errors
- re-ingest all ehp.niehs.nih.gov
25522 doi:10.1289\/*
15315 publisher:"Environmental Health Perspectives"
8779 publisher:"Environmental Health Perspectives" in_ia:false
12707 container_id:3w6amv3ecja7fa3ext35ndpiky in_ia:false is_oa:true
=> update
7547 container_id:3w6amv3ecja7fa3ext35ndpiky in_ia:false is_oa:true
- re-ingest all journals.tsu.ru
12232 publisher:"Tomsk State University"
11668 doi:10.17223\/*
4861 publisher:"Tomsk State University" in_ia:false is_oa:true
=> update
2605 publisher:"Tomsk State University" in_ia:false is_oa:true
=> just need to retry these? seem fine
- re-ingest all www.cogentoa.com
3421898 doi:10.1080\/*
4602 journal:cogent is_oa:true in_ia:false
5631 journal:cogent is_oa:true (let's recrawl all from publisher domain)
=> update
254 journal:cogent is_oa:true in_ia:false
- re-ingest chemrxiv
8281 doi:10.26434\/chemrxiv*
6918 doi:10.26434\/chemrxiv* in_ia:false
=> update
4890 doi:10.26434\/chemrxiv* in_ia:false
=> re-ingest
=> allow non-OA
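The gzip content-encoding hack noted for the mdpi re-ingest above might look roughly like this: a minimal sketch assuming a fetch path where the raw body bytes and the `Content-Encoding` response header are available. The function name `maybe_gunzip_pdf` is made up for illustration and is not the actual sandcrawler code.

```python
import gzip

def maybe_gunzip_pdf(body: bytes, content_encoding=None) -> bytes:
    """Some servers return gzip'd bodies even when not requested, so a
    naive fetch can end up saving compressed bytes as the "PDF".
    Transparently decompress when the server marked the body as gzip (or
    it carries the gzip magic bytes) and it isn't already a raw PDF."""
    if body.startswith(b"%PDF"):
        return body
    if content_encoding == "gzip" or body[:2] == b"\x1f\x8b":
        try:
            return gzip.decompress(body)
        except OSError:
            return body  # not actually gzip; leave untouched
    return body
```

Checking for the `%PDF` magic first means an already-decompressed body is never double-processed, and the `OSError` fallback covers servers that lie about the encoding.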
## american archivist
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa container --container-id zpobyv4vbranllc7oob56tgci4
Counter({'estimate': 2920, 'elasticsearch_release': 2920, 'kafka': 2911, 'ingest_request': 2911})
=> 2020-02-04: 85 / 3,005
=> 2020-03-02: 2,182 / 3,005 preserved. some no-pdf-link errors, otherwise mostly spn2-error
=> the no-pdf-link failures look like a pinnacle-secure.allenpress.com soft-blocking loop
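One way to spot this kind of soft-blocking from the crawl side is to check whether a fetch's redirect chain revisits a URL instead of terminating at a PDF. A toy sketch; the function name and hop-list shape are made up for illustration:

```python
def looks_like_block_loop(hops) -> bool:
    """Return True when a redirect chain revisits a URL, i.e. the
    crawler is bouncing off a login/cookie wall rather than reaching
    content. `hops` is the ordered list of URLs for one fetch."""
    seen = set()
    for url in hops:
        if url in seen:
            return True
        seen.add(url)
    return False
```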
## backfill re-ingests
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa --force-recrawl container --container-id zpobyv4vbranllc7oob56tgci4
=> Counter({'elasticsearch_release': 823, 'estimate': 823, 'ingest_request': 814, 'kafka': 814})
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org container --publisher Gruyter
=> Counter({'elasticsearch_release': 54880, 'estimate': 54880, 'kafka': 51497, 'ingest_request': 51497})
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org query 'publisher:"Tomsk State University"'
=> Counter({'ingest_request': 2605, 'kafka': 2605, 'elasticsearch_release': 2605, 'estimate': 2605})
./fatcat_ingest.py --limit 25 --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org query "doi:10.26434\/chemrxiv*"
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org container --publisher mdpi
=> Counter({'estimate': 8548, 'elasticsearch_release': 8548, 'ingest_request': 6693, 'kafka': 6693})
=> NOTE: about 2k not enqueued
## re-ingest all broken
COPY (
SELECT row_to_json(ingest_request.*) FROM ingest_request
LEFT JOIN ingest_file_result ON ingest_file_result.base_url = ingest_request.base_url
WHERE ingest_request.ingest_type = 'pdf'
AND ingest_file_result.ingest_type = 'pdf'
AND ingest_file_result.updated < NOW() - '1 day'::INTERVAL
AND ingest_file_result.hit = false
AND ingest_file_result.status like 'spn2-%'
) TO '/grande/snapshots/reingest_spn2_20200302.rows.json';
=> COPY 14849
COPY (
SELECT row_to_json(ingest_request.*) FROM ingest_request
LEFT JOIN ingest_file_result ON ingest_file_result.base_url = ingest_request.base_url
WHERE ingest_request.ingest_type = 'pdf'
AND ingest_file_result.ingest_type = 'pdf'
AND ingest_file_result.hit = false
AND ingest_file_result.status like 'cdx-error'
) TO '/grande/snapshots/reingest_cdxerr_20200302.rows.json';
=> COPY 507610
This is a huge number! Re-ingest via bulk?
Transform:
./scripts/ingestrequest_row2json.py /grande/snapshots/reingest_spn2_20200302.rows.json > reingest_spn2_20200302.json
./scripts/ingestrequest_row2json.py /grande/snapshots/reingest_cdxerr_20200302.rows.json > reingest_cdxerr_20200302.json
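For reference, the transform step amounts to turning Postgres `row_to_json()` output lines back into plain ingest requests. A hedged stand-in for what `./scripts/ingestrequest_row2json.py` presumably does; the real script may normalize additional fields:

```python
import json

def transform_row(line: str) -> str:
    """Convert one row_to_json() output line into a plain ingest
    request: parse it, drop SQL NULL fields, and re-serialize
    compactly so it matches hand-built request objects.
    (Illustrative stand-in, not the actual script.)"""
    row = json.loads(line)
    req = {k: v for k, v in row.items() if v is not None}
    return json.dumps(req, sort_keys=True)
```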
Push to kafka:
cat reingest_spn2_20200302.json | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests -p -1
# accidentally also piped the above through ingest-file-requests-bulk...
# which could actually be bad
cat reingest_cdxerr_20200302.json | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
## biorxiv/medrxiv
8026 doi:10.1101\/20*
2159 doi:10.1101\/20* in_ia:false
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa query 'doi:10.1101\/20* in_ia:false'
=> Counter({'estimate': 2159, 'ingest_request': 2159, 'elasticsearch_release': 2159, 'kafka': 2159})