aboutsummaryrefslogtreecommitdiffstats
path: root/notes/ingest/2021-07_unpaywall.md
blob: 8b6ac092944a97c105cb6cd99e272fe19cdff4d4 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320

New snapshot released 2021-07-02. Should be "boring" ingest and crawl.


## Transform and Load

    # in sandcrawler pipenv on sandcrawler1-vm (svc506)
    zcat /srv/sandcrawler/tasks/unpaywall_snapshot_2021-07-02T151134.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /srv/sandcrawler/tasks/unpaywall_snapshot_2021-07-02.ingest_request.json
    => 32.2M 3:01:52 [2.95k/s]

    cat /srv/sandcrawler/tasks/unpaywall_snapshot_2021-07-02.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => Worker: Counter({'total': 32196260, 'insert-requests': 3325954, 'update-requests': 0})
    => JSON lines pushed: Counter({'total': 32196260, 'pushed': 32196260})


## Dump new URLs, Transform, Bulk Ingest

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            -- AND date(ingest_request.created) > '2021-01-01'
            AND (ingest_file_result.status IS NULL
                OR ingest_file_result.status = 'no-capture')
    ) TO '/srv/sandcrawler/tasks/unpaywall_noingest_2021-07-02.rows.json';
    => COPY 3556146

    # previous, 2020-10 run: COPY 4216339
    # previous, 2021-07 run: COPY 3277484

Oops, should have run instead, with the date filter:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2021-07-01'
            AND (ingest_file_result.status IS NULL
                OR ingest_file_result.status = 'no-capture')
    ) TO '/srv/sandcrawler/tasks/unpaywall_noingest_2021-07-02.rows.json';

But didn't, so processed all instead.

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/unpaywall_noingest_2021-07-02.rows.json | pv -l | shuf > /srv/sandcrawler/tasks/unpaywall_noingest_2021-07-02.ingest_request.json
    => 3.56M 0:01:59 [29.8k/s]

Enqueue the whole batch:

    cat /srv/sandcrawler/tasks/unpaywall_noingest_2021-07-02.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    => done, on 2021-07-13


## Check Pre-Crawl Status

Only the recent bulk ingest:

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE 
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'unpaywall'
        AND date(ingest_request.created) > '2021-07-01'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status          |  count
    -------------------------+---------
     no-capture              | 1831827
     success                 | 1343604
     redirect-loop           |  103999
     terminal-bad-status     |   19845
     no-pdf-link             |   17448
     link-loop               |    5027
     wrong-mimetype          |    2270
     cdx-error               |     523
     body-too-large          |     321
     null-body               |     298
     wayback-content-error   |     242
     petabox-error           |     155
     gateway-timeout         |     138
     invalid-host-resolution |     120
     wayback-error           |     109
     blocked-cookie          |       9
     timeout                 |       7
                             |       3
     bad-redirect            |       3
     spn2-cdx-lookup-failure |       3
    (20 rows)


## Dump Seedlist

Dump rows:

    COPY (
        SELECT row_to_json(t1.*)
        FROM (
            SELECT ingest_request.*, ingest_file_result as result
            FROM ingest_request
            LEFT JOIN ingest_file_result
                ON ingest_file_result.ingest_type = ingest_request.ingest_type
                AND ingest_file_result.base_url = ingest_request.base_url
            WHERE
                ingest_request.ingest_type = 'pdf'
                AND date(ingest_request.created) > '2021-07-01'
                AND ingest_request.link_source = 'unpaywall'
                AND (ingest_file_result.status = 'no-capture'
                    OR ingest_file_result.status = 'cdx-error'
                    OR ingest_file_result.status = 'wayback-error'
                    OR ingest_file_result.status = 'gateway-timeout'
                    OR ingest_file_result.status = 'spn2-cdx-lookup-failure'
                )
                AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
                AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
                AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
                AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
                AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
                AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
                AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
                AND ingest_request.base_url NOT LIKE '%.archive.org%'
                AND ingest_request.base_url NOT LIKE '%://archive.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%journals.sagepub.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%pubs.acs.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%ahajournals.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%www.journal.csj.jp%'
                AND ingest_file_result.terminal_url NOT LIKE '%aip.scitation.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%academic.oup.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%tandfonline.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%.archive.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%://archive.org%'
        ) t1
    ) TO '/srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.rows.json';
    => COPY 1743186

Prep ingest requests (for post-crawl use):

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.rows.json | pv -l > /srv/sandcrawler/tasks/unpaywall_crawl_ingest_2021-07-02.json
    => 1.74M 0:01:33 [18.6k/s]

And actually dump seedlist(s):

    cat /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.rows.json | jq -r .base_url | sort -u -S 4G > /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.url.txt
    cat /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.rows.json | rg '"no-capture"' | jq -r .result.terminal_url | rg -v ^null$ | sort -u -S 4G > /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.terminal_url.txt
    cat /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.rows.json | rg -v '"no-capture"' | jq -r .base_url | sort -u -S 4G > /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.no_terminal_url.txt

    wc -l /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.*.txt
            1 /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.no_terminal_url.txt
      1643963 /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.terminal_url.txt
      1644028 /srv/sandcrawler/tasks/unpaywall_seedlist_2021-07-02.url.txt
      3287992 total

Then run crawl (see `journal-crawls` git repo).

## Post-Crawl Bulk Ingest

    cat /srv/sandcrawler/tasks/unpaywall_crawl_ingest_2021-07-02.json | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    => 1.74M 0:01:59 [14.6k/s]

## Post-Ingest Stats

Only the recent updates:

    SELECT ingest_file_result.status, COUNT(*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2021-07-01'
        GROUP BY status
        ORDER BY COUNT DESC
        LIMIT 20;

             status          |  count  
    -------------------------+---------
     success                 | 2690258
     redirect-loop           |  227328
     no-capture              |  157368
     terminal-bad-status     |  118943
     no-pdf-link             |   92698
     blocked-cookie          |   19478
     link-loop               |    9249
     wrong-mimetype          |    4918
     cdx-error               |    1786
     wayback-error           |    1497
     null-body               |    1302
     body-too-large          |     433
     wayback-content-error   |     245
     petabox-error           |     171
     gateway-timeout         |     138
     invalid-host-resolution |     120
     timeout                 |      12
     bad-redirect            |       4
                             |       3
     spn2-cdx-lookup-failure |       1
    (20 rows)

Only the recent updates, by publication stage:

    SELECT ingest_request.release_stage, ingest_file_result.status, COUNT(*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2021-07-01'
        GROUP BY release_stage, status
        ORDER BY release_stage, COUNT DESC
        LIMIT 100;

     release_stage |         status          |  count  
    ---------------+-------------------------+---------
     accepted      | success                 |  103144
     accepted      | no-pdf-link             |   53981
     accepted      | terminal-bad-status     |    4102
     accepted      | link-loop               |    2799
     accepted      | no-capture              |    2315
     accepted      | redirect-loop           |    2171
     accepted      | blocked-cookie          |     234
     accepted      | cdx-error               |     140
     accepted      | wayback-error           |     101
     accepted      | wrong-mimetype          |      38
     accepted      | null-body               |      10
     accepted      | petabox-error           |       5
     accepted      | wayback-content-error   |       4
     accepted      | gateway-timeout         |       2
     accepted      | body-too-large          |       2
     published     | success                 | 1919100
     published     | no-capture              |  130104
     published     | redirect-loop           |  127482
     published     | terminal-bad-status     |   43118
     published     | no-pdf-link             |   33505
     published     | blocked-cookie          |   19034
     published     | link-loop               |    6241
     published     | wrong-mimetype          |    4163
     published     | null-body               |    1195
     published     | cdx-error               |    1151
     published     | wayback-error           |    1105
     published     | wayback-content-error   |     197
     published     | body-too-large          |     195
     published     | petabox-error           |     118
     published     | gateway-timeout         |      35
     published     | invalid-host-resolution |      13
     published     | timeout                 |       8
     published     | bad-redirect            |       2
     published     | spn2-cdx-lookup-failure |       1
     published     | bad-gzip-encoding       |       1
     submitted     | success                 |  668014
     submitted     | redirect-loop           |   97675
     submitted     | terminal-bad-status     |   71723
     submitted     | no-capture              |   24949
     submitted     | no-pdf-link             |    5212
     submitted     | wrong-mimetype          |     717
     submitted     | cdx-error               |     495
     submitted     | wayback-error           |     291
     submitted     | body-too-large          |     236
     submitted     | blocked-cookie          |     210
     submitted     | link-loop               |     209
     submitted     | invalid-host-resolution |     107
     submitted     | gateway-timeout         |     101
     submitted     | null-body               |      97
     submitted     | petabox-error           |      48
     submitted     | wayback-content-error   |      44
     submitted     | timeout                 |       4
     submitted     |                         |       3
     submitted     | bad-redirect            |       2
     submitted     | remote-server-error     |       1
    (55 rows)

In total, this iteration of unpaywall ingest resulted in:

- 3,325,954 raw ingest requests (new URLs total)
- 1,743,186 (52% of all) of these had not been seen/crawled from any source yet (?), and attempted to crawl
- 1,346,654 (77% of crawled) success from new heritrix crawling
- 2,690,258 (80%) total success (including crawled initially for other reasons; out of all new URLs including those not expected to be success)

## Live Ingest Follow-Up

Will run SPN requests on the ~160k `no-capture` URLs:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2021-07-01'
            AND (ingest_file_result.status IS NULL
                OR ingest_file_result.status = 'no-capture')
    ) TO '/srv/sandcrawler/tasks/unpaywall_noingest_2021-07-30.rows.json';
    => COPY 157371

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/unpaywall_noingest_2021-07-30.rows.json | pv -l | shuf > /srv/sandcrawler/tasks/unpaywall_noingest_2021-07-30.ingest_request.json
    => 157k 0:00:04 [31.6k/s]

Enqueue the whole batch:

    cat /srv/sandcrawler/tasks/unpaywall_noingest_2021-07-30.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests -p -1
    => DONE