New snapshot released 2020-10-09. Want to do a mostly straightforward
load/ingest/crawl.

Proposed changes this time around:

- have bulk ingest store missing URLs in sandcrawler-db with `no-capture`
  status, and include those URLs in the heritrix3 crawl
- tweak heritrix3 config for additional PDF URL extraction patterns,
  particularly to improve OJS yield (see the sketch below)
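
The heritrix3 config itself is XML and not reproduced here. As a rough
illustration of the OJS pattern in question: OJS article "view" pages usually
have the raw PDF at a parallel "download" path. A minimal Python sketch of
that rewrite (the URL shapes and helper name are assumptions, not sandcrawler
code):

    import re

    # common OJS URL shape (assumption): .../article/view/<article_id>[/<galley_id>]
    OJS_VIEW = re.compile(r"^(?P<base>https?://.+)/article/view/(?P<ids>\d+(?:/\d+)?)$")

    def ojs_pdf_url(url):
        """Guess the direct-PDF ("download") URL for an OJS article "view" URL."""
        m = OJS_VIEW.match(url)
        if m:
            return "{}/article/download/{}".format(m.group("base"), m.group("ids"))
        return None

    assert ojs_pdf_url("https://example.org/index.php/jx/article/view/123/456") \
        == "https://example.org/index.php/jx/article/download/123/456"
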
## Transform and Load

    # in sandcrawler pipenv on aitio
    zcat /schnell/unpaywall/unpaywall_snapshot_2020-10-09T153852.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /grande/snapshots/unpaywall_snapshot_2020-10-09.ingest_request.json
    => 28.3M 3:19:03 [2.37k/s]

    cat /grande/snapshots/unpaywall_snapshot_2020-10-09.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => 28.3M 1:11:29 [ 6.6k/s]
    => Worker: Counter({'total': 28298500, 'insert-requests': 4119939, 'update-requests': 0})
    => JSON lines pushed: Counter({'total': 28298500, 'pushed': 28298500})
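
For reference, a minimal sketch of the kind of transform
`unpaywall2ingestrequest.py` performs (not the actual script; the unpaywall
field names and the stage mapping are assumptions based on the snapshot
format and the `ingest_request` columns queried below):

    import json
    import sys

    # unpaywall "version" values -> release_stage (assumed mapping)
    STAGE = {
        "publishedVersion": "published",
        "acceptedVersion": "accepted",
        "submittedVersion": "submitted",
    }

    def row_to_requests(row):
        """Yield one ingest request per OA location with a usable PDF URL."""
        for loc in row.get("oa_locations") or []:
            url = loc.get("url_for_pdf") or loc.get("url")
            if not url:
                continue
            yield {
                "ingest_type": "pdf",
                "base_url": url,
                "link_source": "unpaywall",
                "link_source_id": row["doi"],
                "ingest_request_source": "unpaywall",
                "release_stage": STAGE.get(loc.get("version")),
            }

    for line in sys.stdin:
        for req in row_to_requests(json.loads(line)):
            print(json.dumps(req, sort_keys=True))
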
## Dump new URLs, Transform, Bulk Ingest

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            -- AND date(ingest_request.created) > '2020-10-09'
            AND (ingest_file_result.status IS NULL
                OR ingest_file_result.status = 'no-capture')
    ) TO '/grande/snapshots/unpaywall_noingest_2020-10-09.rows.json';
    => COPY 4216339

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_noingest_2020-10-09.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_noingest_2020-10-09.ingest_request.json
    => 4.22M 0:02:48 [ 25k/s]
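
A hedged sketch of what `ingestrequest_row2json.py` does with these rows
(the details here are assumptions about the script and the sandcrawler-db
schema): it re-shapes the `row_to_json()` output from the `COPY` above back
into plain ingest request JSON, dropping database bookkeeping fields.

    import json
    import sys

    for line in sys.stdin:
        row = json.loads(line)
        # drop DB bookkeeping not part of the request itself (assumed field)
        row.pop("created", None)
        # flatten any `request` JSONB context back to the top level (assumed column)
        extra = row.pop("request", None)
        if extra:
            row.update(extra)
        print(json.dumps(row, sort_keys=True))

The `shuf` randomizes request order, presumably to spread bulk ingest load
across domains rather than hitting one site in long sequential runs.
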
Start small, to test no-capture behavior:

    cat /grande/snapshots/unpaywall_noingest_2020-10-09.ingest_request.json | head -n1000 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

`no-capture` change looks good. Enqueue the whole batch (the `rg -v "\\\\"`
filter skips lines containing backslashes, which can trip up JSON handling
downstream):

    cat /grande/snapshots/unpaywall_noingest_2020-10-09.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

## Check Pre-Crawl Status

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'unpaywall'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status          |  count
    -------------------------+----------
     success                 | 23661282
     no-capture              |  3015447
     no-pdf-link             |  2302102
     redirect-loop           |  1542566
     terminal-bad-status     |  1044676
     wrong-mimetype          |   114315
     link-loop               |    36358
     cdx-error               |    20150
     null-body               |    14513
     wayback-error           |    13644
     gateway-timeout         |     3776
     spn2-cdx-lookup-failure |     1260
     petabox-error           |     1171
     redirects-exceeded      |      752
     invalid-host-resolution |      464
     spn2-error              |      147
     bad-redirect            |      131
     spn2-error:job-failed   |       91
     wayback-content-error   |       45
     timeout                 |       19
    (20 rows)

## Dump Seedlist
Dump rows:

    COPY (
        SELECT row_to_json(t1.*)
        FROM (
            SELECT ingest_request.*, ingest_file_result as result
            FROM ingest_request
            LEFT JOIN ingest_file_result
                ON ingest_file_result.ingest_type = ingest_request.ingest_type
                AND ingest_file_result.base_url = ingest_request.base_url
            WHERE
                ingest_request.ingest_type = 'pdf'
                AND ingest_request.link_source = 'unpaywall'
                AND (ingest_file_result.status = 'no-capture'
                    OR ingest_file_result.status = 'cdx-error'
                    OR ingest_file_result.status = 'wayback-error'
                    OR ingest_file_result.status = 'gateway-timeout'
                    OR ingest_file_result.status = 'spn2-cdx-lookup-failure'
                )
                AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
                AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
                AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
                AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
                AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
                AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
                AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%journals.sagepub.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%pubs.acs.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%ahajournals.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%www.journal.csj.jp%'
                AND ingest_file_result.terminal_url NOT LIKE '%aip.scitation.org%'
                AND ingest_file_result.terminal_url NOT LIKE '%academic.oup.com%'
                AND ingest_file_result.terminal_url NOT LIKE '%tandfonline.com%'
        ) t1
    ) TO '/grande/snapshots/unpaywall_seedlist_2020-11-02.rows.json';
    => 2,936,404

    # TODO: in the future also exclude "www.archive.org"
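
For scale: the five retry statuses totaled 3,015,447 + 20,150 + 13,644 +
3,776 + 1,260 = 3,054,277 rows in the pre-crawl status check above, so the
domain exclusions (plus any status churn between the check and the dump)
account for roughly 117,873 dropped rows.
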
Prep ingest requests (for post-crawl use):

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_seedlist_2020-11-02.rows.json | pv -l > /grande/snapshots/unpaywall_crawl_ingest_2020-11-02.json

And actually dump seedlist(s):

    cat /grande/snapshots/unpaywall_seedlist_2020-11-02.rows.json | jq -r .base_url | sort -u -S 4G > /grande/snapshots/unpaywall_seedlist_2020-11-02.url.txt
    cat /grande/snapshots/unpaywall_seedlist_2020-11-02.rows.json | rg '"no-capture"' | jq -r .result.terminal_url | rg -v ^null$ | sort -u -S 4G > /grande/snapshots/unpaywall_seedlist_2020-11-02.terminal_url.txt
    cat /grande/snapshots/unpaywall_seedlist_2020-11-02.rows.json | rg -v '"no-capture"' | jq -r .base_url | sort -u -S 4G > /grande/snapshots/unpaywall_seedlist_2020-11-02.no_terminal_url.txt

    wc -l unpaywall_seedlist_2020-11-02.*.txt
    2701178 unpaywall_seedlist_2020-11-02.terminal_url.txt
    2713866 unpaywall_seedlist_2020-11-02.url.txt

Given session tokens (eg, `jsessionid`) embedded in many base URLs, crawling
just the terminal URLs will probably work better than crawling both the base
and terminal URLs.

A fraction of the `no-capture` rows have partial/stub URLs as their terminal
URL.

TODO: investigate the scale of partial/stub `terminal_url` values (e.g., not
HTTP/S or FTP); see the sketch below.
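
A quick way to measure that (hypothetical one-off script; the
`result.terminal_url` field layout matches the rows dump above):

    #!/usr/bin/env python3
    # count_stub_terminal_urls.py (hypothetical one-off)
    # usage: cat unpaywall_seedlist_2020-11-02.rows.json | ./count_stub_terminal_urls.py
    import json
    import re
    import sys

    VALID_SCHEME = re.compile(r"^(https?|ftp)://", re.IGNORECASE)

    total = stub = 0
    for line in sys.stdin:
        row = json.loads(line)
        terminal_url = (row.get("result") or {}).get("terminal_url")
        if not terminal_url:
            continue
        total += 1
        if not VALID_SCHEME.match(terminal_url):
            stub += 1

    print("{} of {} terminal URLs look partial/stub".format(stub, total))
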
## Bulk Ingest and Status
Note: removing archive.org links (per the TODO above):

    cat /grande/snapshots/unpaywall_crawl_ingest_2020-11-02.json | rg -v www.archive.org | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Overall status (checked 2020-12-08):

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'unpaywall'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status          |  count
    -------------------------+----------
     success                 | 25004559
     no-pdf-link             |  2531841
     redirect-loop           |  1671375
     terminal-bad-status     |  1389463
     no-capture              |   893880
     wrong-mimetype          |   119332
     link-loop               |    66508
     wayback-content-error   |    30339
     cdx-error               |    21790
     null-body               |    20710
     wayback-error           |    13976
     gateway-timeout         |     3775
     petabox-error           |     2420
     spn2-cdx-lookup-failure |     1218
     redirects-exceeded      |      889
     invalid-host-resolution |      464
     bad-redirect            |      147
     spn2-error              |      112
     spn2-error:job-failed   |       91
     timeout                 |       21
    (20 rows)

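Relative to the pre-crawl check: `success` grew from 23,661,282 to 25,004,559
(+1,343,277), and `no-capture` dropped from 3,015,447 to 893,880
(-2,121,567). So the crawl resolved most of the missing captures, with the
remainder of the delta landing mostly in `no-pdf-link`,
`terminal-bad-status`, and `redirect-loop`.
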
Ingest stats broken down by publication stage:

    SELECT ingest_request.release_stage, ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'unpaywall'
    GROUP BY release_stage, status
    ORDER BY release_stage, COUNT DESC
    LIMIT 100;

     release_stage |               status                |  count
    ---------------+-------------------------------------+----------
     accepted      | success                             |  1101090
     accepted      | no-pdf-link                         |    28590
     accepted      | redirect-loop                       |    10923
     accepted      | no-capture                          |     9540
     accepted      | terminal-bad-status                 |     6339
     accepted      | cdx-error                           |      952
     accepted      | wrong-mimetype                      |      447
     accepted      | link-loop                           |      275
     accepted      | wayback-error                       |      202
     accepted      | petabox-error                       |      177
     accepted      | redirects-exceeded                  |      122
     accepted      | null-body                           |       27
     accepted      | wayback-content-error               |       14
     accepted      | spn2-cdx-lookup-failure             |        5
     accepted      | gateway-timeout                     |        4
     accepted      | bad-redirect                        |        1
     published     | success                             | 18595278
     published     | no-pdf-link                         |  2434935
     published     | redirect-loop                       |  1364110
     published     | terminal-bad-status                 |  1185328
     published     | no-capture                          |   718792
     published     | wrong-mimetype                      |   112923
     published     | link-loop                           |    63874
     published     | wayback-content-error               |    30268
     published     | cdx-error                           |    17302
     published     | null-body                           |    15209
     published     | wayback-error                       |    10782
     published     | gateway-timeout                     |     1966
     published     | petabox-error                       |     1611
     published     | spn2-cdx-lookup-failure             |      879
     published     | redirects-exceeded                  |      760
     published     | invalid-host-resolution             |      453
     published     | bad-redirect                        |      115
     published     | spn2-error:job-failed               |       77
     published     | spn2-error                          |       75
     published     | timeout                             |       21
     published     | bad-gzip-encoding                   |        5
     published     | spn2-error:soft-time-limit-exceeded |        4
     published     | spn2-error:pending                  |        1
     published     | blocked-cookie                      |        1
     published     |                                     |        1
     published     | pending                             |        1
     submitted     | success                             |  5308166
     submitted     | redirect-loop                       |   296322
     submitted     | terminal-bad-status                 |   197785
     submitted     | no-capture                          |   165545
     submitted     | no-pdf-link                         |    68274
     submitted     | wrong-mimetype                      |     5962
     submitted     | null-body                           |     5474
     submitted     | cdx-error                           |     3536
     submitted     | wayback-error                       |     2992
     submitted     | link-loop                           |     2359
     submitted     | gateway-timeout                     |     1805
     submitted     | petabox-error                       |      632
     submitted     | spn2-cdx-lookup-failure             |      334
     submitted     | wayback-content-error               |       57
     submitted     | spn2-error                          |       37
     submitted     | bad-redirect                        |       31
     submitted     | spn2-error:job-failed               |       14
     submitted     |                                     |       12
     submitted     | invalid-host-resolution             |       11
     submitted     | redirects-exceeded                  |        7
     submitted     | spn2-error:soft-time-limit-exceeded |        5
     submitted     | bad-gzip-encoding                   |        1
     submitted     | skip-url-blocklist                  |        1
                   | no-pdf-link                         |       42
                   | success                             |       25
                   | redirect-loop                       |       20
                   | terminal-bad-status                 |       11
                   | no-capture                          |        3
    (70 rows)