1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
A new unpaywall snapshot was released in April 2020 (the snapshot is dated
2020-02-25, but was not released until more than a month later).
Primary goals:

- generate ingest requests for only *new* URLs
- bulk ingest these new URLs
- crawl any no-capture URLs from that batch
- re-bulk-ingest the no-capture batch
- analytics on failed ingests (eg, any particular domains that are failing to crawl)
This ingest pipeline was started on 2020-04-07 by bnewbold.
Ran through the first two steps again on 2020-05-03 after unpaywall had
released another dump (dated 2020-04-27).
## Transform and Load

    # in sandcrawler pipenv on aitio
    zcat /schnell/UNPAYWALL-PDF-CRAWL-2020-04/unpaywall_snapshot_2020-02-25T115244.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /grande/snapshots/unpaywall_snapshot_2020-02-25.ingest_request.json
    => 24.7M 5:17:03 [ 1.3k/s]

    cat /grande/snapshots/unpaywall_snapshot_2020-02-25.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => 24.7M
    => Worker: Counter({'total': 24712947, 'insert-requests': 4282167, 'update-requests': 0})
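The transform step maps each unpaywall snapshot record to a sandcrawler ingest
request. A minimal sketch of that mapping, assuming the public unpaywall
snapshot schema (`doi`, `best_oa_location.url_for_pdf`); the actual
`unpaywall2ingestrequest.py` handles more edge cases and may use different
fields:

```python
import json

def snapshot_record_to_request(line: str):
    """Map one unpaywall snapshot JSON line to an ingest request dict.

    Field names are assumptions based on the public unpaywall snapshot
    schema; the real unpaywall2ingestrequest.py is the authority.
    """
    record = json.loads(line)
    oa = record.get("best_oa_location") or {}
    url = oa.get("url_for_pdf") or oa.get("url")
    if not url or not record.get("doi"):
        return None  # nothing to ingest for this record
    return {
        "ingest_type": "pdf",
        "base_url": url,
        "link_source": "unpaywall",
        "link_source_id": record["doi"],
    }
```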
Second time:

    # in sandcrawler pipenv on aitio
    zcat /schnell/UNPAYWALL-PDF-CRAWL-2020-04/unpaywall_snapshot_2020-04-27T153236.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /grande/snapshots/unpaywall_snapshot_2020-04-27.ingest_request.json
    => 25.2M 3:16:28 [2.14k/s]

    cat /grande/snapshots/unpaywall_snapshot_2020-04-27.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => Worker: Counter({'total': 25189390, 'insert-requests': 1408915, 'update-requests': 0})
    => JSON lines pushed: Counter({'pushed': 25189390, 'total': 25189390})
## Dump new URLs and Bulk Ingest

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status IS NULL
    ) TO '/grande/snapshots/unpaywall_noingest_2020-04-08.rows.json';
    => 3696189
WARNING: forgot to transform from rows to ingest requests.

    cat /grande/snapshots/unpaywall_noingest_2020-04-08.rows.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
Second time:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-05-01'
            AND ingest_file_result.status IS NULL
    ) TO '/grande/snapshots/unpaywall_noingest_2020-05-03.rows.json';
    => 1799760
WARNING: forgot to transform from rows to ingest requests.

    cat /grande/snapshots/unpaywall_noingest_2020-05-03.rows.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
## Dump no-capture, Run Crawl
Make two ingest request dumps: one with "all" URLs, which we will have heritrix
attempt to crawl, and then one with certain domains filtered out, which we may
or may not bother trying to ingest (due to expectation of failure).

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status = 'no-capture'
    ) TO '/grande/snapshots/unpaywall_nocapture_all_2020-05-04.rows.json';
    => 2734145

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status = 'no-capture'
            AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
            AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
            AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
            AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
            AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
            AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
            AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
    ) TO '/grande/snapshots/unpaywall_nocapture_2020-05-04.rows.json';
    => 2602408
NOTE: forgot here to transform from "rows" to ingest requests.
Not actually a very significant size difference after all (2,734,145 vs 2,602,408).
See `journal-crawls` repo for details on seedlist generation and crawling.
## Re-Ingest Post-Crawl
NOTE: if we *do* want to do cleanup eventually, could look for fatcat edits
between 2020-04-01 and 2020-05-25 which have limited "extra" metadata (eg, no
evidence or `oa_status`).
The earlier bulk ingests were done wrong (forgot to transform from rows to full
ingest request docs), so going to re-do those, which should be a superset of
the no-capture crawl URLs:

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_noingest_2020-04-08.rows.json | pv -l > /grande/snapshots/unpaywall_noingest_2020-04-08.json
    => 1.26M 0:00:58 [21.5k/s]
    => previously: 3,696,189

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_noingest_2020-05-03.rows.json | pv -l > /grande/snapshots/unpaywall_noingest_2020-05-03.json
    => 1.26M 0:00:56 [22.3k/s]
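For reference, the row-to-request fix just strips database bookkeeping columns
out of the `row_to_json()` output. A sketch of the idea, with assumed column
names (the actual `scripts/ingestrequest_row2json.py` is the authority):

```python
import json

# Columns of the ingest_request table that belong in an ingest request
# document; these names are assumptions about the schema.
REQUEST_KEYS = ("ingest_type", "base_url", "link_source", "link_source_id")

def row_to_request(row: dict) -> dict:
    """Convert one row_to_json() row into a bare ingest request,
    dropping DB-only columns (eg, 'created')."""
    req = {k: row[k] for k in REQUEST_KEYS if row.get(k) is not None}
    # any extra request context stored as a JSON blob gets merged back in
    extra = row.get("request")
    if isinstance(extra, str):
        extra = json.loads(extra)
    if extra:
        req.update(extra)
    return req
```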
Crap, looks like the 2020-04-08 segment got overwritten with 2020-05 data by
accident. Hrm... need to re-ingest *all* recent unpaywall URLs:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
    ) TO '/grande/snapshots/unpaywall_all_recent_requests_2020-05-26.rows.json';
    => COPY 5691106

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json
    => 5.69M 0:04:26 [21.3k/s]
Start small:

    cat /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json | head -n200 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
Looks good (whew), run the full thing:

    cat /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
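The `rg -v "\\\\"` stage drops any line containing a backslash (escaped JSON
has caused problems pushing to Kafka), and `jq . -c` re-validates and compacts
each remaining line. The same filter expressed in Python (a sketch, not part of
the actual pipeline):

```python
import json

def clean_request_lines(lines):
    """Drop lines containing a backslash and re-emit the rest as compact
    JSON, mirroring the shell stage: rg -v "\\\\" | jq . -c"""
    for line in lines:
        if "\\" in line:
            continue  # skip lines with escape sequences
        yield json.dumps(json.loads(line), separators=(",", ":"))
```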
## Post-ingest stats (2020-08-28)
Overall status:

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'unpaywall'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

                   status                |  count
    -------------------------------------+----------
     success                             | 22063013
     no-pdf-link                         |  2192606
     redirect-loop                       |  1471135
     terminal-bad-status                 |   995106
     no-capture                          |   359440
     cdx-error                           |   358909
     wrong-mimetype                      |   111685
     wayback-error                       |    50705
     link-loop                           |    29359
     null-body                           |    13667
     gateway-timeout                     |     3689
     spn2-cdx-lookup-failure             |     1229
     petabox-error                       |     1007
     redirects-exceeded                  |      747
     invalid-host-resolution             |      464
     spn2-error                          |      107
     spn2-error:job-failed               |       91
     bad-redirect                        |       26
     spn2-error:soft-time-limit-exceeded |        9
     bad-gzip-encoding                   |        5
    (20 rows)
Failures by domain:

    SELECT domain, status, COUNT((domain, status))
    FROM (
        SELECT
            ingest_file_result.ingest_type,
            ingest_file_result.status,
            substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain
        FROM ingest_file_result
        LEFT JOIN ingest_request
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_file_result.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
    ) t1
    WHERE t1.domain != ''
        AND t1.status != 'success'
        AND t1.status != 'no-capture'
    GROUP BY domain, status
    ORDER BY COUNT DESC
    LIMIT 30;

                  domain               |       status        | count
    -----------------------------------+---------------------+--------
     academic.oup.com                  | no-pdf-link         | 415441
     watermark.silverchair.com         | terminal-bad-status | 345937
     www.tandfonline.com               | no-pdf-link         | 262488
     journals.sagepub.com              | no-pdf-link         | 235707
     onlinelibrary.wiley.com           | no-pdf-link         | 225876
     iopscience.iop.org                | terminal-bad-status | 170783
     www.nature.com                    | redirect-loop       | 145522
     www.degruyter.com                 | redirect-loop       | 131898
     files-journal-api.frontiersin.org | terminal-bad-status | 126091
     pubs.acs.org                      | no-pdf-link         | 119223
     society.kisti.re.kr               | no-pdf-link         | 112401
     www.ahajournals.org               | no-pdf-link         | 105953
     dialnet.unirioja.es               | terminal-bad-status |  96505
     www.cell.com                      | redirect-loop       |  87560
     www.ncbi.nlm.nih.gov              | redirect-loop       |  49890
     ageconsearch.umn.edu              | redirect-loop       |  45989
     ashpublications.org               | no-pdf-link         |  45833
     pure.mpg.de                       | redirect-loop       |  45278
     www.degruyter.com                 | terminal-bad-status |  43642
     babel.hathitrust.org              | terminal-bad-status |  42057
     osf.io                            | redirect-loop       |  41119
     scialert.net                      | no-pdf-link         |  39009
     dialnet.unirioja.es               | redirect-loop       |  38839
     www.jci.org                       | redirect-loop       |  34209
     www.spandidos-publications.com    | redirect-loop       |  33167
     www.journal.csj.jp                | no-pdf-link         |  30915
     journals.openedition.org          | redirect-loop       |  30409
     www.valueinhealthjournal.com      | redirect-loop       |  30090
     dergipark.org.tr                  | no-pdf-link         |  29146
     journals.ametsoc.org              | no-pdf-link         |  29133
    (30 rows)
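The domain column above comes from the SQL
`substring(terminal_url FROM '[^/]+://([^/]*)')` extraction. For ad-hoc
analysis outside the database, the same grouping can be reproduced (a sketch;
function names are illustrative):

```python
import re
from collections import Counter

# same pattern as the SQL: substring(terminal_url FROM '[^/]+://([^/]*)')
DOMAIN_RE = re.compile(r"[^/]+://([^/]*)")

def url_domain(url: str) -> str:
    """Extract the domain the same way the SQL query does."""
    m = DOMAIN_RE.match(url or "")
    return m.group(1) if m else ""

def failures_by_domain(results):
    """results: iterable of (terminal_url, status) pairs, as in the query.

    Counts (domain, status) pairs, skipping success/no-capture and empty
    domains, matching the WHERE clause above.
    """
    counts = Counter()
    for url, status in results:
        domain = url_domain(url)
        if domain and status not in ("success", "no-capture"):
            counts[(domain, status)] += 1
    return counts
```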
Enqueue internal failures for re-ingest:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND (
                ingest_file_result.status = 'cdx-error' OR
                ingest_file_result.status = 'wayback-error'
            )
    ) TO '/grande/snapshots/unpaywall_errors_2020-08-28.rows.json';
    => 409606

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_errors_2020-08-28.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_errors_2020-08-28.requests.json

    cat /grande/snapshots/unpaywall_errors_2020-08-28.requests.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
And after *that* (which ran quickly):

                   status                |  count
    -------------------------------------+----------
     success                             | 22281874
     no-pdf-link                         |  2258352
     redirect-loop                       |  1499251
     terminal-bad-status                 |  1004781
     no-capture                          |   401333
     wrong-mimetype                      |   112068
     cdx-error                           |    32259
     link-loop                           |    30137
     null-body                           |    13886
     wayback-error                       |    11653
     gateway-timeout                     |     3689
     spn2-cdx-lookup-failure             |     1229
     petabox-error                       |     1036
     redirects-exceeded                  |      749
     invalid-host-resolution             |      464
     spn2-error                          |      107
     spn2-error:job-failed               |       91
     bad-redirect                        |       26
     spn2-error:soft-time-limit-exceeded |        9
     bad-gzip-encoding                   |        5
    (20 rows)
22,063,013 -> 22,281,874 = +218,861 new successes, not bad!