Primary Goal: start a large crawl of OAI-PMH landing pages that we haven't seen.

Fields of interest for ingest (a rough example record is sketched below):

- oai identifier
- doi
- formats
- urls (maybe also "relations")
- types (type+stage)
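
For orientation, a rough sketch of what one harvested record in `oai.ndjson.zst`
might look like, with field names inferred from the jq queries below; the actual
schema comes from the harvester and may differ:

    # Hypothetical example record (Python literal, for illustration only);
    # field names are inferred from the jq filters used elsewhere in these notes.
    example_record = {
        "oai": "oai:example.org:12345",             # OAI identifier (name assumed)
        "doi": ["10.1234/example.12345"],
        "formats": ["application/pdf"],
        "urls": ["https://example.org/record/12345"],
        "types": ["article"],
        "issn": ["1234-5678"],
        "languages": ["eng"],
    }
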
## Other Tasks
About 150 million total lines.

Types coverage:

    zstdcat oai.ndjson.zst | pv -l | jq "select(.types != null) | .types[]" -r | sort -S 5G | uniq -c | sort -nr -S 1G > types_counts.txt

Dump all ISSNs, with counts (quick check how many are in chocula/fatcat):

    zstdcat oai.ndjson.zst | pv -l | jq "select(.issn != null) | .issn[]" -r | sort -S 5G | uniq -c | sort -nr -S 1G > issn_counts.txt

Language coverage:

    zstdcat oai.ndjson.zst | pv -l | jq "select(.languages != null) | .languages[]" -r | sort -S 5G | uniq -c | sort -nr -S 1G > languages_counts.txt

Format coverage:

    zstdcat oai.ndjson.zst | pv -l | jq "select(.formats != null) | .formats[]" -r | sort -S 5G | uniq -c | sort -nr -S 1G > formats_counts.txt
    => 150M 0:56:14 [44.7k/s]

Have a DOI?

    zstdcat oai.ndjson.zst | pv -l | rg '"doi":' | rg '"10.' | wc -l
    => 16,013,503

    zstdcat oai.ndjson.zst | pv -l | jq "select(.doi != null) | .doi[]" -r | sort -u -S 5G > doi_raw.txt
    => 11,940,950

## Transform, Load, Bulk Ingest

    zstdcat oai.ndjson.zst | ./oai2ingestrequest.py - | pv -l | gzip > oai.202002.requests.json.gz
    => 80M 6:36:55 [3.36k/s]

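A rough sketch of the kind of mapping `oai2ingestrequest.py` performs (not the
actual script; record field names follow the hypothetical example above, and the
request fields are limited to the columns visible in the SQL below):

    import json
    import sys

    def record_to_requests(record):
        """Map one harvested OAI record to zero or more ingest requests (sketch)."""
        for url in record.get("urls") or []:
            yield {
                "ingest_type": "pdf",
                "base_url": url,
                "link_source": "oai",
                "link_source_id": record.get("oai"),  # hypothetical field name
            }

    if __name__ == "__main__":
        for line in sys.stdin:
            line = line.strip()
            if not line:
                continue
            for request in record_to_requests(json.loads(line)):
                print(json.dumps(request, sort_keys=True))
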
Persist the ingest requests:

    time zcat /schnell/oai-pmh/oai.202002.requests.json.gz | pv -l | ./persist_tool.py ingest-request -
    => 80M 4:00:21 [5.55k/s]
    => Worker: Counter({'total': 80013963, 'insert-requests': 51169081, 'update-requests': 0})
    => JSON lines pushed: Counter({'pushed': 80013963, 'total': 80013963})
    => real    240m21.207s
    => user    85m12.576s
    => sys     3m29.580s

    select count(*) from ingest_request where ingest_type = 'pdf' and link_source = 'oai';
    => 51,185,088

Why so many (30 million) skipped? Not unique?

    zcat oai.202002.requests.json.gz | jq '[.link_source_id, .base_url]' -c | sort -u -S 4G | wc -l
    => 51,185,088

    zcat oai.202002.requests.json.gz | jq .base_url -r | pv -l | sort -u -S 4G > request_url.txt
    wc -l request_url.txt
    => 50,002,674 request_url.txt

    zcat oai.202002.requests.json.gz | jq .link_source_id -r | pv -l | sort -u -S 4G > requires_oai.txt
    wc -l requires_oai.txt
    => 34,622,083 requires_oai.txt

Yup, tons of duplication. And remember this is exact URL, not SURT or similar.
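
As a quick illustration of what SURT-style canonicalization would collapse (a
sketch using the `surt` Python package; the output shown in the comment is
approximate):

    from surt import surt  # pip install surt

    # Two URL strings that exact-string dedupe treats as distinct requests...
    print(surt("http://example.org/article/123?b=2&a=1"))
    print(surt("http://EXAMPLE.org/article/123?a=1&b=2"))
    # ...should both come out roughly as "org,example)/article/123?a=1&b=2",
    # so SURT-keyed dedupe would collapse more than the exact-URL counts above.
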
How many of these are URLs we have seen and ingested already?

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status          |  count
    -------------------------+----------
                             | 49491452
     success                 |  1469113
     no-capture              |   134611
     redirect-loop           |    59666
     no-pdf-link             |     8947
     cdx-error               |     7561
     terminal-bad-status     |     6704
     null-body               |     5042
     wrong-mimetype          |      879
     wayback-error           |      722
     petabox-error           |      198
     gateway-timeout         |       86
     link-loop               |       51
     invalid-host-resolution |       24
     spn2-cdx-lookup-failure |       22
     spn2-error              |        4
     bad-gzip-encoding       |        4
     spn2-error:job-failed   |        2
    (18 rows)

Dump ingest requests:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
            AND date(ingest_request.created) > '2020-05-01'
            AND ingest_file_result.status IS NULL
    ) TO '/grande/snapshots/oai_noingest_20200506.rows.json';
    => COPY 49491452

WARNING: should have transformed from rows to requests here

    cat /grande/snapshots/oai_noingest_20200506.rows.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

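For reference, a hedged sketch of the row-to-request reshaping that should have
happened before the kafkacat push; only the columns visible in the SQL above are
kept, and everything else is assumed to be database bookkeeping:

    import json
    import sys

    # Columns of ingest_request that a plain ingest request needs (assumption);
    # other columns in the row_to_json() dump (e.g. "created") get dropped.
    KEEP = ("ingest_type", "base_url", "link_source", "link_source_id")

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        request = {key: row[key] for key in KEEP if row.get(key) is not None}
        print(json.dumps(request, sort_keys=True))
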
## Crawl and re-ingest

Updated stats after ingest (NOTE: these ingest requests were not really formed
correctly, but it doesn't matter because fatcat wasn't importing them anyway):

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status          |  count
    -------------------------+----------
     no-capture              | 42565875
     success                 |  5227609
     no-pdf-link             |  2156341
     redirect-loop           |   559721
     cdx-error               |   260446
     wrong-mimetype          |   148871
     terminal-bad-status     |   109725
     link-loop               |    92792
     null-body               |    30688
                             |    15287
     petabox-error           |    11109
     wayback-error           |     6261
     skip-url-blocklist      |      184
     gateway-timeout         |       86
     bad-gzip-encoding       |       25
     invalid-host-resolution |       24
     spn2-cdx-lookup-failure |       22
     bad-redirect            |       15
     spn2-error              |        4
     spn2-error:job-failed   |        2
    (20 rows)

Dump again for crawling:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
            AND date(ingest_request.created) > '2020-05-01'
            AND (ingest_file_result.status = 'no-capture' OR ingest_file_result.status = 'cdx-error')
    ) TO '/grande/snapshots/oai_tocrawl_20200526.rows.json';

Notes about crawl setup are in the `journal-crawls` repo. Excluded the following domains:

    4876135  www.kb.dk             REMOVE: too large and generic
    3110009  kb-images.kb.dk       REMOVE: dead?
    1274638  mdz-nbn-resolving.de  REMOVE: maybe broken
     982312  aggr.ukm.um.si        REMOVE: maybe broken

After exclusions and URL dedupe this went from about 42,826,313 rows to
31,773,874 unique URLs to crawl, so at least 11,052,439 requests should be
expected to remain `no-capture` (we should probably filter those out, or even
delete them from the ingest request table).
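
The actual seed-list preparation is documented in the `journal-crawls` repo;
roughly, it amounts to something like this sketch (exclusion list from above; at
tens of millions of URLs a disk-based `sort -u` is more practical than an
in-memory set):

    import json
    import sys
    from urllib.parse import urlsplit

    # Domains excluded from the crawl (from the notes above).
    EXCLUDE = {"www.kb.dk", "kb-images.kb.dk", "mdz-nbn-resolving.de", "aggr.ukm.um.si"}

    seen = set()
    for line in sys.stdin:  # e.g. oai_tocrawl_20200526.rows.json
        row = json.loads(line)
        url = row["base_url"]
        if urlsplit(url).hostname in EXCLUDE:
            continue
        if url in seen:  # exact-URL dedupe, same caveat as earlier in these notes
            continue
        seen.add(url)
        print(url)
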
Ingest progress:

    2020-08-05 14:02:  32,571,018
    2020-08-06 13:49:  31,195,169
    2020-08-07 10:11:  29,986,169
    2020-08-10 10:43:  26,497,196
    2020-08-12 11:02:  23,811,845
    2020-08-17 13:34:  19,460,502
    2020-08-20 09:49:  15,069,507
    2020-08-25 09:56:   9,397,035
    2020-09-02 15:02:     305,889  (72k longest queue)
    2020-09-03 14:30:  done

## Post-ingest stats

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status          |  count
    -------------------------+----------
     no-capture              | 16804277
     no-pdf-link             | 14895249
     success                 | 13898603
     redirect-loop           |  2709730
     cdx-error               |   827024
     terminal-bad-status     |   740037
     wrong-mimetype          |   604242
     link-loop               |   532553
     null-body               |    95721
     wayback-error           |    41864
     petabox-error           |    19204
                             |    15287
     gateway-timeout         |      510
     bad-redirect            |      318
     skip-url-blocklist      |      184
     bad-gzip-encoding       |      114
     timeout                 |       78
     spn2-cdx-lookup-failure |       59
     invalid-host-resolution |       19
     blocked-cookie          |        6
    (20 rows)

Hrm, +8 million or so 'success', but that is still a lot of no-capture. It may be
worth dumping the full Kafka results topic, filtering to OAI requests, and
extracting the missing URLs.
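
A rough sketch of that extraction, assuming an NDJSON dump of the results topic
(e.g. via kafkacat) in which each message embeds the original request; the nested
field names here are assumptions about the topic schema:

    import json
    import sys

    # Keep OAI requests whose result was 'no-capture' and print the URL to crawl,
    # preferring the terminal URL over the original base_url when present.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        result = json.loads(line)
        request = result.get("request") or {}
        if request.get("link_source") != "oai":
            continue
        if result.get("status") != "no-capture":
            continue
        terminal = result.get("terminal") or {}
        url = terminal.get("terminal_url") or request.get("base_url")
        if url:
            print(url)
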
Top counts by OAI prefix:

    SELECT
        oai_prefix,
        COUNT(CASE WHEN status = 'success' THEN 1 END) as success,
        COUNT(*) as total
    FROM (
        SELECT
            ingest_file_result.status as status,
            -- eg "oai:cwi.nl:4881"
            substring(ingest_request.link_source_id FROM 'oai:([^:]+):.*') AS oai_prefix
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
    ) t1
    GROUP BY oai_prefix
    ORDER BY total DESC
    LIMIT 25;

           oai_prefix         | success |  total
    --------------------------+---------+---------
     kb.dk                    |       0 | 7989412 (excluded)
     repec                    | 1118591 | 2783448
     bnf.fr                   |       0 | 2187277
     hispana.mcu.es           |   19404 | 1492639
     bdr.oai.bsb-muenchen.de  |      73 | 1319882 (excluded?)
     hal                      |  564700 | 1049607
     ukm.si                   |       0 |  982468 (excluded)
     hsp.org                  |       0 |  810281
     www.irgrid.ac.cn         |   17578 |  748828
     cds.cern.ch              |   72811 |  688091
     americanae.aecid.es      |   69678 |  572792
     biodiversitylibrary.org  |    2121 |  566154
     juser.fz-juelich.de      |   22777 |  518551
     espace.library.uq.edu.au |    6494 |  508960
     igi.indrastra.com        |   58689 |  478577
     archive.ugent.be         |   63654 |  424014
     hrcak.srce.hr            |  395031 |  414897
     zir.nsk.hr               |  153889 |  397200
     renati.sunedu.gob.pe     |   78399 |  388355
     hypotheses.org           |       3 |  374296
     rour.neicon.ru           |    7963 |  354529
     generic.eprints.org      |  261221 |  340470
     invenio.nusl.cz          |    6184 |  325867
     evastar-karlsruhe.de     |   62044 |  317952
     quod.lib.umich.edu       |       5 |  309135
    (25 rows)

Top counts by OAI prefix and status:

    SELECT
        oai_prefix,
        status,
        COUNT((oai_prefix,status))
    FROM (
        SELECT
            ingest_file_result.status as status,
            -- eg "oai:cwi.nl:4881"
            substring(ingest_request.link_source_id FROM 'oai:([^:]+):.*') AS oai_prefix
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
    ) t1
    GROUP BY oai_prefix, status
    ORDER BY COUNT DESC
    LIMIT 30;

           oai_prefix         |    status     |  count
    --------------------------+---------------+---------
     kb.dk                    | no-capture    | 7955231 (excluded)
     bdr.oai.bsb-muenchen.de  | no-capture    | 1270209 (excluded?)
     repec                    | success       | 1118591
     hispana.mcu.es           | no-pdf-link   | 1118092
     bnf.fr                   | no-capture    | 1100591
     ukm.si                   | no-capture    |  976004 (excluded)
     hsp.org                  | no-pdf-link   |  773496
     repec                    | no-pdf-link   |  625629
     bnf.fr                   | no-pdf-link   |  607813
     hal                      | success       |  564700
     biodiversitylibrary.org  | no-pdf-link   |  531409
     cds.cern.ch              | no-capture    |  529842
     repec                    | redirect-loop |  504393
     juser.fz-juelich.de      | no-pdf-link   |  468813
     bnf.fr                   | redirect-loop |  436087
     americanae.aecid.es      | no-pdf-link   |  409954
     hrcak.srce.hr            | success       |  395031
     www.irgrid.ac.cn         | no-pdf-link   |  362087
     hal                      | no-pdf-link   |  352111
     www.irgrid.ac.cn         | no-capture    |  346963
     espace.library.uq.edu.au | no-pdf-link   |  315302
     igi.indrastra.com        | no-pdf-link   |  312087
     repec                    | no-capture    |  309882
     invenio.nusl.cz          | no-pdf-link   |  302657
     hypotheses.org           | no-pdf-link   |  298750
     rour.neicon.ru           | redirect-loop |  291922
     renati.sunedu.gob.pe     | no-capture    |  276388
     t2r2.star.titech.ac.jp   | no-pdf-link   |  264109
     generic.eprints.org      | success       |  261221
     quod.lib.umich.edu       | no-pdf-link   |  253937
    (30 rows)

If we remove excluded prefixes, and some large/generic prefixes (bnf.fr,
hispana.mcu.es, hsp.org), then the aggregate counts are:

    no-capture  | 16,804,277 -> 5,502,242
    no-pdf-link | 14,895,249 -> 12,395,848

Top status by terminal domain:

    SELECT domain, status, COUNT((domain, status))
    FROM (
        SELECT
            ingest_file_result.ingest_type,
            ingest_file_result.status,
            substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain
        FROM ingest_file_result
        LEFT JOIN ingest_request
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_file_result.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
    ) t1
    WHERE t1.domain != ''
    GROUP BY domain, status
    ORDER BY COUNT DESC
    LIMIT 30;

                  domain              |    status     | count
    ----------------------------------+---------------+--------
     hispana.mcu.es                   | no-pdf-link   | 709701 (national scope)
     gallica.bnf.fr                   | no-pdf-link   | 601193 (national scope)
     discover.hsp.org                 | no-pdf-link   | 524212 (historical)
     www.biodiversitylibrary.org      | no-pdf-link   | 479288
     gallica.bnf.fr                   | redirect-loop | 435981 (national scope)
     hrcak.srce.hr                    | success       | 389673
     hemerotecadigital.bne.es         | no-pdf-link   | 359243
     juser.fz-juelich.de              | no-pdf-link   | 345112
     espace.library.uq.edu.au         | no-pdf-link   | 304299
     invenio.nusl.cz                  | no-pdf-link   | 302586
     igi.indrastra.com                | no-pdf-link   | 292006
     openrepository.ru                | redirect-loop | 291555
     hal.archives-ouvertes.fr         | success       | 278134
     t2r2.star.titech.ac.jp           | no-pdf-link   | 263971
     bib-pubdb1.desy.de               | no-pdf-link   | 254879
     quod.lib.umich.edu               | no-pdf-link   | 250382
     encounters.hsp.org               | no-pdf-link   | 248132
     americanae.aecid.es              | no-pdf-link   | 245295
     www.irgrid.ac.cn                 | no-pdf-link   | 242496
     publikationen.bibliothek.kit.edu | no-pdf-link   | 222041
     www.sciencedirect.com            | no-pdf-link   | 211756
     dialnet.unirioja.es              | redirect-loop | 203615
     edoc.mpg.de                      | no-pdf-link   | 195526
     bibliotecadigital.jcyl.es        | no-pdf-link   | 184671
     hal.archives-ouvertes.fr         | no-pdf-link   | 183809
     www.sciencedirect.com            | redirect-loop | 173439
     lup.lub.lu.se                    | no-pdf-link   | 165788
     orbi.uliege.be                   | no-pdf-link   | 158313
     www.erudit.org                   | success       | 155986
     lib.dr.iastate.edu               | success       | 153384
    (30 rows)

Follow-ups are TBD but could include:

- crawling the ~5 million no-capture links directly (i.e., not just `base_url`)
  from the ingest result JSON, while retaining the ingest request for later
  re-ingest
- investigating and iterating on PDF link extraction, both for large platforms
  and for random samples from the long tail
- classifying OAI prefixes by type (subject repository, institutional repository,
  journal, national library, historical documents, grey literature, law, etc.)
- running pdftrio over some or all of this corpus