After the broad DataCite crawl, we want to ingest paper PDFs into fatcat. Many
of the DOIs point to other kinds of works (e.g., datasets), and we don't want to
waste time on those. Instead of using the full ingest request file from the
crawl, generate a new ingest request file using `fatcat_ingest.py` and set that
up for bulk crawling.
## Generate Requests
    ./fatcat_ingest.py --allow-non-oa --release-types article-journal,paper-conference,article,report,thesis,book,chapter query "doi_registrar:datacite" | pv -l > /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json
    => Expecting 8905453 release objects in search queries
    => 8.91M 11:49:50 [ 209 /s]
    => Counter({'elasticsearch_release': 8905453, 'ingest_request': 8905453, 'estimate': 8905453})
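As a quick sanity check on the `pv` output above, the reported rate follows from the line count and the elapsed time. A minimal sketch; the variable names are illustrative and not part of any tool:

```python
# Figures copied from the pv / Counter output above.
release_count = 8905453
hours, minutes, seconds = 11, 49, 50

# Convert the elapsed wall-clock time to seconds and derive the average rate.
elapsed_seconds = hours * 3600 + minutes * 60 + seconds
rate = release_count / elapsed_seconds

print(f"{elapsed_seconds} s elapsed, about {rate:.1f} requests/s")
```

The result lands at roughly 209 requests per second, consistent with the `[ 209 /s]` figure that `pv` printed.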
## Bulk Ingest
    cat /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
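The middle of that pipeline does two things before handing lines to `kafkacat`: `rg -v "\\\\"` drops every line that contains a backslash, and `jq . -c` re-serializes each remaining line as compact JSON. The notes do not state why backslashes are excluded. The sketch below approximates those two stages in Python; `filter_requests` is a hypothetical helper name, and the sample URLs are made up for illustration.

```python
import json


def filter_requests(lines):
    """Yield compact JSON for each request line that contains no backslash."""
    for line in lines:
        line = line.strip()
        # Mirrors `rg -v "\\\\"`: skip any line containing a literal backslash.
        if not line or "\\" in line:
            continue
        # Mirrors `jq . -c`: parse, then re-emit on one line without extra spaces.
        yield json.dumps(json.loads(line), separators=(",", ":"), ensure_ascii=False)


# Made-up sample lines; only the field names come from the ingest tables.
sample = [
    r'{"ingest_type": "pdf", "base_url": "https://doi.org/10.1234/example"}',
    r'{"ingest_type": "pdf", "base_url": "https://doi.org/10.1234/a\\b"}',
]
kept = list(filter_requests(sample))
print(kept)
```

Only the first sample line survives, since the second contains backslashes. Unlike `jq`, this sketch raises an exception on the first line that is not valid JSON, and number formatting may differ slightly.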
## Ingest Stats
Note that this will have a small fraction of non-datacite results mixed in (eg,
from COVID-19 targeted crawls):

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'doi'
        AND ingest_request.ingest_type = 'pdf'
        AND ingest_request.ingest_request_source = 'fatcat-ingest'
        AND created >= '2020-04-07'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

                   status                |  count
    -------------------------------------+---------
     no-pdf-link                         | 4646767
     redirect-loop                       | 1447229
     no-capture                          |  860235
     success                             |  849501
     terminal-bad-status                 |  174869
     cdx-error                           |  159805
     wayback-error                       |   18076
     wrong-mimetype                      |   11169
     link-loop                           |    8410
     gateway-timeout                     |    4034
     spn2-cdx-lookup-failure             |     510
     petabox-error                       |     339
     null-body                           |     251
     spn2-error                          |      19
     spn2-error:job-failed               |      14
     bad-gzip-encoding                   |      13
     timeout                             |       5
     spn2-error:soft-time-limit-exceeded |       4
     invalid-host-resolution             |       2
     spn2-error:pending                  |       1
    (20 rows)
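To put the raw counts in proportion, the share of each status can be computed from the table above. This is a minimal sketch that only covers the 20 rows shown (the query has `LIMIT 20`, so rarer statuses may be missing), and the totals still include the small fraction of non-datacite requests mentioned earlier.

```python
# Counts copied from the status table above (top 20 statuses only).
status_counts = {
    "no-pdf-link": 4646767,
    "redirect-loop": 1447229,
    "no-capture": 860235,
    "success": 849501,
    "terminal-bad-status": 174869,
    "cdx-error": 159805,
    "wayback-error": 18076,
    "wrong-mimetype": 11169,
    "link-loop": 8410,
    "gateway-timeout": 4034,
    "spn2-cdx-lookup-failure": 510,
    "petabox-error": 339,
    "null-body": 251,
    "spn2-error": 19,
    "spn2-error:job-failed": 14,
    "bad-gzip-encoding": 13,
    "timeout": 5,
    "spn2-error:soft-time-limit-exceeded": 4,
    "invalid-host-resolution": 2,
    "spn2-error:pending": 1,
}

total = sum(status_counts.values())
success_share = status_counts["success"] / total

# Print the five largest statuses with their share of the listed total.
for status, count in sorted(status_counts.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{status:<20} {count:>8} {count / total:6.1%}")
print(f"listed total: {total}")
```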
Top domains/statuses (excluding success):

    SELECT domain, status, COUNT((domain, status))
    FROM (
        SELECT
            ingest_file_result.ingest_type,
            ingest_file_result.status,
            substring(ingest_file_result.terminal_url FROM '[^/]+://([^/]*)') AS domain
        FROM ingest_file_result
        LEFT JOIN ingest_request
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'doi'
            AND ingest_request.ingest_type = 'pdf'
            AND ingest_request.ingest_request_source = 'fatcat-ingest'
            AND created >= '2020-04-07'
    ) t1
    WHERE t1.domain != ''
        AND t1.status != 'success'
    GROUP BY domain, status
    ORDER BY COUNT DESC
    LIMIT 30;

                    domain                 |       status        | count
    ---------------------------------------+---------------------+--------
     ssl.fao.org                           | no-pdf-link         | 862277
     www.e-periodica.ch                    | no-pdf-link         | 746781
     www.researchgate.net                  | redirect-loop       | 664524
     dlc.library.columbia.edu              | no-pdf-link         | 493111
     www.die-bonn.de                       | redirect-loop       | 352903
     figshare.com                          | no-pdf-link         | 319709
     statisticaldatasets.data-planet.com   | no-pdf-link         | 309584
     catalog.paradisec.org.au              | redirect-loop       | 225396
     zenodo.org                            | no-capture          | 193201
     digi.ub.uni-heidelberg.de             | no-pdf-link         | 184974
     open.library.ubc.ca                   | no-pdf-link         | 167841
     zenodo.org                            | no-pdf-link         | 130617
     www.google.com                        | no-pdf-link         | 111312
     www.e-manuscripta.ch                  | no-pdf-link         |  79192
     ds.iris.edu                           | no-pdf-link         |  77649
     data.inra.fr                          | no-pdf-link         |  69440
     www.tib.eu                            | no-pdf-link         |  63872
     www.egms.de                           | redirect-loop       |  53877
     archaeologydataservice.ac.uk          | redirect-loop       |  52838
     d.lib.msu.edu                         | no-pdf-link         |  45297
     www.e-rara.ch                         | no-pdf-link         |  45163
     springernature.figshare.com           | no-pdf-link         |  42527
     boris.unibe.ch                        | no-pdf-link         |  40816
     www.research-collection.ethz.ch       | no-capture          |  40350
     spectradspace.lib.imperial.ac.uk:8443 | no-pdf-link         |  33059
     repository.dri.ie                     | terminal-bad-status |  32760
     othes.univie.ac.at                    | no-pdf-link         |  32558
     repositories.lib.utexas.edu           | no-capture          |  31526
     posterng.netkey.at                    | no-pdf-link         |  30315
     zenodo.org                            | terminal-bad-status |  29614
    (30 rows)
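The `domain` column in the query above comes from `substring(... FROM '[^/]+://([^/]*)')`, which returns the first capture group: everything between `://` and the next slash. That is why a port number such as `:8443` stays attached to the host. The sketch below reproduces the extraction and grouping in Python on a few made-up rows; `extract_domain` and the sample URLs are illustrative assumptions, not part of sandcrawler.

```python
import re
from collections import Counter

# Same pattern as in the SQL: scheme, "://", then capture up to the next "/".
DOMAIN_RE = re.compile(r"[^/]+://([^/]*)")


def extract_domain(url):
    """Return the host[:port] part of a URL, or None if the pattern does not match."""
    match = DOMAIN_RE.search(url or "")
    return match.group(1) if match else None


# Made-up (terminal_url, status) rows for illustration.
rows = [
    ("https://zenodo.org/record/1", "no-capture"),
    ("https://zenodo.org/record/2", "no-capture"),
    ("https://spectradspace.lib.imperial.ac.uk:8443/handle/1", "no-pdf-link"),
    ("https://figshare.com/articles/1", "success"),
    (None, "no-capture"),
]

counts = Counter()
for url, status in rows:
    domain = extract_domain(url)
    # Mirrors the outer WHERE: skip empty or missing domains and successes.
    if not domain or status == "success":
        continue
    counts[(domain, status)] += 1

for (domain, status), count in counts.most_common():
    print(domain, status, count)
```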