Could roll this into the current patch crawl instead of starting a new crawl from scratch.
This file is misnamed; these are mostly non-DOI-specific small updates.
## KBART "almost complete" experimentation
Random 10 releases:

    cat missing_releases.json | shuf -n10 | jq .ident -r | awk '{print "https://fatcat.wiki/release/" $1}'

    https://fatcat.wiki/release/suggmo4fnfaave64frttaqqoja - domain gone
    https://fatcat.wiki/release/uw2dq2p3mzgolk4alze2smv7bi - DOAJ, then OJS PDF link. sandcrawler failed, fixed
    https://fatcat.wiki/release/fjamhzxxdndq5dcariobxvxu3u - OJS; sandcrawler fix works
    https://fatcat.wiki/release/z3ubnko5ifcnbhhlegc24kya2u - OJS; sandcrawler failed, fixed (separate pattern)
    https://fatcat.wiki/release/pysc3w2cdbehvffbyca4aqex3i - DOAJ, OJS bilingual, failed with 'redirect-loop'. force re-crawl worked for one copy
    https://fatcat.wiki/release/am2m5agvjrbvnkstke3o3xtney - not attempted previously (?), success
    https://fatcat.wiki/release/4zer6m56zvh6fd3ukpypdu7ita - cover page of journal (not an article). via crossref
    https://fatcat.wiki/release/6njc4rdaifbg5jye3bbfdhkbsu - OJS; success
    https://fatcat.wiki/release/jnmip3z7xjfsdfeex4piveshvu - OJS; not crawled previously; success
    https://fatcat.wiki/release/wjxxcknnpjgtnpbzhzge6rkndi - no-pdf-link, fixed

Try some more!

    https://fatcat.wiki/release/ywidvbhtfbettmfj7giu2htbdm - not attempted, success
    https://fatcat.wiki/release/ou2kqv5k3rbk7iowfohpitelfa - OJS, not attempted, success?
    https://fatcat.wiki/release/gv2glplmofeqrlrvfs524v5qa4 - scirp.org; 'redirect-loop'; HTML/PDF/XML all available; then 'gateway-timeout' on retry
    https://fatcat.wiki/release/5r5wruxyyrf6jneorux3negwpe - gavinpublishers.com; broken site
    https://fatcat.wiki/release/qk4atst6svg4hb73jdwacjcacu - horyzonty.ignatianum.edu.pl; broken DOI
    https://fatcat.wiki/release/mp5ec3ycrjauxeve4n4weq7kqm - old cert; OJS; success
    https://fatcat.wiki/release/sqnovcsmizckjdlwg3hipxrfqm - not attempted, success
    https://fatcat.wiki/release/42ruewjuvbblxgnek6fpj5lp5m - OJS URL, but domain broken
    https://fatcat.wiki/release/crg6aiypx5enveldvmwy5judp4 - volume/cover (stub)
    https://fatcat.wiki/release/jzih3vvxj5ctxk3tbzyn5kokha - success

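
For repeatable spot checks, the `shuf | jq | awk` sampling one-liner from this section could be sketched in Python roughly as below. This is only a sketch: it assumes `missing_releases.json` holds one JSON object per line with an `ident` field (as the `jq .ident` call suggests), and the function name `sample_release_urls` is made up for illustration.

```python
import json
import random
from typing import Iterable, List, Optional


def sample_release_urls(lines: Iterable[str], n: int = 10,
                        seed: Optional[int] = None) -> List[str]:
    """Pick up to n random release idents from JSON lines and build fatcat URLs."""
    # Skip blank lines; every other line is assumed to be a JSON object with "ident".
    idents = [json.loads(line)["ident"] for line in lines if line.strip()]
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    picked = rng.sample(idents, min(n, len(idents)))
    return ["https://fatcat.wiki/release/" + ident for ident in picked]


# Tiny in-memory example; real input would be the lines of missing_releases.json.
example_lines = [
    '{"ident": "suggmo4fnfaave64frttaqqoja"}',
    '{"ident": "uw2dq2p3mzgolk4alze2smv7bi"}',
]
for url in sample_release_urls(example_lines, n=1, seed=0):
    print(url)
```

Passing a seed is the main difference from `shuf`: the same sample can be regenerated later when re-checking the releases.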
## Seeds: fixed OJS URLs
Made some recent changes to sandcrawler; should re-attempt OJS URLs, particularly from DOI or DOAJ, with patterns like:

- `no-pdf-link` with terminal URL like `/article/view/`
- `redirect-loop` with terminal URL like `/article/view/`

Bulk re-ingest of the `no-pdf-link` cases:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_file_result.status = 'no-pdf-link'
            AND (
                ingest_file_result.terminal_url LIKE '%/article/view/%'
                OR ingest_file_result.terminal_url LIKE '%/article/download/%'
            )
            AND (
                ingest_request.link_source = 'doi'
                OR ingest_request.link_source = 'doaj'
                OR ingest_request.link_source = 'unpaywall'
            )
    ) TO '/srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.rows.json';
    => COPY 326577

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.rows.json > /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.json

    cat /srv/sandcrawler/tasks/retry_ojs_nopdflink.2022-01-13.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Done/running.
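
The selection logic of the `no-pdf-link` query above can be written down as a small predicate, which is handy for checking individual rows by hand. A minimal sketch, assuming the caller passes the four column values as plain strings; the function and constant names are hypothetical and not part of sandcrawler.

```python
from typing import Optional

# Path fragments typical of OJS article pages, as used in the SQL LIKE clauses.
OJS_PATH_MARKERS = ("/article/view/", "/article/download/")
# Link sources included in the retry.
RETRY_LINK_SOURCES = {"doi", "doaj", "unpaywall"}


def is_ojs_nopdflink_retry(ingest_type: str, status: Optional[str],
                           terminal_url: Optional[str], link_source: str) -> bool:
    """Return True if a request/result pair matches the retry query's WHERE clause."""
    if ingest_type != "pdf" or status != "no-pdf-link":
        return False
    # A missing terminal URL (NULL in SQL) never matches a LIKE pattern.
    if not terminal_url:
        return False
    if not any(marker in terminal_url for marker in OJS_PATH_MARKERS):
        return False
    return link_source in RETRY_LINK_SOURCES


print(is_ojs_nopdflink_retry(
    "pdf", "no-pdf-link", "https://journal.example.org/index.php/j/article/view/123", "doaj"))
```

Like the SQL `LIKE`, the substring checks are case-sensitive.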

    COPY (
        SELECT ingest_file_result.terminal_url
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND (
                ingest_file_result.status = 'redirect-loop'
                OR ingest_file_result.status = 'link-loop'
            )
            AND (
                ingest_file_result.terminal_url LIKE '%/article/view/%'
                OR ingest_file_result.terminal_url LIKE '%/article/download/%'
            )
    ) TO '/srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.txt';
    => COPY 342415

    cat /srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.txt | awk '{print "F+ " $1}' > /srv/sandcrawler/tasks/retry_ojs_loop.2022-01-13.schedule

Done/seeded.
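
The `awk` step that turns terminal URLs into `F+ <url>` schedule lines could also be done in Python, which makes it easy to drop blank lines and repeated URLs along the way. A rough sketch; the de-duplication is an addition that the `awk` command does not perform, and `to_schedule_lines` is a made-up name.

```python
from typing import Iterable, Iterator


def to_schedule_lines(urls: Iterable[str]) -> Iterator[str]:
    """Yield one 'F+ <url>' line per distinct URL, in first-seen order."""
    seen = set()
    for raw in urls:
        fields = raw.split()
        if not fields:
            continue  # blank line: awk would emit a bare "F+ ", here it is skipped
        url = fields[0]  # like awk's $1: first whitespace-separated field
        if url in seen:
            continue  # assumption: re-seeding the same URL twice is not useful
        seen.add(url)
        yield "F+ " + url


for line in to_schedule_lines([
    "https://journal.example.org/index.php/j/article/view/1\n",
    "https://journal.example.org/index.php/j/article/view/1\n",
]):
    print(line)
```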
## Seeds: scitemed.com
Batch retry sandcrawler `no-pdf-link` with terminal URL like: `scitemed.com/article`

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_file_result.status = 'no-pdf-link'
            AND ingest_file_result.terminal_url LIKE '%scitemed.com/article%'
            AND (
                ingest_request.link_source = 'doi'
                OR ingest_request.link_source = 'doaj'
                OR ingest_request.link_source = 'unpaywall'
            )
    ) TO '/srv/sandcrawler/tasks/retry_scitemed.2022-01-13.rows.json';
    # SKIPPED

Actually there are very few of these.

## Seeds: non-OA paper DOIs
There are many DOIs out there which are likely to be from small publishers, on
the web, and would ingest just fine (eg, in OJS).

    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' --count
    30,938,106

    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' 'preservation:none' --count
    6,664,347

    fatcat-cli search release in_ia:false is_oa:false 'doi:*' release_type:article-journal 'container_id:*' '!publisher_type:big5' 'in_kbart:false' --count
    8,258,111

Do the 8 million first, then maybe try the 30.9 million later? Do sampling to
see how many are actually accessible? From experience with KBART generation,
many of these are likely to crawl successfully.

    ./fatcat_ingest.py --ingest-type pdf --allow-non-oa query 'in_ia:false is_oa:false doi:* release_type:article-journal container_id:* !publisher_type:big5 in_kbart:false' \
        | pv -l \
        | gzip \
        > /srv/fatcat/tasks/ingest_nonoa_doi.json.gz
    # re-running 2022-02-08 after this VM was upgraded
    # Expecting 8321448 release objects in search queries
    # DONE

This is large enough that it will probably be a bulk ingest, and then probably
a follow-up crawl.
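
For the sampling idea mentioned above (checking how many of these DOIs are actually accessible before committing to the full set), one option is a reservoir sample over the request file plus a rough error margin on the observed success rate. This is a sketch under assumptions: how each sampled request gets tested is left open, the margin uses the plain normal approximation, and both function names are hypothetical.

```python
import math
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Uniformly sample k items from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item  # replace with probability k / (i + 1)
    return sample


def estimate_rate(successes: int, n: int) -> Tuple[float, float]:
    """Return (observed rate, approximate 95% margin) for n sampled requests."""
    p = successes / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, margin


picked = reservoir_sample(range(8_258_111), 5, seed=1)
print(picked)
print(estimate_rate(50, 100))
```

A few hundred sampled requests are usually enough to tell whether the success rate is closer to 10% or 50%, which is the decision that matters here.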
## Seeds: HTML and XML links from HTML biblio

    kafkacat -C -b wbgrp-svc284.us.archive.org:9092 -t sandcrawler-prod.ingest-file-results -e \
        | pv -l \
        | rg '"(html|xml)_fulltext_url"' \
        | rg '"no-pdf-link"' \
        | gzip \
        > ingest_file_result_fulltext_urls.2022-01-13.json.gz

    # cut this off at some point? gzip is terminated weird

    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz | wc -l
    # gzip: ingest_file_result_fulltext_urls.2022-01-13.json.gz: unexpected end of file
    # 2,538,433

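
Since the dump above ends without a proper gzip trailer, any Python tooling that reads it needs to tolerate the truncation. Python's `gzip` module raises `EOFError` when the stream ends early, so a line counter could catch that and report what it managed to read. A minimal sketch; `count_lines_tolerant` is a made-up helper, and the count is not guaranteed to match `zcat | wc -l` exactly.

```python
import gzip
from typing import Tuple


def count_lines_tolerant(path: str) -> Tuple[int, bool]:
    """Count lines in a gzip file; return (count, truncated)."""
    count = 0
    truncated = False
    try:
        with gzip.open(path, "rb") as f:
            for _ in f:
                count += 1
    except EOFError:
        # Raised when the compressed stream ends before its end-of-stream marker,
        # e.g. because the writing pipeline was interrupted.
        truncated = True
    return count, truncated


# Usage (path taken from the notes above):
# print(count_lines_tolerant("ingest_file_result_fulltext_urls.2022-01-13.json.gz"))
```

Only complete lines that decompress before the cut-off are counted; a partial final line is dropped.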
Prepare seedlists (to include in heritrix patch crawl):

    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz \
        | jq .html_biblio.xml_fulltext_url -r \
        | rg '://' \
        | sort -u -S 4G \
        | pv -l \
        | gzip \
        > ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz
    # 1.24M 0:01:35 [12.9k/s]

    zcat ingest_file_result_fulltext_urls.2022-01-13.json.gz \
        | jq .html_biblio.html_fulltext_url -r \
        | rg '://' \
        | sort -u -S 4G \
        | pv -l \
        | gzip \
        > ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz
    # 549k 0:01:27 [6.31k/s]


    zcat ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz \
        | cut -f3 -d/ \
        | sort -S 4G \
        | uniq -c \
        | sort -nr \
        | head -n20

    534005 dlc.library.columbia.edu
    355319 www.degruyter.com
    196421 zenodo.org
    101450 serval.unil.ch
    100631 biblio.ugent.be
     47986 digi.ub.uni-heidelberg.de
     39187 www.emerald.com
     33195 www.cairn.info
     25703 boris.unibe.ch
     19516 journals.openedition.org
     15911 academic.oup.com
     11091 repository.dl.itc.u-tokyo.ac.jp
      9847 oxfordworldsclassics.com
      9698 www.thieme-connect.de
      9552 www.idunn.no
      9265 www.zora.uzh.ch
      8030 www.scielo.br
      6543 www.hanspub.org
      6229 asmedigitalcollection.asme.org
      5651 brill.com

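
The per-domain tally above relies on `cut -f3 -d/`, which takes whatever sits between the second and third slash. A Python version using `urllib.parse` does the same job and skips malformed URLs explicitly. A small sketch; `top_hosts` is a hypothetical name, and the URLs in the example are invented.

```python
from collections import Counter
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def top_hosts(urls: Iterable[str], n: int = 20) -> List[Tuple[str, int]]:
    """Count URLs per host and return the n most common (host, count) pairs."""
    counts: Counter = Counter()
    for url in urls:
        try:
            host = urlsplit(url.strip()).netloc
        except ValueError:
            continue  # e.g. a malformed IPv6 literal in the URL
        if host:
            counts[host] += 1
    return counts.most_common(n)


for host, count in top_hosts([
    "https://zenodo.org/record/1/files/a.xml",
    "https://zenodo.org/record/2/files/b.xml",
    "https://brill.com/view/journals/x/1/1/article-p1.xml",
]):
    print(f"{count:7d} {host}")
```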

    zcat ingest_file_result_fulltext_urls.2022-01-13.xml_urls.txt.gz ingest_file_result_fulltext_urls.2022-01-13.html_urls.txt.gz \
        | awk '{print "F+ " $1}' \
        > ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule

    wc -l ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule
    1785901 ingest_file_result_fulltext_urls.2022-01-13.xml_and_html.schedule

Added to `JOURNALS-PATCH-CRAWL-2022-01`
## Seeds: most doi.org terminal non-success
Unless it is a 404, should retry.
TODO: generate this list
## Non-OA DOI Bulk Ingest
Had previously run:

    zcat ingest_nonoa_doi.json.gz \
        | rg -v "doi.org/10.2139/" \
        | rg -v "doi.org/10.1021/" \
        | rg -v "doi.org/10.1121/" \
        | rg -v "doi.org/10.1515/" \
        | rg -v "doi.org/10.1093/" \
        | rg -v "europepmc.org" \
        | pv -l \
        | gzip \
        > nonoa_doi.filtered.ingests.json.gz
    # 7.35M 0:01:13 [99.8k/s]

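
The chain of `rg -v` filters above amounts to a substring blocklist, which could be kept in one place if the list grows. A minimal sketch, assuming literal substring matching is good enough (the `rg` patterns treat `.` as a wildcard, so this version is slightly stricter); the names are hypothetical.

```python
from typing import Iterable, Iterator

# DOI prefixes and hosts excluded from the bulk ingest, copied from the rg filters.
EXCLUDE_SUBSTRINGS = (
    "doi.org/10.2139/",
    "doi.org/10.1021/",
    "doi.org/10.1121/",
    "doi.org/10.1515/",
    "doi.org/10.1093/",
    "europepmc.org",
)


def filter_requests(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the request lines that contain none of the excluded substrings."""
    for line in lines:
        if not any(fragment in line for fragment in EXCLUDE_SUBSTRINGS):
            yield line


kept = list(filter_requests([
    '{"base_url": "https://doi.org/10.1093/example"}',
    '{"base_url": "https://doi.org/10.5555/example"}',
]))
print(kept)
```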
Starting a bulk ingest of these on 2022-03-18, which is *before* the crawl has
entirely finished, but after almost all queues (domains) have been done for
several days.

    zcat nonoa_doi.filtered.ingests.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

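
The two clean-up steps in front of `kafkacat` in the command above (drop any line containing a backslash, then re-serialize as compact JSON) can be sketched in Python as follows. This only covers the filtering; publishing to Kafka is left to `kafkacat` as before, and `prepare_requests` is a made-up name.

```python
import json
from typing import Iterable, Iterator


def prepare_requests(lines: Iterable[str]) -> Iterator[str]:
    """Skip lines with backslashes, then emit each request as compact one-line JSON."""
    for line in lines:
        if "\\" in line:
            continue  # same effect as rg -v "\\\\": drop lines with escape sequences
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        # ensure_ascii=False keeps non-ASCII text as UTF-8 instead of \uXXXX escapes.
        yield json.dumps(obj, separators=(",", ":"), ensure_ascii=False)


for out in prepare_requests(['{"ingest_type": "pdf", "base_url": "https://doi.org/10.5555/x"}']):
    print(out)
```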
Looks like many jstage `no-capture` status; these are still (slowly) crawling.