Martin did another OAI-PMH bulk crawl, this time with the old JSON format: <https://archive.org/download/oai_harvest_20220921>
I updated the transform script to block some additional domains.
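
For context, the domain blocking in the transform script amounts to a hostname blocklist check on each candidate URL before a request is emitted. The sketch below only illustrates the pattern; the blocklist entries and the function name are assumptions, not the actual contents of the script.

    # Illustrative sketch of a hostname blocklist check (not copied from the
    # actual transform script; the domains listed are hypothetical examples).
    from urllib.parse import urlparse

    DOMAIN_BLOCKLIST = [
        "example-aggregator.org",
        "images.example.edu",
    ]

    def url_is_blocked(url: str) -> bool:
        """True if the URL's host is (or is a subdomain of) a blocked domain."""
        host = urlparse(url).netloc.split(":")[0].lower()
        return any(host == d or host.endswith("." + d) for d in DOMAIN_BLOCKLIST)
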
## Prep
Fetch the snapshot:

    cd /srv/sandcrawler/tasks/
    wget https://archive.org/download/oai_harvest_20220921/2022-09-21-oai-pmh-metadata-compat.jsonl.zst

Transform to ingest requests:

    cd /srv/sandcrawler/src/python
    git log | head -n1
    # commit dfd4605d84712eccb95a63e50b0bcb343642b433
    pipenv shell

    zstdcat /srv/sandcrawler/tasks/2022-09-21-oai-pmh-metadata-compat.jsonl.zst \
        | ./scripts/oai2ingestrequest.py - \
        | pv -l \
        | gzip \
        > /srv/sandcrawler/tasks/2022-09-21_oaipmh_ingestrequests.json.gz
    # 16.1M 1:01:02 [4.38k/s]

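Roughly, each harvested record becomes zero or more ingest requests, one per candidate URL. This is only a sketch of that shape, with the input field names and request fields being assumptions inferred from how the requests are queried later (`ingest_type = 'pdf'`, `link_source = 'oai'`), not a verbatim copy of `oai2ingestrequest.py`:

    # Hedged sketch of the transform: one JSON record per line on stdin, one
    # ingest request per candidate URL on stdout. Field names ("urls", "oai")
    # are assumptions for illustration.
    import json
    import sys

    def record_to_requests(record: dict):
        for url in record.get("urls") or []:
            yield {
                "base_url": url,
                "ingest_type": "pdf",
                "link_source": "oai",
                "link_source_id": record.get("oai"),
            }

    if __name__ == "__main__":
        for line in sys.stdin:
            for req in record_to_requests(json.loads(line)):
                print(json.dumps(req, sort_keys=True))
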
Curious about types, though this would probably be handled at fatcat ingest
time:

    zstdcat 2022-09-21-oai-pmh-metadata-compat.jsonl.zst | jq '.types[]' -r | sort | uniq -c | sort -nr > oai_type_counts.txt

    head -n30 oai_type_counts.txt

    5623867 info:eu-repo/semantics/article
    5334928 info:eu-repo/semantics/publishedVersion
    3870359 text
    1240225 Text
     829169 Article
     769849 NonPeerReviewed
     665700 PeerReviewed
     648740 Peer-reviewed Article
     547857 article
     482906 info:eu-repo/semantics/bachelorThesis
     353814 Thesis
     329269 Student thesis
     262650 info:eu-repo/semantics/conferenceObject
     185354 Journal articles
     162021 info:eu-repo/semantics/doctoralThesis
     152079 Journal Article
     150226 Research Article
     130217 Conference papers
     127255 Artículo revisado por pares
     124243 Newspaper
     123908 ##rt.metadata.pkp.peerReviewed##
     123309 Photograph
     122981 info:eu-repo/semantics/masterThesis
     116719 Book
     108946 Image
     108216 Report
     107946 Other
     103562 masterThesis
     103038 info:eu-repo/semantics/other
     101404 StillImage
     [...]

And formats:

    zstdcat 2022-09-21-oai-pmh-metadata-compat.jsonl.zst | jq '.formats[]' -r | sort | uniq -c | sort -nr > oai_format_counts.txt

    head -n 20 oai_format_counts.txt

    11151928 application/pdf
      677413 text
      561656 text/html
      498518 image/jpeg
      231219 Text
      193638 text/xml
      147214 Image
      117073 image/jpg
      110872 pdf
       91323 image/tiff
       76948 bib
       75393 application/xml
       70244 Digitized from 35 mm. microfilm.
       68206 mods
       59227 PDF
       57677 application/epub+zip
       57602 application/octet-stream
       52072 text/plain
       51620 application/msword
       47227 audio/mpeg

Also, just overall size (number of records):

    zstdcat 2022-09-21-oai-pmh-metadata-compat.jsonl.zst | wc -l
    # 20,840,301

Next, load into the sandcrawler DB:

    zcat /srv/sandcrawler/tasks/2022-09-21_oaipmh_ingestrequests.json.gz | pv -l | ./persist_tool.py ingest-request -

    Traceback (most recent call last):
      File "./persist_tool.py", line 311, in <module>
        main()
      File "./persist_tool.py", line 307, in main
        args.func(args)
      File "./persist_tool.py", line 119, in run_ingest_request
        pusher.run()
      File "/1/srv/sandcrawler/src/python/sandcrawler/workers.py", line 397, in run
        self.worker.push_batch(batch)
      File "/1/srv/sandcrawler/src/python/sandcrawler/persist.py", line 342, in push_batch
        resp = self.db.insert_ingest_request(self.cur, irequests)
      File "/1/srv/sandcrawler/src/python/sandcrawler/db.py", line 459, in insert_ingest_request
        resp = psycopg2.extras.execute_values(cur, sql, rows, page_size=250, fetch=True)
      File "/1/srv/sandcrawler/src/python/.venv/lib/python3.8/site-packages/psycopg2/extras.py", line 1270, in execute_values
        cur.execute(b''.join(parts))
    psycopg2.errors.ProgramLimitExceeded: index row size 3400 exceeds btree version 4 maximum 2704 for index "ingest_request_base_url_idx"
    DETAIL:  Index row references tuple (6893121,3) in relation "ingest_request".
    HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
    Consider a function index of an MD5 hash of the value, or use full text indexing.

    # 15.7M 0:41:48 [6.27k/s]

Darn, this means we won't get reasonable stats about how many rows were
inserted/updated.
Patched the persist tool to skip very long URLs, then ran again backwards through the file (with `tac`), covering just the URLs which hadn't been inserted yet:

    zcat /srv/sandcrawler/tasks/2022-09-21_oaipmh_ingestrequests.json.gz \
        | tac \
        | head -n1000000 \
        | pv -l \
        | ./persist_tool.py ingest-request -
    # 1.00M 0:03:04 [5.41k/s]
    # Worker: Counter({'total': 1000000, 'insert-requests': 124701, 'skip-url-too-long': 1, 'update-requests': 0})

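The patch is essentially a length guard in front of the batch insert, so over-long URLs get counted and dropped instead of aborting the whole run. A minimal sketch, with the cutoff value being an assumption (anything comfortably under the ~2704-byte btree limit would do):

    # Sketch of the "skip-url-too-long" guard (counter name taken from the
    # worker output above; the 2048-byte cutoff is an assumed value).
    MAX_URL_LEN = 2048

    def filter_long_urls(irequests: list, counts: dict) -> list:
        kept = []
        for req in irequests:
            if len(req.get("base_url") or "") > MAX_URL_LEN:
                counts["skip-url-too-long"] = counts.get("skip-url-too-long", 0) + 1
                continue
            kept.append(req)
        return kept
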
Status of just the new lines:

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
        AND date(ingest_request.created) > '2022-09-01'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status          |  count
    -------------------------+---------
                             | 6398455
     success                 |  540219
     no-pdf-link             |   41316
     link-loop               |   23871
     no-capture              |   11350
     redirect-loop           |    8315
     wrong-mimetype          |    2394
     terminal-bad-status     |    1540
     null-body               |    1038
     cdx-error               |     272
     empty-blob              |     237
     petabox-error           |     213
     wayback-error           |     186
     blocked-cookie          |     107
     timeout                 |      47
     wayback-content-error   |      26
     spn2-cdx-lookup-failure |      21
     skip-url-blocklist      |      16
     spn2-backoff            |      15
     body-too-large          |      13
    (20 rows)

## Bulk Ingest
Domains/prefixes should already have been filtered in the transform script, so no additional filters are included here.

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
            AND date(ingest_request.created) > '2022-09-01'
            AND ingest_file_result.status IS NULL
    ) TO '/srv/sandcrawler/tasks/oai_noingest_20220921.rows.json';
    # COPY 6398455

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/oai_noingest_20220921.rows.json \
        | pv -l \
        | shuf \
        > /srv/sandcrawler/tasks/oai_noingest_20220921.ingest_request.json
    # 6.40M 0:02:18 [46.2k/s]

    cat /srv/sandcrawler/tasks/oai_noingest_20220921.ingest_request.json \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # DONE

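For reference, the same push could be done from Python with `confluent-kafka` instead of kafkacat. This is only a sketch mirroring the command above (broker and topic copied from it), not what was actually run:

    # Sketch: push the ingest requests to the bulk-ingest topic with
    # confluent-kafka. Mirrors the kafkacat command above; not actually used.
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "wbgrp-svc350.us.archive.org"})
    topic = "sandcrawler-prod.ingest-file-requests-bulk"

    with open("/srv/sandcrawler/tasks/oai_noingest_20220921.ingest_request.json") as f:
        for line in f:
            line = line.strip()
            # same filter as `rg -v "\\\\"` above: skip lines containing backslashes
            if not line or "\\" in line:
                continue
            producer.produce(topic, line.encode("utf-8"))
            producer.poll(0)  # serve delivery callbacks, avoid filling the local queue
    producer.flush()
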
Expect this ingest to take a week or so.
Then, run stats again:

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
        AND date(ingest_request.created) > '2022-09-01'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status          |  count
    -------------------------+---------
     no-capture              | 3617175
     success                 | 2775036
     no-pdf-link             |  449298
     link-loop               |   74260
     terminal-bad-status     |   47819
     wrong-mimetype          |   20195
     redirect-loop           |   18197
     empty-blob              |   12127
     cdx-error               |    3038
     skip-url-blocklist      |    2630
     wayback-error           |    2599
     petabox-error           |    2354
     wayback-content-error   |    1617
     blocked-cookie          |    1293
     null-body               |    1038
     body-too-large          |     670
                             |     143
     bad-gzip-encoding       |      64
     timeout                 |      47
     spn2-cdx-lookup-failure |      20
    (20 rows)

## Crawl Seedlist

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
            AND date(ingest_request.created) > '2022-09-01'
            AND (
                ingest_file_result.status = 'no-capture'
                OR ingest_file_result.status = 'redirect-loop'
                OR ingest_file_result.status = 'terminal-bad-status'
                OR ingest_file_result.status = 'cdx-error'
                OR ingest_file_result.status = 'petabox-error'
                OR ingest_file_result.status = 'wayback-error'
                OR ingest_file_result.status = 'timeout'
                OR ingest_file_result.status = 'wayback-content-error'
            )
    ) TO '/srv/sandcrawler/tasks/oai_nocapture_20220921.rows.json';
    => COPY 3692846

    ./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/oai_nocapture_20220921.rows.json \
        | pv -l \
        | shuf \
        > /srv/sandcrawler/tasks/oai_nocapture_20220921.ingest_request.json
    => 3.69M 0:01:19 [46.6k/s]

This will be used for re-ingest later. For now, extract URLs:

    cat /srv/sandcrawler/tasks/oai_nocapture_20220921.rows.json \
        | jq .base_url -r \
        | sort -u -S 4G \
        | pv -l \
        > /srv/sandcrawler/tasks/oai_nocapture_20220921.base_url.txt
    => 3.66M 0:00:59 [61.8k/s]

    cat /srv/sandcrawler/tasks/oai_nocapture_20220921.rows.json \
        | rg '"terminal_url"' \
        | jq -r .result.terminal_url \
        | rg -v ^null$ \
        | sort -u -S 4G \
        | pv -l \
        > /srv/sandcrawler/tasks/oai_nocapture_20220921.terminal_url.txt
    => 0.00 0:00:05 [0.00 /s]

    cat /srv/sandcrawler/tasks/oai_nocapture_20220921.base_url.txt /srv/sandcrawler/tasks/oai_nocapture_20220921.terminal_url.txt \
        | awk '{print "F+ " $1}' \
        | shuf \
        > /srv/sandcrawler/tasks/oai_nocapture_20220921.schedule

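The `F+ ` prefix marks each URI as a force-queue entry in the frontier-import format the Heritrix crawler accepts for these schedules. For illustration, a Python equivalent of the awk/shuf step above (not actually used):

    # Sketch: Python equivalent of the awk/shuf pipeline above, writing one
    # "F+ <url>" line per URI. Paths match the shell commands; not actually used.
    import random

    inputs = [
        "/srv/sandcrawler/tasks/oai_nocapture_20220921.base_url.txt",
        "/srv/sandcrawler/tasks/oai_nocapture_20220921.terminal_url.txt",
    ]

    lines = []
    for path in inputs:
        with open(path) as f:
            lines.extend("F+ " + url.strip() for url in f if url.strip())

    random.shuffle(lines)  # shuffle so no single host dominates the queue front
    with open("/srv/sandcrawler/tasks/oai_nocapture_20220921.schedule", "w") as out:
        out.write("".join(l + "\n" for l in lines))
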
What domains are we crawling?

    cat /srv/sandcrawler/tasks/oai_nocapture_20220921.base_url.txt /srv/sandcrawler/tasks/oai_nocapture_20220921.terminal_url.txt \
        | sort -u -S 4G \
        | cut -d/ -f3 \
        | sort \
        | uniq -c \
        | sort -nr \
        > /srv/sandcrawler/tasks/oai_nocapture_20220921.domains.txt

    head -n20 /srv/sandcrawler/tasks/oai_nocapture_20220921.domains.txt

    91899 raco.cat
    70116 islandora.wrlc.org
    68708 urn.kb.se
    63726 citeseerx.ist.psu.edu
    50370 publications.rwth-aachen.de
    44885 urn.nsk.hr
    38429 server15795.contentdm.oclc.org
    33041 periodicos.ufpb.br
    32519 nbn-resolving.org
    31990 www.ajol.info
    24745 hal.archives-ouvertes.fr
    22569 id.nii.ac.jp
    17239 tilburguniversity.on.worldcat.org
    15873 dspace.nbuv.gov.ua
    15436 digitalcommons.wustl.edu
    14885 www.iiste.org
    14623 www.manchester.ac.uk
    14033 nbn-resolving.de
    13999 opus4.kobv.de
    13689 www.redalyc.org

Sizes:

    wc -l /srv/sandcrawler/tasks/oai_nocapture_20220921.base_url.txt /srv/sandcrawler/tasks/oai_nocapture_20220921.terminal_url.txt /srv/sandcrawler/tasks/oai_nocapture_20220921.schedule

    3662864 /srv/sandcrawler/tasks/oai_nocapture_20220921.base_url.txt
          0 /srv/sandcrawler/tasks/oai_nocapture_20220921.terminal_url.txt
    3662864 /srv/sandcrawler/tasks/oai_nocapture_20220921.schedule

Copy seedlist to crawler:

    # as regular user
    scp /srv/sandcrawler/tasks/oai_nocapture_20220921.schedule wbgrp-svc206.us.archive.org:/tmp