1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
|
plan:
- usual setup and dump ingest requests
- filter ingest requests to targetted ccTLDs, and add those to crawl first
## Transform and Load
# on sandcrawler-vm
mkdir -p /srv/sandcrawler/tasks/doaj
cd /srv/sandcrawler/tasks/doaj
wget 'https://archive.org/download/doaj_data_2020-11-13/doaj_article_data_2022-03-07_all.json.gz'
# in pipenv, in python directory
zcat /srv/sandcrawler/tasks/doaj/doaj_article_data_2022-03-07_all.json.gz | ./scripts/doaj2ingestrequest.py - | pv -l | gzip > /srv/sandcrawler/tasks/doaj/doaj_article_data_2022-03-07_all.ingest_request.json.gz
# 9.08M 0:37:38 [4.02k/s]
zcat /srv/sandcrawler/tasks/doaj/doaj_article_data_2022-03-07_all.ingest_request.json.gz | pv -l | ./persist_tool.py ingest-request -
# Worker: Counter({'total': 9082373, 'insert-requests': 2982535, 'update-requests': 0})
# JSON lines pushed: Counter({'total': 9082373, 'pushed': 9082373})
## Check Pre-Crawl Status
2022-03-09, before the above load:
SELECT ingest_request.ingest_type, ingest_file_result.status, COUNT(*)
FROM ingest_request
LEFT JOIN ingest_file_result
ON ingest_file_result.ingest_type = ingest_request.ingest_type
AND ingest_file_result.base_url = ingest_request.base_url
WHERE
ingest_request.link_source = 'doaj'
GROUP BY ingest_request.ingest_type, status
-- next time include ingest_type in sort
ORDER BY COUNT DESC
LIMIT 30;
ingest_type | status | count
-------------+--------------------------+---------
pdf | success | 2919808
html | wrong-scope | 1098998
pdf | no-pdf-link | 481532
pdf | redirect-loop | 429006
html | success | 342501
html | unknown-scope | 225390
html | redirect-loop | 223927
html | html-resource-no-capture | 187762
html | no-capture | 185418
pdf | no-capture | 171273
pdf | null-body | 129028
html | null-body | 100296
pdf | terminal-bad-status | 91551
pdf | link-loop | 25447
html | wrong-mimetype | 22640
html | wayback-content-error | 19028
html | terminal-bad-status | 13327
pdf | wrong-mimetype | 7688
xml | success | 6897
html | petabox-error | 5529
pdf | wayback-error | 2706
xml | null-body | 2353
pdf | | 2063
pdf | wayback-content-error | 1349
html | cdx-error | 1169
pdf | cdx-error | 1130
pdf | petabox-error | 679
html | | 620
pdf | empty-blob | 562
html | blocked-cookie | 545
(30 rows)
After the above load:
ingest_type | status | count
-------------+--------------------------+---------
pdf | success | 3036457
pdf | | 1623208
html | | 1208412
html | wrong-scope | 1108132
pdf | no-pdf-link | 485703
pdf | redirect-loop | 436085
html | success | 342594
html | unknown-scope | 225412
html | redirect-loop | 223927
html | html-resource-no-capture | 187999
html | no-capture | 187310
pdf | no-capture | 172033
pdf | null-body | 129266
html | null-body | 100296
pdf | terminal-bad-status | 91799
pdf | link-loop | 26933
html | wrong-mimetype | 22643
html | wayback-content-error | 19028
html | terminal-bad-status | 13327
xml | | 11196
pdf | wrong-mimetype | 7929
xml | success | 6897
html | petabox-error | 5530
pdf | wayback-error | 2707
xml | null-body | 2353
pdf | wayback-content-error | 1353
pdf | cdx-error | 1177
html | cdx-error | 1172
pdf | petabox-error | 771
pdf | empty-blob | 562
(30 rows)
Dump ingest requests for crawling (or bulk ingest first?):
COPY (
SELECT row_to_json(t1.*)
FROM (
SELECT ingest_request.*, ingest_file_result as result
FROM ingest_request
LEFT JOIN ingest_file_result
ON ingest_file_result.base_url = ingest_request.base_url
AND ingest_file_result.ingest_type = ingest_request.ingest_type
WHERE
ingest_request.link_source = 'doaj'
-- AND (ingest_request.ingest_type = 'pdf'
-- OR ingest_request.ingest_type = 'xml')
AND (
ingest_file_result.status IS NULL
OR ingest_file_result.status = 'no-capture'
)
AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
AND ingest_request.base_url NOT LIKE '%://archive.org/%'
AND ingest_request.base_url NOT LIKE '%://web.archive.org/%'
AND ingest_request.base_url NOT LIKE '%://www.archive.org/%'
AND ingest_file_result.terminal_url NOT LIKE '%journals.sagepub.com%'
AND ingest_file_result.terminal_url NOT LIKE '%pubs.acs.org%'
AND ingest_file_result.terminal_url NOT LIKE '%ahajournals.org%'
AND ingest_file_result.terminal_url NOT LIKE '%www.journal.csj.jp%'
AND ingest_file_result.terminal_url NOT LIKE '%aip.scitation.org%'
AND ingest_file_result.terminal_url NOT LIKE '%academic.oup.com%'
AND ingest_file_result.terminal_url NOT LIKE '%tandfonline.com%'
AND ingest_file_result.terminal_url NOT LIKE '%://archive.org/%'
AND ingest_file_result.terminal_url NOT LIKE '%://web.archive.org/%'
AND ingest_file_result.terminal_url NOT LIKE '%://www.archive.org/%'
) t1
) TO '/srv/sandcrawler/tasks/doaj_seedlist_2022-03-09.rows.json';
=> COPY 353819
Not that many! Guess the filters are important?
SELECT COUNT(*)
FROM ingest_request
LEFT JOIN ingest_file_result
ON ingest_file_result.base_url = ingest_request.base_url
AND ingest_file_result.ingest_type = ingest_request.ingest_type
WHERE
ingest_request.link_source = 'doaj'
-- AND (ingest_request.ingest_type = 'pdf'
-- OR ingest_request.ingest_type = 'xml')
AND (
ingest_file_result.status IS NULL
OR ingest_file_result.status = 'no-capture'
);
=> 3202164
Transform:
./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/doaj_seedlist_2022-03-09.rows.json | pv -l | shuf > /srv/sandcrawler/tasks/doaj_seedlist_2022-03-09.requests.json
=> 353k 0:00:16 [21.0k/s]
Bulk ingest:
cat /srv/sandcrawler/tasks/doaj_seedlist_2022-03-09.requests.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
Dump seeds again (for crawling):
COPY (
SELECT row_to_json(t1.*)
FROM (
SELECT ingest_request.*, ingest_file_result as result
FROM ingest_request
LEFT JOIN ingest_file_result
ON ingest_file_result.base_url = ingest_request.base_url
AND ingest_file_result.ingest_type = ingest_request.ingest_type
WHERE
ingest_request.link_source = 'doaj'
-- AND (ingest_request.ingest_type = 'pdf'
-- OR ingest_request.ingest_type = 'xml')
AND (
ingest_file_result.status IS NULL
OR ingest_file_result.status = 'no-capture'
)
AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
AND ingest_request.base_url NOT LIKE '%://archive.org/%'
AND ingest_request.base_url NOT LIKE '%://web.archive.org/%'
AND ingest_request.base_url NOT LIKE '%://www.archive.org/%'
AND ingest_file_result.terminal_url NOT LIKE '%journals.sagepub.com%'
AND ingest_file_result.terminal_url NOT LIKE '%pubs.acs.org%'
AND ingest_file_result.terminal_url NOT LIKE '%ahajournals.org%'
AND ingest_file_result.terminal_url NOT LIKE '%www.journal.csj.jp%'
AND ingest_file_result.terminal_url NOT LIKE '%aip.scitation.org%'
AND ingest_file_result.terminal_url NOT LIKE '%academic.oup.com%'
AND ingest_file_result.terminal_url NOT LIKE '%tandfonline.com%'
AND ingest_file_result.terminal_url NOT LIKE '%://archive.org/%'
AND ingest_file_result.terminal_url NOT LIKE '%://web.archive.org/%'
AND ingest_file_result.terminal_url NOT LIKE '%://www.archive.org/%'
) t1
) TO '/srv/sandcrawler/tasks/doaj_seedlist_2022-03-10.rows.json';
# COPY 350661
And stats again:
ingest_type | status | count
-------------+--------------------------+---------
pdf | success | 3037059
pdf | | 1623208
html | | 1208412
html | wrong-scope | 1108476
pdf | no-pdf-link | 485705
pdf | redirect-loop | 436850
html | success | 342762
html | unknown-scope | 225412
html | redirect-loop | 224683
html | html-resource-no-capture | 188058
html | no-capture | 185734
pdf | no-capture | 170452
pdf | null-body | 129266
html | null-body | 100296
pdf | terminal-bad-status | 91875
pdf | link-loop | 26933
html | wrong-mimetype | 22643
html | wayback-content-error | 19042
html | terminal-bad-status | 13333
xml | | 11196
pdf | wrong-mimetype | 7929
xml | success | 6898
html | petabox-error | 5535
pdf | wayback-error | 2711
xml | null-body | 2353
pdf | wayback-content-error | 1353
pdf | cdx-error | 1177
html | cdx-error | 1172
pdf | petabox-error | 772
html | blocked-cookie | 769
(30 rows)
Transform:
./scripts/ingestrequest_row2json.py /srv/sandcrawler/tasks/doaj_seedlist_2022-03-10.rows.json | pv -l | shuf > /srv/sandcrawler/tasks/doaj_seedlist_2022-03-10.requests.json
Create seedlist:
cat /srv/sandcrawler/tasks/doaj_seedlist_2022-03-10.requests.json \
| jq -r .base_url \
| sort -u -S 4G \
> /srv/sandcrawler/tasks/doaj_seedlist_2022-03-10.txt
Send off an added to `TARGETED-ARTICLE-CRAWL-2022-03` heritrix crawl, will
re-ingest when that completes (a week or two?).
|