Want to do priority crawling of Ukrainian web content, plus Russia and Belarus.
## What is Missing?
(country_code:ua OR lang:uk)
=> 2022-03-08, before ingests: 470,986 total, 170,987 missing, almost all article-journal, peak in 2019, 55k explicitly OA
later in day, already some 22k missing found! wow
=> 2022-04-04, after ingests: 476,174 total, 131,063 missing, 49k OA missing
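Quick sanity check on the counts above: the share of releases still missing before and after the ingests. A minimal sketch; only the four totals come from the notes, the variable and function names are illustrative.

```python
# Counts copied from the notes above; names are illustrative only.
before_total, before_missing = 470_986, 170_987   # 2022-03-08
after_total, after_missing = 476_174, 131_063     # 2022-04-04


def missing_pct(missing: int, total: int) -> float:
    """Share of releases without an archived file, as a percentage."""
    return 100.0 * missing / total


before_pct = missing_pct(before_missing, before_total)
after_pct = missing_pct(after_missing, after_total)
net_drop = before_missing - after_missing

print(f"before: {before_pct:.1f}% missing")
print(f"after:  {after_pct:.1f}% missing")
print(f"net drop in missing releases: {net_drop:,}")
```

Note the total also grew between the two dates, so the net drop understates how many files were actually found.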
## Metadata Prep
- container metadata update (no code changes)
    x wikidata SPARQL update
    x chocula run
    x journal metadata update (fatcat)
    x update journal stats (fatcat extra)
- DOAJ article metadata import
    x prep and upload single JSON file
## Journal Homepage URL Crawl
x dump ukraine-related journal homepages from chocula DB
x create crawl config
x start crawl
x repeat for belarus and russia
    python3 -m chocula export_urls > homepage_urls.2022-03-08.tsv
    cat homepage_urls.2022-03-08.tsv | cut -f2 | rg '\.ua/' | sort -u > homepage_urls.2022-03-08.ua_tld.tsv
    wc -l homepage_urls.2022-03-08.ua_tld.tsv
    1550 homepage_urls.2022-03-08.ua_tld.tsv

    cat homepage_urls.2022-03-08.tsv | cut -f2 | rg '\.by/' | sort -u > homepage_urls.2022-03-08.by_tld.tsv
    cat homepage_urls.2022-03-08.tsv | cut -f2 | rg '\.ru/' | sort -u > homepage_urls.2022-03-08.ru_tld.tsv
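The `cut -f2 | rg | sort -u` pipelines above could be sketched in Python roughly as follows. This assumes the URL sits in the second tab-separated column (as `cut -f2` implies); the function name and the sample rows are made up for illustration.

```python
def filter_tld(tsv_lines, tld):
    """Return sorted, de-duplicated URLs (column 2) containing '.<tld>/'."""
    needle = f".{tld}/"
    urls = set()
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        # Unlike `cut -f2`, rows without a second column are skipped here.
        if len(cols) >= 2 and needle in cols[1]:
            urls.add(cols[1])
    return sorted(urls)


# Made-up sample rows: first column is a placeholder identifier.
sample = [
    "0000-0001\thttp://journal.example.ua/\n",
    "0000-0002\thttp://journal.example.ua/\n",  # duplicate URL
    "0000-0003\thttp://vestnik.example.ru/index\n",
    "0000-0004\thttp://example.org/\n",
]

ua_urls = filter_tld(sample, "ua")
ru_urls = filter_tld(sample, "ru")
print(ua_urls)
print(ru_urls)
```

Like the `rg` pattern, this is a plain substring match, so a URL ending in `.ua` with no trailing slash would not be picked up.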
sqlite3:

    select count(*) from journal where country = 'ua' or lang = 'uk' or name like '%ukrain%' or publi
    1952

    SELECT COUNT(*) FROM homepage
    LEFT JOIN journal ON homepage.issnl = journal.issnl
    WHERE
        journal.country = 'ua'
        OR journal.lang = 'uk'
        OR journal.name like '%ukrain%'
        OR journal.publisher like '%ukrain%';
    => 1970

    .mode csv
    .once homepage_urls_ukraine.tsv
    SELECT homepage.url FROM homepage
    LEFT JOIN journal ON homepage.issnl = journal.issnl
    WHERE
        journal.country = 'ua'
        OR journal.lang = 'uk'
        OR journal.name like '%ukrain%'
        OR journal.publisher like '%ukrain%';

    .mode csv
    .once homepage_urls_russia.tsv
    SELECT homepage.url FROM homepage
    LEFT JOIN journal ON homepage.issnl = journal.issnl
    WHERE
        journal.country = 'ru'
        OR journal.lang = 'ru'
        OR journal.name like '%russ%'
        OR journal.publisher like '%russ%';

    .mode csv
    .once homepage_urls_belarus.tsv
    SELECT homepage.url FROM homepage
    LEFT JOIN journal ON homepage.issnl = journal.issnl
    WHERE
        journal.country = 'by'
        OR journal.lang = 'be'
        OR journal.name like '%belarus%'
        OR journal.publisher like '%belarus%';
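The three exports above differ only in country code, language code, and name stem, so they could be driven from one parameterized query. A rough sketch using Python's `sqlite3` module; the two-table schema below is a cut-down assumption (only the columns the queries touch), and the rows and ISSN-Ls are invented for the demo.

```python
import sqlite3

QUERY = """
SELECT homepage.url FROM homepage
LEFT JOIN journal ON homepage.issnl = journal.issnl
WHERE
    journal.country = ?
    OR journal.lang = ?
    OR journal.name LIKE ?
    OR journal.publisher LIKE ?
"""

# (country code, language code, name/publisher stem) from the queries above
REGIONS = {
    "ukraine": ("ua", "uk", "ukrain"),
    "russia": ("ru", "ru", "russ"),
    "belarus": ("by", "be", "belarus"),
}


def homepage_urls(conn, region):
    """Return homepage URLs for journals matching the region's filters."""
    country, lang, stem = REGIONS[region]
    pattern = f"%{stem}%"
    rows = conn.execute(QUERY, (country, lang, pattern, pattern))
    return sorted(url for (url,) in rows)


# Demo on an in-memory database with an assumed, minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE journal (issnl TEXT, country TEXT, lang TEXT, name TEXT, publisher TEXT)")
conn.execute("CREATE TABLE homepage (issnl TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO journal VALUES (?, ?, ?, ?, ?)",
    [
        ("0000-0001", "ua", "uk", "Visnyk", "Kyiv Press"),
        ("0000-0002", None, "en", "Ukrainian Journal of X", "Some Publisher"),
        ("0000-0003", "de", "de", "Zeitschrift", "Verlag"),
    ],
)
conn.executemany(
    "INSERT INTO homepage VALUES (?, ?)",
    [
        ("0000-0001", "http://visnyk.example.ua/"),
        ("0000-0002", "http://ujx.example.org/"),
        ("0000-0003", "http://z.example.de/"),
    ],
)

ukraine_urls = homepage_urls(conn, "ukraine")
russia_urls = homepage_urls(conn, "russia")
print(ukraine_urls)
print(russia_urls)
```

Because the `WHERE` clause only tests `journal` columns, homepages with no matching journal row are dropped anyway, so the `LEFT JOIN` behaves like an inner join here. SQLite's `LIKE` is case-insensitive for ASCII by default, which is why the lowercase stems match capitalized names.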
    cat homepage_urls_ukraine.tsv homepage_urls.2022-03-08.ua_tld.tsv | sort -u > homepage_urls_ukraine_combined.2022-03-08.tsv
    wc -l homepage_urls.2022-03-08.ua_tld.tsv homepage_urls_ukraine.tsv homepage_urls_ukraine_combined.2022-03-08.tsv
    1550 homepage_urls.2022-03-08.ua_tld.tsv
    1971 homepage_urls_ukraine.tsv
    3482 homepage_urls_ukraine_combined.2022-03-08.tsv

    cat homepage_urls_russia.tsv homepage_urls.2022-03-08.ru_tld.tsv | sort -u > homepage_urls_russia_combined.2022-03-08.tsv
    wc -l homepage_urls_russia.tsv homepage_urls.2022-03-08.ru_tld.tsv homepage_urls_russia_combined.2022-03-08.tsv
    3728 homepage_urls_russia.tsv
    2420 homepage_urls.2022-03-08.ru_tld.tsv
    6030 homepage_urls_russia_combined.2022-03-08.tsv

    cat homepage_urls_belarus.tsv homepage_urls.2022-03-08.by_tld.tsv | sort -u > homepage_urls_belarus_combined.2022-03-08.tsv
    wc -l homepage_urls_belarus.tsv homepage_urls.2022-03-08.by_tld.tsv homepage_urls_belarus_combined.2022-03-08.tsv
    138 homepage_urls_belarus.tsv
    85 homepage_urls.2022-03-08.by_tld.tsv
    222 homepage_urls_belarus_combined.2022-03-08.tsv
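The `wc -l` numbers above show how little the SQL-derived and TLD-derived lists overlap. A small worked calculation of how many lines `sort -u` removed in each case; the counts are from the notes, the names are illustrative.

```python
# (SQL-derived list, TLD-derived list, combined after `sort -u`), from wc -l above
counts = {
    "ukraine": (1971, 1550, 3482),
    "russia": (3728, 2420, 6030),
    "belarus": (138, 85, 222),
}


def lines_removed(sql_lines: int, tld_lines: int, combined: int) -> int:
    """Lines dropped by de-duplication when the two lists are merged."""
    return sql_lines + tld_lines - combined


removed = {region: lines_removed(*c) for region, c in counts.items()}
for region, n in removed.items():
    print(f"{region}: {n} duplicate lines removed")
```

This counts every line `sort -u` dropped, so it is an upper bound on the true overlap: the SQL exports were not de-duplicated on their own, and any repeats inside them are included in the figure.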
## Landing Page Crawl
x create crawl config
x fatcat ingest query for related URLs
    => special request code/label?
x finish .by and .ru article URL dump, start crawling
x URL list filtered from new OAI-PMH feed
    => do we need to do full bulk load/dump, or not?
- URL list from partner (google)
- do we need to do alternative thing of iterating over containers, ingesting each?
    ./fatcat_ingest.py --env prod \
        --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --kafka-request-topic sandcrawler-prod.ingest-file-requests-bulk \
        --ingest-type pdf \
        --allow-non-oa \
        query "country_code:ua OR lang:uk"
    # around Tue 08 Mar 2022 01:07:37 PM PST
    # Expecting 185659 release objects in search queries
    # didn't complete successfully? hrm

    # ok, retry "manually" (with kafkacat)
    ./fatcat_ingest.py --env prod \
        --ingest-type pdf \
        --allow-non-oa \
        query "country_code:ua OR lang:uk" \
        | pv -l \
        | gzip \
        > /srv/fatcat/ingest_ua_pdfs.2022-03-08.requests.json
    # Counter({'elasticsearch_release': 172881, 'estimate': 172881, 'ingest_request': 103318})
    # 103k 0:25:04 [68.7 /s]
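The counter and `pv` lines above can be cross-checked with a little arithmetic. A sketch, assuming the `ingest_request` counter equals the number of lines `pv` saw; names are illustrative.

```python
# Numbers from the Counter and pv output above.
releases_seen = 172_881      # 'elasticsearch_release'
requests_written = 103_318   # 'ingest_request'
elapsed_s = 25 * 60 + 4      # pv reported 0:25:04

yield_pct = 100.0 * requests_written / releases_seen
rate_per_s = requests_written / elapsed_s

print(f"{yield_pct:.1f}% of releases produced an ingest request")
print(f"{rate_per_s:.1f} requests/s")
```

The computed rate agrees with the `68.7 /s` that `pv` printed, which supports the assumption that the two counts refer to the same lines.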
    zcat /srv/fatcat/ingest_ua_pdfs.2022-03-08.requests.json \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
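For reference, the `rg -v "\\\\" | jq . -c` stage drops every line containing a literal backslash and re-emits the rest as compact one-line JSON. A minimal Python sketch of that filtering step only (no Kafka publishing); the function name and sample records are hypothetical.

```python
import json


def clean_requests(lines):
    """Yield compact JSON for each non-empty line that contains no backslash."""
    for line in lines:
        line = line.strip()
        # Same effect as `rg -v "\\\\"`: skip anything with a literal backslash.
        if not line or "\\" in line:
            continue
        yield json.dumps(json.loads(line), separators=(",", ":"), ensure_ascii=False)


# Hypothetical sample records; the second contains an escaped backslash.
sample = [
    '{"base_url": "http://example.ua/a.pdf", "ingest_type": "pdf"}',
    r'{"base_url": "http://example.ua/b\\c.pdf", "ingest_type": "pdf"}',
]

cleaned = list(clean_requests(sample))
print(cleaned)
```

Note that the filter is coarse: any record using a JSON escape sequence (an escaped quote, for instance) is dropped as well, not only records with backslashes in the URL.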
    zcat ingest_ua_pdfs.2022-03-08.requests.json.gz | jq .base_url -r | sort -u | pv -l | gzip > ingest_ua_pdfs.2022-03-08.txt.gz
    # 103k 0:00:02 [38.1k/s]
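The `jq .base_url -r | sort -u` step produces the seed list for the crawl. A rough Python equivalent, assuming one JSON object per line with a `base_url` key (the only field name taken from the command above); the helper name and sample data are invented.

```python
import json


def unique_base_urls(lines):
    """Return the sorted set of base_url values from JSON-lines input."""
    urls = set()
    for line in lines:
        if not line.strip():
            continue
        url = json.loads(line).get("base_url")
        # jq would print "null" for a missing key; this sketch skips such rows.
        if url is not None:
            urls.add(url)
    return sorted(urls)


# For a real file, lines could come from gzip.open(path, "rt") instead.
sample = [
    '{"base_url": "http://example.ua/b.pdf"}',
    '{"base_url": "http://example.ua/a.pdf"}',
    '{"base_url": "http://example.ua/b.pdf"}',
    '{"ingest_type": "pdf"}',
]

seeds = unique_base_urls(sample)
print(seeds)
```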
    ./fatcat_ingest.py --env prod \
        --ingest-type pdf \
        --allow-non-oa \
        query "country_code:by OR lang:be" \
        | pv -l \
        | gzip \
        > /srv/fatcat/tasks/ingest_by_pdfs.2022-03-09.requests.json.gz
    # Expecting 2266 release objects in search queries
    # 1.29k 0:00:34 [37.5 /s]

    zcat /srv/fatcat/tasks/ingest_by_pdfs.2022-03-09.requests.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

    zcat ingest_by_pdfs.2022-03-09.requests.json.gz | jq .base_url -r | sort -u | pv -l | gzip > ingest_by_pdfs.2022-03-09.txt.gz
    ./fatcat_ingest.py --env prod \
        --ingest-type pdf \
        --allow-non-oa \
        query "country_code:ru OR lang:ru" \
        | pv -l \
        | gzip \
        > /srv/fatcat/tasks/ingest_ru_pdfs.2022-03-09.requests.json.gz
    # Expecting 1515246 release objects in search queries

    zcat /srv/fatcat/tasks/ingest_ru_pdfs.2022-03-09.requests.partial.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

    zcat ingest_ru_pdfs.2022-03-09.requests.partial.json.gz | jq .base_url -r | sort -u | pv -l | gzip > ingest_ru_pdfs.2022-03-09.txt.gz
    zstdcat oai_pmh_partial_dump_2022_03_01_urls.txt.zst | rg '\.ua/' | pv -l > oai_pmh_partial_dump_2022_03_01_urls.ua_tld.txt
    # 309k 0:00:03 [81.0k/s]

    zstdcat oai_pmh_partial_dump_2022_03_01_urls.txt.zst | rg '\.by/' | pv -l > oai_pmh_partial_dump_2022_03_01_urls.by_tld.txt
    # 71.2k 0:00:03 [19.0k/s]

    zstdcat oai_pmh_partial_dump_2022_03_01_urls.txt.zst | rg '\.ru/' | pv -l > oai_pmh_partial_dump_2022_03_01_urls.ru_tld.txt
    # 276k 0:00:03 [72.9k/s]
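The `rg '\.ua/'` filters above are substring matches, which is what was actually run. If a stricter filter were wanted later, one option (not used in this crawl) would be to test the parsed hostname instead. A sketch with the standard library; the function name and example URLs are made up.

```python
from urllib.parse import urlsplit


def has_tld(url: str, tld: str) -> bool:
    """True if the URL's hostname ends in '.<tld>' (alternative to substring match)."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # urlsplit raises on some malformed netlocs (e.g. bad IPv6 brackets).
        return False
    if host is None:
        return False
    return host.rstrip(".").endswith("." + tld)


examples = [
    "http://repo.example.ua/oai?verb=Identify",
    "http://example.ua",                       # no trailing slash
    "http://example.ua:8080/paper.pdf",        # port before the slash
    "http://example.com/files/report.ua/",     # '.ua/' only in the path
    "not a url",
]

matches = [u for u in examples if has_tld(u, "ua")]
print(matches)
```

Compared with the substring pattern, this also catches hosts with a port or no trailing slash, and ignores `.ua/` appearing only inside a path.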
### Landing Page Bulk Ingest
Running these 2022-03-24, after targeted crawl completed:
    zcat /srv/fatcat/tasks/ingest_ua_pdfs.2022-03-08.requests.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # 103k 0:00:02 [36.1k/s]

    zcat /srv/fatcat/tasks/ingest_by_pdfs.2022-03-09.requests.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # 1.29k 0:00:00 [15.8k/s]

    zcat /srv/fatcat/tasks/ingest_ru_pdfs.2022-03-09.requests.partial.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # 546k 0:00:13 [40.6k/s]
It will probably take a week or more for these to complete.
## Outreach
- openalex
- sucho.org
- ceeol.com