
Want to do priority crawling of Ukrainian web content, plus Russia and Belarus.


## What is Missing?

    (country_code:ua OR lang:uk)
    => 2022-03-08, before ingests: 470,986 total, 170,987 missing, almost all article-journal, peak in 2019, 55k explicitly OA
    => later the same day, some 22k of the missing were already found! wow


## Metadata Prep

- container metadata update (no code changes)
    x  wikidata SPARQL update
    x  chocula run
    x  journal metadata update (fatcat)
    x  update journal stats (fatcat extra)
- DOAJ article metadata import
    x  prep and upload single JSON file


## Journal Homepage URL Crawl

x dump Ukraine-related journal homepages from chocula DB
x create crawl config
x start crawl
x repeat for Belarus and Russia


    python3 -m chocula export_urls > homepage_urls.2022-03-08.tsv
    cat homepage_urls.2022-03-08.tsv | cut -f2 | rg '\.ua/' | sort -u > homepage_urls.2022-03-08.ua_tld.tsv
    wc -l homepage_urls.2022-03-08.ua_tld.tsv
    1550 homepage_urls.2022-03-08.ua_tld.tsv

    cat homepage_urls.2022-03-08.tsv | cut -f2 | rg '\.by/' | sort -u > homepage_urls.2022-03-08.by_tld.tsv
    cat homepage_urls.2022-03-08.tsv | cut -f2 | rg '\.ru/' | sort -u > homepage_urls.2022-03-08.ru_tld.tsv
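
The `rg '\.ua/'` filters above match the pattern anywhere in the URL, so a path such as `/mirror.ua/page` on a `.com` host would also pass, while a URL with a port or no trailing slash after the TLD would be missed. A minimal sketch of a stricter hostname-based filter follows, assuming (as `cut -f2` suggests) that the URL is the second tab-separated column. The function names are hypothetical and not part of chocula.

```python
from urllib.parse import urlsplit


def host_has_tld(url: str, tld: str) -> bool:
    """True if the URL's hostname ends in the given TLD (e.g. "ua")."""
    # URLs without a scheme have no parsed hostname and are rejected here.
    host = urlsplit(url.strip()).hostname or ""
    return host == tld or host.endswith("." + tld)


def filter_by_tld(tsv_lines, tld: str) -> list[str]:
    """Take the second tab-separated column, keep matching URLs, sorted and unique."""
    urls = set()
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2 and host_has_tld(cols[1], tld):
            urls.add(cols[1])
    return sorted(urls)
```

Something like `filter_by_tld(open("homepage_urls.2022-03-08.tsv"), "ua")` would stand in for the `cut | rg | sort -u` pipeline. The line counts would probably differ slightly from the ones recorded here, since the matching rule is not identical.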

sqlite3:

    select count(*) from journal where country = 'ua' or lang = 'uk' or name like '%ukrain%' or publi
    1952

    SELECT COUNT(*) FROM homepage
    LEFT JOIN journal ON homepage.issnl = journal.issnl
    WHERE
        journal.country = 'ua'
        OR journal.lang = 'uk'
        OR journal.name like '%ukrain%'
        OR journal.publisher like '%ukrain%';
    => 1970

    .mode csv
    .once homepage_urls_ukraine.tsv
    SELECT homepage.url FROM homepage
    LEFT JOIN journal ON homepage.issnl = journal.issnl
    WHERE
        journal.country = 'ua'
        OR journal.lang = 'uk'
        OR journal.name like '%ukrain%'
        OR journal.publisher like '%ukrain%';

    .mode csv
    .once homepage_urls_russia.tsv
    SELECT homepage.url FROM homepage
    LEFT JOIN journal ON homepage.issnl = journal.issnl
    WHERE
        journal.country = 'ru'
        OR journal.lang = 'ru'
        OR journal.name like '%russ%'
        OR journal.publisher like '%russ%';

    .mode csv
    .once homepage_urls_belarus.tsv
    SELECT homepage.url FROM homepage
    LEFT JOIN journal ON homepage.issnl = journal.issnl
    WHERE
        journal.country = 'by'
        OR journal.lang = 'be'
        OR journal.name like '%belarus%'
        OR journal.publisher like '%belarus%';
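
The three export queries differ only in the country code, language code, and name pattern, so they could be driven from one parameterized query. A sketch using Python's standard `sqlite3` module is below. Table and column names are taken from the queries above; the function name and its parameters are hypothetical. Unlike the queries above it adds `DISTINCT` and `ORDER BY`, and it uses a plain `JOIN`, which should return the same rows because homepages without a matching journal fail every `WHERE` condition anyway.

```python
import sqlite3


def homepage_urls(conn, country, lang, name_pat):
    """Return homepage URLs for journals matching country, language, or a name/publisher pattern."""
    rows = conn.execute(
        """
        SELECT DISTINCT homepage.url
        FROM homepage
        JOIN journal ON homepage.issnl = journal.issnl
        WHERE journal.country = ?
           OR journal.lang = ?
           OR journal.name LIKE ?
           OR journal.publisher LIKE ?
        ORDER BY homepage.url
        """,
        (country, lang, name_pat, name_pat),
    )
    return [row[0] for row in rows]
```

With an open connection to the chocula database, `homepage_urls(conn, "by", "be", "%belarus%")` would correspond to the Belarus export.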

    cat homepage_urls_ukraine.tsv homepage_urls.2022-03-08.ua_tld.tsv | sort -u > homepage_urls_ukraine_combined.2022-03-08.tsv

    wc -l homepage_urls.2022-03-08.ua_tld.tsv homepage_urls_ukraine.tsv homepage_urls_ukraine_combined.2022-03-08.tsv 
        1550 homepage_urls.2022-03-08.ua_tld.tsv
        1971 homepage_urls_ukraine.tsv
        3482 homepage_urls_ukraine_combined.2022-03-08.tsv

    cat homepage_urls_russia.tsv homepage_urls.2022-03-08.ru_tld.tsv | sort -u > homepage_urls_russia_combined.2022-03-08.tsv

    wc -l homepage_urls_russia.tsv homepage_urls.2022-03-08.ru_tld.tsv homepage_urls_russia_combined.2022-03-08.tsv
        3728 homepage_urls_russia.tsv
        2420 homepage_urls.2022-03-08.ru_tld.tsv
        6030 homepage_urls_russia_combined.2022-03-08.tsv


    cat homepage_urls_belarus.tsv homepage_urls.2022-03-08.by_tld.tsv | sort -u > homepage_urls_belarus_combined.2022-03-08.tsv

    wc -l homepage_urls_belarus.tsv homepage_urls.2022-03-08.by_tld.tsv homepage_urls_belarus_combined.2022-03-08.tsv
        138 homepage_urls_belarus.tsv
        85 homepage_urls.2022-03-08.by_tld.tsv
        222 homepage_urls_belarus_combined.2022-03-08.tsv
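
Sanity check on the counts above: the sum of the two input lists minus the combined count is the number of lines `sort -u` dropped. The sketch below does that arithmetic and also shows a merge that strips whitespace first. Function names are hypothetical; the counts are copied from the `wc -l` output.

```python
def merge_url_lists(*lists):
    """Union of several URL lists, stripped of surrounding whitespace (including a trailing CR)."""
    merged = set()
    for lines in lists:
        for line in lines:
            url = line.strip()
            if url:
                merged.add(url)
    return sorted(merged)


def dropped_by_dedup(count_a, count_b, combined):
    """Lines that disappeared when two lists were concatenated and passed through sort -u."""
    return count_a + count_b - combined


# Counts copied from the wc -l output above: (SQL export, TLD match, combined)
counts = {
    "ukraine": (1971, 1550, 3482),
    "russia": (3728, 2420, 6030),
    "belarus": (138, 85, 222),
}
dropped = {name: dropped_by_dedup(*c) for name, c in counts.items()}
print(dropped)  # {'ukraine': 39, 'russia': 118, 'belarus': 1}
```

The overlap looks small for lists that should share many journals. One possible explanation, not confirmed anywhere in these notes, is that the sqlite3 shell's csv mode ends rows with CRLF (and quotes fields containing commas), so otherwise identical URLs would not collapse under `sort -u`. Stripping lines before merging, as `merge_url_lists` does, would rule that out.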


## Landing Page Crawl

x create crawl config
x fatcat ingest query for related URLs
    => special request code/label?
x finish .by and .ru article URL dump, start crawling
x URL list filtered from new OAI-PMH feed
    => do we need to do full bulk load/dump, or not?
- URL list from partner (Google)
- do we need the alternative approach of iterating over containers and ingesting each?

    ./fatcat_ingest.py --env prod \
        --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --kafka-request-topic sandcrawler-prod.ingest-file-requests-bulk \
        --ingest-type pdf \
        --allow-non-oa \
        query "country_code:ua OR lang:uk"

    # around Tue 08 Mar 2022 01:07:37 PM PST
    # Expecting 185659 release objects in search queries
    # didn't complete successfully? hrm

    # ok, retry "manually" (with kafkacat)
    ./fatcat_ingest.py --env prod \
        --ingest-type pdf \
        --allow-non-oa \
        query "country_code:ua OR lang:uk" \
    | pv -l \
    | gzip \
    > /srv/fatcat/ingest_ua_pdfs.2022-03-08.requests.json
    # Counter({'elasticsearch_release': 172881, 'estimate': 172881, 'ingest_request': 103318})
    # 103k 0:25:04 [68.7 /s]
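
Quick arithmetic on the numbers just above, as a sketch: what fraction of matched releases produced an ingest request, and whether the rate agrees with what `pv` printed. The variable names are illustrative; the values are copied from the Counter and `pv` lines.

```python
releases = 172_881        # 'elasticsearch_release' in the Counter above
requests = 103_318        # 'ingest_request' in the Counter above
elapsed_s = 25 * 60 + 4   # pv reported 0:25:04

yield_fraction = requests / releases
rate_per_s = requests / elapsed_s

print(f"{yield_fraction:.1%} of releases produced a request")  # 59.8%
print(f"{rate_per_s:.1f} requests/s")                          # 68.7
```

So roughly 40% of matching releases yielded no ingest request. The notes do not say why.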

    zcat /srv/fatcat/ingest_ua_pdfs.2022-03-08.requests.json \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

    zcat ingest_ua_pdfs.2022-03-08.requests.json.gz | jq .base_url -r | sort -u | pv -l | gzip > ingest_ua_pdfs.2022-03-08.txt.gz
    # 103k 0:00:02 [38.1k/s]
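
For reference, a Python sketch of the two filtering steps around the `kafkacat` call: dropping any request line that contains a backslash (what `rg -v "\\\\"` does, presumably to avoid escaping trouble downstream), re-serializing compactly like `jq . -c`, and collecting unique `base_url` values like `jq .base_url -r | sort -u`. It does not publish to Kafka. The function names are hypothetical, and it assumes every request has a `base_url` field.

```python
import gzip
import json


def clean_requests(path):
    """Yield compact JSON lines from a gzipped JSON-lines file, skipping lines with backslashes."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or "\\" in line:
                continue
            # ensure_ascii=False keeps non-ASCII text unescaped, as jq does by default
            yield json.dumps(json.loads(line), separators=(",", ":"), ensure_ascii=False)


def unique_base_urls(path):
    """Sorted unique base_url values; assumes each request has a base_url field."""
    urls = set()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                urls.add(json.loads(line)["base_url"])
    return sorted(urls)
```

Note that the backslash filter only applies to what gets enqueued; the URL list is built from every request, same as in the shell commands.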

    ./fatcat_ingest.py --env prod \
        --ingest-type pdf \
        --allow-non-oa \
        query "country_code:by OR lang:be" \
    | pv -l \
    | gzip \
    > /srv/fatcat/tasks/ingest_by_pdfs.2022-03-09.requests.json.gz
    # Expecting 2266 release objects in search queries
    # 1.29k 0:00:34 [37.5 /s]

    zcat /srv/fatcat/tasks/ingest_by_pdfs.2022-03-09.requests.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

    zcat ingest_by_pdfs.2022-03-09.requests.json.gz | jq .base_url -r | sort -u | pv -l | gzip > ingest_by_pdfs.2022-03-09.txt.gz

    ./fatcat_ingest.py --env prod \
        --ingest-type pdf \
        --allow-non-oa \
        query "country_code:ru OR lang:ru" \
    | pv -l \
    | gzip \
    > /srv/fatcat/tasks/ingest_ru_pdfs.2022-03-09.requests.json.gz
    # Expecting 1515246 release objects in search queries

    zcat /srv/fatcat/tasks/ingest_ru_pdfs.2022-03-09.requests.partial.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

    zcat ingest_ru_pdfs.2022-03-09.requests.partial.json.gz | jq .base_url -r | sort -u | pv -l | gzip > ingest_ru_pdfs.2022-03-09.txt.gz


    zstdcat oai_pmh_partial_dump_2022_03_01_urls.txt.zst | rg '\.ua/' | pv -l > oai_pmh_partial_dump_2022_03_01_urls.ua_tld.txt
    # 309k 0:00:03 [81.0k/s]

    zstdcat oai_pmh_partial_dump_2022_03_01_urls.txt.zst | rg '\.by/' | pv -l > oai_pmh_partial_dump_2022_03_01_urls.by_tld.txt
    # 71.2k 0:00:03 [19.0k/s]

    zstdcat oai_pmh_partial_dump_2022_03_01_urls.txt.zst | rg '\.ru/' | pv -l > oai_pmh_partial_dump_2022_03_01_urls.ru_tld.txt
    # 276k 0:00:03 [72.9k/s]


### Landing Page Bulk Ingest

Running these 2022-03-24, after targeted crawl completed:

    zcat /srv/fatcat/tasks/ingest_ua_pdfs.2022-03-08.requests.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # 103k 0:00:02 [36.1k/s]

    zcat /srv/fatcat/tasks/ingest_by_pdfs.2022-03-09.requests.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # 1.29k 0:00:00 [15.8k/s]

    zcat /srv/fatcat/tasks/ingest_ru_pdfs.2022-03-09.requests.partial.json.gz \
        | rg -v "\\\\" \
        | jq . -c \
        | pv -l \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # 546k 0:00:13 [40.6k/s]

It will probably take a week or more for these to complete.
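
A rough back-of-envelope for that estimate, as a sketch: if the three batches above finished in exactly one week, the implied sustained rate would be about one request per second. The counts are the rounded `pv` figures, and the one-week figure is just the guess above taken literally; the actual worker throughput is not recorded in these notes.

```python
# Line counts as reported (rounded) by pv for the three bulk-ingest pipelines above
enqueued = {"ua": 103_000, "by": 1_290, "ru": 546_000}
total = sum(enqueued.values())

seconds_per_week = 7 * 24 * 60 * 60
implied_rate = total / seconds_per_week

print(total)                              # 650290
print(f"{implied_rate:.2f} requests/s")   # 1.08
```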


## Outreach

- openalex
- sucho.org
- ceeol.com