Primary goal: start a large crawl of OAI landing pages that we haven't seen yet.

Fields of interest for ingest (see the extraction sketch after this list):

- oai identifier
- doi
- formats
- urls (maybe also "relations")
- types (type+stage)
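As a rough Python sketch of what "fields of interest" means per record (the JSON key names here are assumptions based on the jq queries below; the real extraction happens in oai2ingestrequest.py):

    import json
    import sys

    def extract_fields(record: dict) -> dict:
        """Keep only the fields we care about for ingest (key names assumed)."""
        return {
            "oai_id": record.get("oai"),           # OAI identifier (assumed key)
            "doi": record.get("doi"),              # list of DOIs, or null
            "formats": record.get("formats"),
            "urls": record.get("urls"),
            "relations": record.get("relations"),  # maybe also useful
            "types": record.get("types"),          # type+stage
        }

    if __name__ == "__main__":
        for line in sys.stdin:
            print(json.dumps(extract_fields(json.loads(line))))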
## Other Tasks
About 150 million total lines.

Types coverage:

    zstdcat oai.ndjson.zst | pv -l | jq "select(.types != null) | .types[]" -r | sort -S 5G | uniq -c | sort -nr -S 1G > types_counts.txt

Dump all ISSNs, with counts; quick check how many are in chocula/fatcat:

    zstdcat oai.ndjson.zst | pv -l | jq "select(.issn != null) | .issn[]" -r | sort -S 5G | uniq -c | sort -nr -S 1G > issn_counts.txt

Language coverage:

    zstdcat oai.ndjson.zst | pv -l | jq "select(.languages != null) | .languages[]" -r | sort -S 5G | uniq -c | sort -nr -S 1G > languages_counts.txt

Format coverage:

    zstdcat oai.ndjson.zst | pv -l | jq "select(.formats != null) | .formats[]" -r | sort -S 5G | uniq -c | sort -nr -S 1G > formats_counts.txt
    => 150M 0:56:14 [44.7k/s]
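Each of those pipelines re-decompresses the full 150M-line dump. A single-pass alternative in Python (a sketch only: it assumes the same list-valued field names as the jq queries above, and keeps all counters in memory):

    import json
    import sys
    from collections import Counter

    # Count values for several list-valued fields in one pass over NDJSON on
    # stdin, instead of one zstdcat+jq+sort pass per field.
    FIELDS = ["types", "issn", "languages", "formats"]
    counters = {field: Counter() for field in FIELDS}

    for line in sys.stdin:
        record = json.loads(line)
        for field in FIELDS:
            counters[field].update(record.get(field) or [])

    for field in FIELDS:
        with open(f"{field}_counts.txt", "w") as out:
            for value, count in counters[field].most_common():
                out.write(f"{count:8d} {value}\n")

Would be run as something like `zstdcat oai.ndjson.zst | pv -l | python3 count_fields.py` (script name hypothetical).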
Have a DOI?

    zstdcat oai.ndjson.zst | pv -l | rg '"doi":' | rg '"10.' | wc -l
    => 16,013,503

    zstdcat oai.ndjson.zst | pv -l | jq "select(.doi != null) | .doi[]" -r | sort -u -S 5G > doi_raw.txt
    => 11,940,950
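So roughly 16 million records carry something DOI-shaped, but only about 11.9 million distinct DOI strings. The same dedup with a minimal plausibility filter, as a Python sketch (not the project's actual DOI normalization):

    import json
    import sys

    # Collect unique, plausible DOIs ("10."-prefixed) from NDJSON on stdin.
    dois = set()
    for line in sys.stdin:
        for doi in json.loads(line).get("doi") or []:
            doi = doi.strip().lower()
            if doi.startswith("10."):
                dois.add(doi)

    for doi in sorted(dois):
        print(doi)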
## Transform, Load, Bulk Ingest
    zstdcat oai.ndjson.zst | ./oai2ingestrequest.py - | pv -l | gzip > oai.202002.requests.json.gz
    => 80M 6:36:55 [3.36k/s]
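For orientation, a minimal sketch of the kind of transform oai2ingestrequest.py does (not the actual script; the only fields assumed here are the ones referenced by the jq and SQL queries below, plus the assumed "oai" and "urls" keys in the source records):

    import json
    import sys

    def requests_from_record(record: dict):
        """Yield one ingest request per candidate URL in an OAI record."""
        oai_id = record.get("oai")            # assumed key name
        for url in record.get("urls") or []:  # assumed key name
            yield {
                "base_url": url,
                "ingest_type": "pdf",
                "link_source": "oai",
                "link_source_id": oai_id,
            }

    if __name__ == "__main__":
        for line in sys.stdin:
            for request in requests_from_record(json.loads(line)):
                print(json.dumps(request))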
    time zcat /schnell/oai-pmh/oai.202002.requests.json.gz | pv -l | ./persist_tool.py ingest-request -
    => 80M 4:00:21 [5.55k/s]
    => Worker: Counter({'total': 80013963, 'insert-requests': 51169081, 'update-requests': 0})
    => JSON lines pushed: Counter({'pushed': 80013963, 'total': 80013963})
    => real    240m21.207s
    => user    85m12.576s
    => sys     3m29.580s
    select count(*) from ingest_request where ingest_type = 'pdf' and link_source = 'oai';
    => 51,185,088
Why were so many (about 30 million) skipped? Not unique?

    zcat oai.202002.requests.json.gz | jq '[.link_source_id, .base_url]' -c | sort -u -S 4G | wc -l
    => 51,185,088

    zcat oai.202002.requests.json.gz | jq .base_url -r | pv -l | sort -u -S 4G > request_url.txt
    wc -l request_url.txt
    => 50,002,674 request_url.txt

    zcat oai.202002.requests.json.gz | jq .link_source_id -r | pv -l | sort -u -S 4G > requires_oai.txt
    wc -l requires_oai.txt
    => 34,622,083 requires_oai.txt

Yup, tons of duplication: only 51.2 million of the 80 million requests are unique on (link_source_id, base_url). And remember this is exact URL matching, not SURT or similar normalization.
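The same three uniqueness counts in one pass, as a Python sketch (the external-sort pipelines above are the practical approach at this scale; holding ~50 million strings in Python sets takes a lot of RAM):

    import json
    import sys

    # Count distinct (link_source_id, base_url) pairs, distinct URLs, and
    # distinct OAI identifiers in one pass over ingest requests on stdin.
    pairs, urls, oai_ids = set(), set(), set()
    for line in sys.stdin:
        request = json.loads(line)
        pairs.add((request["link_source_id"], request["base_url"]))
        urls.add(request["base_url"])
        oai_ids.add(request["link_source_id"])

    print("unique (oai id, url) pairs:", len(pairs))
    print("unique URLs:", len(urls))
    print("unique OAI identifiers:", len(oai_ids))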
How many of these are URLs we have seen and ingested already?
    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status          |  count
    -------------------------+----------
                             | 49491452
     success                 |  1469113
     no-capture              |   134611
     redirect-loop           |    59666
     no-pdf-link             |     8947
     cdx-error               |     7561
     terminal-bad-status     |     6704
     null-body               |     5042
     wrong-mimetype          |      879
     wayback-error           |      722
     petabox-error           |      198
     gateway-timeout         |       86
     link-loop               |       51
     invalid-host-resolution |       24
     spn2-cdx-lookup-failure |       22
     spn2-error              |        4
     bad-gzip-encoding       |        4
     spn2-error:job-failed   |        2
    (18 rows)
Dump ingest requests:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
            AND date(ingest_request.created) > '2020-05-01'
            AND ingest_file_result.status IS NULL
    ) TO '/grande/snapshots/oai_noingest_20200506.rows.json';
    => COPY 49491452
WARNING: should have transformed from rows to requests here (see the sketch after the push command below); the raw table rows were pushed as-is:

    cat /grande/snapshots/oai_noingest_20200506.rows.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
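For the record, a rough Python sketch of the row-to-request transform that should have happened before the kafkacat push (the column names are assumptions based on the ingest_request queries above; the extra "request" JSON column is hypothetical):

    import json
    import sys

    # Convert row_to_json() dumps of the ingest_request table (stdin) back
    # into plain ingest request documents (stdout). Column names assumed.
    for line in sys.stdin:
        row = json.loads(line)
        request = {
            "base_url": row["base_url"],
            "ingest_type": row["ingest_type"],
            "link_source": row["link_source"],
            "link_source_id": row["link_source_id"],
        }
        extra = row.get("request")  # hypothetical JSON column with extra context
        if extra:
            request.update(extra if isinstance(extra, dict) else json.loads(extra))
        print(json.dumps(request))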
## Crawl and re-ingest
Updated stats after ingest (NOTE: the ingest requests weren't really formed correctly, but that doesn't matter because fatcat wasn't importing these anyway):
    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 20;

             status          |  count
    -------------------------+----------
     no-capture              | 42565875
     success                 |  5227609
     no-pdf-link             |  2156341
     redirect-loop           |   559721
     cdx-error               |   260446
     wrong-mimetype          |   148871
     terminal-bad-status     |   109725
     link-loop               |    92792
     null-body               |    30688
                             |    15287
     petabox-error           |    11109
     wayback-error           |     6261
     skip-url-blocklist      |      184
     gateway-timeout         |       86
     bad-gzip-encoding       |       25
     invalid-host-resolution |       24
     spn2-cdx-lookup-failure |       22
     bad-redirect            |       15
     spn2-error              |        4
     spn2-error:job-failed   |        2
    (20 rows)
Dump again for crawling:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
            AND date(ingest_request.created) > '2020-05-01'
            AND (ingest_file_result.status = 'no-capture' OR ingest_file_result.status = 'cdx-error')
    ) TO '/grande/snapshots/oai_tocrawl_20200526.rows.json';
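To turn those dumped rows into crawler seeds, something like this Python sketch would do (dedup on exact URL only, no SURT normalization; the seed-list format expected by the actual crawl setup may differ):

    import json
    import sys

    # Extract a deduplicated URL seed list from row_to_json() dumps on stdin.
    seen = set()
    for line in sys.stdin:
        url = json.loads(line)["base_url"]
        if url not in seen:
            seen.add(url)
            print(url)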