Primary goal: start a large crawl of OAI landing pages that we haven't seen before.

Fields of interest for ingest (see the projection sketch after this list):

- OAI identifier
- DOI
- formats
- URLs (maybe also "relations")
- types (type+stage)
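
As a quick sanity check, those fields can be projected straight out of the dump with jq. A minimal sketch; the field names `oai`, `urls`, and `relations` are assumptions about the dump's schema and should be verified against a sample record:

    # peek at the fields of interest in the first few records
    zstdcat oai.ndjson.zst | head -n5 | jq '{oai, doi, formats, urls, relations, types}' -c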
## Other Tasks
About 150 million total lines.

Types coverage:

    zstdcat oai.ndjson.zst | pv -l | jq 'select(.types != null) | .types[]' -r | sort -S 5G | uniq -c | sort -nr -S 1G > types_counts.txt

Dump all ISSNs, with counts; quick check how many are in chocula/fatcat:

    zstdcat oai.ndjson.zst | pv -l | jq 'select(.issn != null) | .issn[]' -r | sort -S 5G | uniq -c | sort -nr -S 1G > issn_counts.txt

Language coverage:

    zstdcat oai.ndjson.zst | pv -l | jq 'select(.languages != null) | .languages[]' -r | sort -S 5G | uniq -c | sort -nr -S 1G > languages_counts.txt

Format coverage:

    zstdcat oai.ndjson.zst | pv -l | jq 'select(.formats != null) | .formats[]' -r | sort -S 5G | uniq -c | sort -nr -S 1G > formats_counts.txt

    => 150M 0:56:14 [44.7k/s]
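
The four commands above share the same shape; a small shell helper (hypothetical, not part of the repo) would cut the repetition:

    # count_field FIELD: tally all values of an array-valued field across the
    # dump, writing e.g. types_counts.txt; assumes the field is an array of strings
    count_field() {
        zstdcat oai.ndjson.zst \
            | pv -l \
            | jq "select(.$1 != null) | .$1[]" -r \
            | sort -S 5G | uniq -c | sort -nr -S 1G > "$1_counts.txt"
    }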
Have a DOI?

    zstdcat oai.ndjson.zst | pv -l | rg '"doi":' | rg '"10\.' | wc -l
    => 16,013,503

    zstdcat oai.ndjson.zst | pv -l | jq 'select(.doi != null) | .doi[]' -r | sort -u -S 5G > doi_raw.txt
    => 11,940,950

So roughly 16 million records carry a DOI-like string, but only about 11.9 million distinct DOI values.
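
The unique count above is over raw strings. DOIs are case-insensitive, so a case-folded pass (a sketch; not run here) might shrink the set a bit further:

    # unique DOIs after lowercasing
    zstdcat oai.ndjson.zst | jq 'select(.doi != null) | .doi[]' -r \
        | tr '[:upper:]' '[:lower:]' | sort -u -S 5G | wc -l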
## Transform, Load, Bulk Ingest
    zstdcat oai.ndjson.zst | ./oai2ingestrequest.py - | pv -l | gzip > oai.202002.requests.json.gz
    => 80M 6:36:55 [3.36k/s]

    time zcat /schnell/oai-pmh/oai.202002.requests.json.gz | pv -l | ./persist_tool.py ingest-request -
    => 80M 4:00:21 [5.55k/s]
    => Worker: Counter({'total': 80013963, 'insert-requests': 51169081, 'update-requests': 0})
    => JSON lines pushed: Counter({'pushed': 80013963, 'total': 80013963})
    => real    240m21.207s
    => user    85m12.576s
    => sys     3m29.580s
    select count(*) from ingest_request where ingest_type = 'pdf' and link_source = 'oai';
    => 51,185,088
Why were so many skipped (80,013,963 pushed minus 51,169,081 inserted, about 29 million)? Not unique?
    zcat oai.202002.requests.json.gz | jq '[.link_source_id, .base_url]' -c | sort -u -S 4G | wc -l
    => 51,185,088

    zcat oai.202002.requests.json.gz | jq .base_url -r | pv -l | sort -u -S 4G > request_url.txt
    wc -l request_url.txt
    => 50,002,674 request_url.txt

    zcat oai.202002.requests.json.gz | jq .link_source_id -r | pv -l | sort -u -S 4G > request_oai.txt
    wc -l request_oai.txt
    => 34,622,083 request_oai.txt
Yup, tons of duplication: the unique pair count matches the database row count exactly. And remember this is deduplication on the exact URL, not SURT or any other normalization.
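
To see which URLs account for the duplication, a quick tally (a sketch; not run as part of this session) would be:

    # list the 20 most-repeated base URLs in the request dump
    zcat oai.202002.requests.json.gz | jq .base_url -r \
        | sort -S 4G | uniq -c | sort -nr -S 1G | head -n20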
How many of these are URLs we have seen and ingested already?
    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'oai'
    GROUP BY status
    ORDER BY COUNT(*) DESC
    LIMIT 20;
             status          |  count
    -------------------------+----------
                             | 49491452
     success                 |  1469113
     no-capture              |   134611
     redirect-loop           |    59666
     no-pdf-link             |     8947
     cdx-error               |     7561
     terminal-bad-status     |     6704
     null-body               |     5042
     wrong-mimetype          |      879
     wayback-error           |      722
     petabox-error           |      198
     gateway-timeout         |       86
     link-loop               |       51
     invalid-host-resolution |       24
     spn2-cdx-lookup-failure |       22
     spn2-error              |        4
     bad-gzip-encoding       |        4
     spn2-error:job-failed   |        2
    (18 rows)

The empty status (49,491,452 rows) is a NULL from the LEFT JOIN: those requests have never been attempted. Of the ~1.7 million already attempted, about 1.47 million (roughly 87%) succeeded.
Dump the not-yet-attempted ingest requests:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'oai'
            AND date(ingest_request.created) > '2020-05-01'
            AND ingest_file_result.status IS NULL
    ) TO '/grande/snapshots/oai_noingest_20200506.requests.json';
    => COPY 49491452
Push the dump into the bulk ingest Kafka topic. The `rg -v "\\\\"` filter drops any row containing a backslash, i.e. rows where Postgres COPY escaped a character; those lines would no longer parse as clean JSON:

    cat /grande/snapshots/oai_noingest_20200506.requests.json \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
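
To spot-check that the requests actually landed on the topic, something like the following consumer invocation (a sketch: read the last few messages from each partition, then exit) should work:

    # consume a handful of recent messages and show their base URLs
    kafkacat -C -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -o -5 -e | jq .base_url -r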