## Using Fatcat Tool
Want to enqueue some backfill URLs to crawl, now that SPNv2 is on the mend.
Example dry-run:

```
./fatcat_ingest.py --dry-run --limit 50 --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --after-year 2020 container --name elife
```
Big OA from 2020 (past month):

```
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --after-year 2020 container --name elife
    Will send ingest requests to kafka topic: sandcrawler-prod.ingest-file-requests
    Expecting 158 release objects in search queries
    Counter({'ingest_request': 158, 'estimate': 158, 'kafka': 158, 'elasticsearch_release': 158})

./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org container --name elife
    Will send ingest requests to kafka topic: sandcrawler-prod.ingest-file-requests
    Expecting 2312 release objects in search queries
    Counter({'kafka': 2312, 'ingest_request': 2312, 'elasticsearch_release': 2312, 'estimate': 2312})

# note: did 100 first to test
./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --after-year 2020 container --name plos
    Will send ingest requests to kafka topic: sandcrawler-prod.ingest-file-requests
    Expecting 1185 release objects in search queries
    Counter({'estimate': 1185, 'ingest_request': 1185, 'elasticsearch_release': 1185, 'kafka': 1185})

./fatcat_ingest.py --limit 500 --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --after-year 2020 container --publisher elsevier
    Will send ingest requests to kafka topic: sandcrawler-prod.ingest-file-requests
    Expecting 89 release objects in search queries
    Counter({'elasticsearch_release': 89, 'estimate': 89, 'ingest_request': 89, 'kafka': 89})

./fatcat_ingest.py --limit 500 --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --after-year 2020 container --publisher ieee
    Will send ingest requests to kafka topic: sandcrawler-prod.ingest-file-requests
    Expecting 499 release objects in search queries
    Counter({'kafka': 499, 'ingest_request': 499, 'estimate': 499, 'elasticsearch_release': 499})

./fatcat_ingest.py --limit 500 --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --after-year 2020 container --name bmj
    Will send ingest requests to kafka topic: sandcrawler-prod.ingest-file-requests
    Expecting 28 release objects in search queries
    Counter({'elasticsearch_release': 28, 'ingest_request': 28, 'kafka': 28, 'estimate': 28})

./fatcat_ingest.py --dry-run --limit 500 --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --after-year 2020 container --publisher springer
    Will send ingest requests to kafka topic: sandcrawler-prod.ingest-file-requests
    Expecting 6225 release objects in search queries
    Counter({'estimate': 6225, 'kafka': 500, 'elasticsearch_release': 500, 'ingest_request': 500})

./fatcat_ingest.py --limit 1000 --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa container --container-id zpobyv4vbranllc7oob56tgci4
    Will send ingest requests to kafka topic: sandcrawler-prod.ingest-file-requests
    Expecting 2920 release objects in search queries
    Counter({'estimate': 2920, 'elasticsearch_release': 1001, 'ingest_request': 1000, 'kafka': 1000})
```
Hip coronavirus papers:

```
./fatcat_ingest.py --limit 2000 --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa query coronavirus
    Will send ingest requests to kafka topic: sandcrawler-prod.ingest-file-requests
    Expecting 5332 release objects in search queries
    Counter({'estimate': 5332, 'elasticsearch_release': 2159, 'ingest_request': 2000, 'kafka': 2000})

./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa query 2019-nCoV
    Will send ingest requests to kafka topic: sandcrawler-prod.ingest-file-requests
    Expecting 110 release objects in search queries
    Counter({'ingest_request': 110, 'kafka': 110, 'elasticsearch_release': 110, 'estimate': 110})

./fatcat_ingest.py --env prod --enqueue-kafka --kafka-hosts wbgrp-svc263.us.archive.org --allow-non-oa query MERS-CoV
    Will send ingest requests to kafka topic: sandcrawler-prod.ingest-file-requests
    Expecting 589 release objects in search queries
    Counter({'estimate': 589, 'elasticsearch_release': 589, 'ingest_request': 552, 'kafka': 552})
```
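Each of these runs pushes one JSON ingest request per release onto the `sandcrawler-prod.ingest-file-requests` topic. For reference, a minimal sketch of enqueueing a single request by hand with `confluent_kafka`; the field names in the message body are an assumption about the request schema (not verified against what fatcat_ingest.py actually emits), and the DOI/ident values are made up:

```python
import json

from confluent_kafka import Producer

# Assumed shape of an ingest request message; the real schema may differ.
request = {
    "ingest_type": "pdf",
    "base_url": "https://elifesciences.org/articles/54551",
    "link_source": "doi",                                       # assumption
    "link_source_id": "10.7554/elife.54551",                    # hypothetical DOI
    "fatcat": {"release_ident": "hsmo6p4smrganpb3fndaj2lon4"},  # hypothetical ident
}

producer = Producer({"bootstrap.servers": "wbgrp-svc263.us.archive.org"})
producer.produce(
    "sandcrawler-prod.ingest-file-requests",
    json.dumps(request).encode("utf-8"),
)
producer.flush()
```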
Mixed eLife results:

```
["wrong-mimetype",null,"https://elifesciences.org/articles/54551"]
["success",null,"https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvNTE2OTEvZWxpZmUtNTE2OTEtdjEucGRm/elife-51691-v1.pdf?_hash=Jp1cLog1NzIlU%2BvjgLdbM%2BuphOwe5QWUn%2F97tbQBNG4%3D"]
```
## Re-Request Failed Ingests

Select some failed ingest request rows to re-enqueue:
```sql
COPY (
    SELECT row_to_json(ingest_request.*) FROM ingest_request
    LEFT JOIN ingest_file_result ON ingest_file_result.base_url = ingest_request.base_url
    WHERE ingest_request.ingest_type = 'pdf'
        AND ingest_file_result.ingest_type = 'pdf'
        AND ingest_file_result.updated < NOW() - '2 day'::INTERVAL
        AND ingest_file_result.hit = false
        AND ingest_file_result.status = 'spn2-cdx-lookup-failure'
) TO '/grande/snapshots/reingest_spn2cdx_20200205.rows.json';
-- 1536 rows
```
Transform back to full requests:
```
./scripts/ingestrequest_row2json.py reingest_spn2cdx_20200205.rows.json > reingest_spn2cdx_20200205.json
```
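`ingestrequest_row2json.py` just turns the `row_to_json` dump back into one compact request per line. A minimal sketch of the same idea, assuming the dumped rows carry the request fields directly; the backslash-unescaping and the dropped column are assumptions about the dump format, not a copy of the real script:

```python
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # COPY TO doubles backslashes in text output, so undo that before
    # parsing (assumption about how this dump was written).
    row = json.loads(line.replace("\\\\", "\\"))
    row.pop("created", None)  # hypothetical db-only timestamp column
    print(json.dumps(row, sort_keys=True))
```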
Push into kafka (on a kafka broker node):
```
cat ~/reingest_spn2cdx_20200205.json | jq . -c | kafkacat -P -b localhost -t sandcrawler-prod.ingest-file-requests -p -1
```
More:
```sql
COPY (
    SELECT row_to_json(ingest_request.*) FROM ingest_request
    LEFT JOIN ingest_file_result ON ingest_file_result.base_url = ingest_request.base_url
    WHERE ingest_request.ingest_type = 'pdf'
        AND ingest_file_result.ingest_type = 'pdf'
        AND ingest_file_result.updated < NOW() - '2 day'::INTERVAL
        AND ingest_file_result.hit = false
        AND ingest_file_result.status like 'error:%'
) TO '/grande/snapshots/reingest_spn2err1_20200205.rows.json';
-- COPY 1516

COPY (
    SELECT row_to_json(ingest_request.*) FROM ingest_request
    LEFT JOIN ingest_file_result ON ingest_file_result.base_url = ingest_request.base_url
    WHERE ingest_request.ingest_type = 'pdf'
        AND ingest_file_result.ingest_type = 'pdf'
        AND ingest_file_result.updated < NOW() - '2 day'::INTERVAL
        AND ingest_file_result.hit = false
        AND ingest_file_result.status like 'spn2-error%'
) TO '/grande/snapshots/reingest_spn2err2_20200205.rows.json';
-- COPY 16678
```
The next large error classes to try would be `wayback-error` and `cdx-error`, though these are pretty generic. Could consume the kafka output to try to understand those error classes better; a rough sketch of that follows.
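A bounded-sample consumer that tallies result statuses, using `confluent_kafka`; the results topic name is an assumption (the counterpart of the requests topic above), and the throwaway consumer group is hypothetical:

```python
import json
from collections import Counter

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "wbgrp-svc263.us.archive.org",
    "group.id": "error-class-peek",   # hypothetical throwaway group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sandcrawler-prod.ingest-file-results"])  # assumed topic name

counts = Counter()
samples = {}
for _ in range(20000):  # bounded sample, not the whole topic
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    result = json.loads(msg.value())
    status = result.get("status", "unknown")
    counts[status] += 1
    samples.setdefault(status, result)  # keep one full example per status
consumer.close()

for status, n in counts.most_common():
    print(n, status)
```

Eyeballing `samples["wayback-error"]` and `samples["cdx-error"]` should show whether these are transient infrastructure hiccups worth blanket re-enqueueing.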
Oof, by mistake these were enqueued to partition 1 instead of -1 (random), so they will take a week or more to actually process. Re-enqueued with partition -1; ingesting from wayback is pretty fast, so these should mostly result in wayback ingests. Caught up by end of weekend?
## Check Coverages
As follow-ups:

- elife: https://fatcat.wiki/container/en4qj5ijrbf5djxx7p5zzpjyoq/coverage
- archivist: https://fatcat.wiki/container/zpobyv4vbranllc7oob56tgci4/coverage
  => 85 preserved
- jcancer: https://fatcat.wiki/container/nkkzpwht7jd3zdftc6gq4eoeey/coverage
  => 2020 preserved
  => 2520 preserved
- plos: https://fatcat.wiki/container/23nqq3odsjhmbi5tqavvcn7cfm/coverage