Cross-posting from fatcat bulk metadata update/ingest.

    zcat dblp_sandcrawler_ingest_requests.json.gz | rg -v "\\\\" | jq . -c | pv -l | kafkacat -P -b wbgrp-svc350.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
    # 631k 0:00:11 [54.0k/s]
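
The same push could also be done programmatically instead of through `kafkacat`. Below is a minimal sketch using the `confluent_kafka` Python client; the broker and topic come from the command above, while the input path and the request fields (`ingest_type`, `base_url`, `link_source`) are assumptions based on the SQL queries later in these notes, not an authoritative schema.

    import gzip
    import json

    from confluent_kafka import Producer

    # broker and topic as in the kafkacat command above
    producer = Producer({"bootstrap.servers": "wbgrp-svc350.us.archive.org"})
    topic = "sandcrawler-prod.ingest-file-requests-bulk"

    # illustrative input path; requests are one JSON object per line
    with gzip.open("dblp_sandcrawler_ingest_requests.json.gz", "rt") as infile:
        for line in infile:
            line = line.strip()
            # mirror the `rg -v "\\\\"` filter: skip blank lines and lines
            # containing a literal backslash
            if not line or "\\" in line:
                continue
            # parse and re-serialize compactly, like `jq . -c`
            request = json.loads(line)
            producer.produce(topic, json.dumps(request).encode("utf-8"))
            producer.poll(0)

    producer.flush()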

## Post-Crawl Stats

This is after bulk ingest, crawl, and a bit of "live" re-ingest. Query run
2022-09-06:

    SELECT ingest_request.ingest_type, ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.link_source = 'dblp'
    GROUP BY ingest_request.ingest_type, status
    -- ORDER BY ingest_request.ingest_type, COUNT DESC
    ORDER BY COUNT DESC
    LIMIT 30;

     ingest_type |        status         |  count
    -------------+-----------------------+--------
     pdf         | success               |  305142
     pdf         | no-pdf-link           |  192683
     pdf         | no-capture            |   42634
     pdf         | terminal-bad-status   |   38041
     pdf         | skip-url-blocklist    |   31055
     pdf         | link-loop             |    9263
     pdf         | wrong-mimetype        |    4545
     pdf         | redirect-loop         |    3952
     pdf         | empty-blob            |    2705
     pdf         | wayback-content-error |     834
     pdf         | wayback-error         |     294
     pdf         | petabox-error         |     202
     pdf         | blocked-cookie        |     155
     pdf         | cdx-error             |     115
     pdf         | body-too-large        |      66
     pdf         | bad-redirect          |      19
     pdf         | timeout               |       7
     pdf         | bad-gzip-encoding     |       4
    (18 rows)

That is quite a lot of `no-pdf-link`; it might be worth doing a random sample
and/or a re-ingest. There is also a sizable chunk of `no-capture` results to retry.
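
For the random sample, one option would be to pull a handful of `no-pdf-link` rows out of the database and eyeball the terminal URLs. A rough sketch, assuming a local `sandcrawler` database and that `ingest_file_result` has a `terminal_url` column (not verified here):

    import psycopg2

    # connection string is illustrative; adjust for the actual sandcrawler-db host
    conn = psycopg2.connect("dbname=sandcrawler")
    cur = conn.cursor()

    # random sample of dblp no-pdf-link results for manual review;
    # terminal_url is an assumed column name
    cur.execute("""
        SELECT ingest_request.base_url, ingest_file_result.terminal_url
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE ingest_request.link_source = 'dblp'
            AND ingest_file_result.status = 'no-pdf-link'
        ORDER BY random()
        LIMIT 30;
    """)
    for base_url, terminal_url in cur.fetchall():
        print(base_url, terminal_url)

    cur.close()
    conn.close()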