New snapshot released 2020-10-09. Want to do a mostly straightforward
load/ingest/crawl.

Proposed changes this time around:

- have bulk ingest store missing URLs in sandcrawler-db with a new `no-capture`
  status, and include those URLs in the heritrix3 crawl
- tweak heritrix3 config for additional PDF URL extraction patterns,
  particularly to improve OJS yield

## Transform and Load

    # in sandcrawler pipenv on aitio
    zcat /schnell/unpaywall/unpaywall_snapshot_2020-10-09T153852.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /grande/snapshots/unpaywall_snapshot_2020-10-09.ingest_request.json
    => 28.3M 3:19:03 [2.37k/s]

    cat /grande/snapshots/unpaywall_snapshot_2020-10-09.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => 28.3M 1:11:29 [ 6.6k/s]
    => Worker: Counter({'total': 28298500, 'insert-requests': 4119939, 'update-requests': 0})
    => JSON lines pushed: Counter({'total': 28298500, 'pushed': 28298500})
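
The transform step turns each Unpaywall snapshot record into zero or more ingest requests. The following is a minimal sketch of that idea, not the actual `unpaywall2ingestrequest.py`. It assumes the public Unpaywall snapshot layout (`doi`, `oa_locations`, `url_for_pdf`) and emits only the request columns that the SQL in these notes references (`base_url`, `ingest_type`, `link_source`, `ingest_request_source`). The `link_source_id` field, the de-duplication, and the helper name are hypothetical.

```python
import json


def record_to_requests(line):
    """Yield one ingest request dict per distinct PDF URL in a record.

    Hypothetical helper: the field choices are assumptions for illustration,
    not a copy of the real script.
    """
    record = json.loads(line)
    doi = record.get("doi")
    if not doi:
        return
    seen = set()
    for location in record.get("oa_locations") or []:
        url = location.get("url_for_pdf")
        if not url or url in seen:
            continue  # skip locations without a PDF URL, and duplicates
        seen.add(url)
        yield {
            "base_url": url,
            "ingest_type": "pdf",
            "link_source": "unpaywall",
            "link_source_id": doi.lower(),  # assumed field name
            "ingest_request_source": "unpaywall",
        }


# Made-up sample record with a duplicate and an empty location.
sample = json.dumps({
    "doi": "10.1234/EXAMPLE",
    "oa_locations": [
        {"url_for_pdf": "https://example.org/paper.pdf"},
        {"url_for_pdf": None},
        {"url_for_pdf": "https://example.org/paper.pdf"},
    ],
})
requests = list(record_to_requests(sample))
for request in requests:
    print(json.dumps(request, sort_keys=True))
```

Writing one JSON object per line keeps the output compatible with line-oriented tools such as `pv -l`, `shuf`, and `head`.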

## Dump new URLs, Transform, Bulk Ingest

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            -- AND date(ingest_request.created) > '2020-10-09'
            AND (ingest_file_result.status IS NULL
                OR ingest_file_result.status = 'no-capture')
    ) TO '/grande/snapshots/unpaywall_noingest_2020-10-09.rows.json';
    => COPY 4216339

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_noingest_2020-10-09.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_noingest_2020-10-09.ingest_request.json
    => 4.22M 0:02:48 [ 25k/s]

Start small, to test `no-capture` behavior:

    cat /grande/snapshots/unpaywall_noingest_2020-10-09.ingest_request.json | head -n1000 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

`no-capture` change looks good. Enqueue the whole batch:

    cat /grande/snapshots/unpaywall_noingest_2020-10-09.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
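
In the pipelines above, the `rg -v "\\\\"` stage drops every line that contains a literal backslash, and `jq . -c` then re-serializes each remaining request as compact single-line JSON. A rough Python equivalent of those two stages is sketched below. The function name and sample lines are made up for illustration, and the notes do not say why backslash-containing lines are skipped.

```python
import json


def filter_and_compact(lines):
    """Skip lines containing a backslash; re-emit the rest as compact JSON."""
    for line in lines:
        line = line.strip()
        if not line or "\\" in line:
            continue
        # Parsing and re-dumping also validates that the line is JSON.
        yield json.dumps(json.loads(line), separators=(",", ":"))


# Made-up input: the second line contains an escaped backslash.
raw = [
    '{"base_url": "https://example.org/a.pdf", "ingest_type": "pdf"}',
    '{"base_url": "https://example.org/b\\\\c.pdf", "ingest_type": "pdf"}',
]
compact = list(filter_and_compact(raw))
for line in compact:
    print(line)
```

Unlike `jq`, `json.dumps` escapes non-ASCII characters by default, so the output is only byte-identical for ASCII input.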

Overall status after that:

    SELECT ingest_file_result.status, COUNT(*)
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE
        ingest_request.ingest_type = 'pdf'
        AND ingest_request.link_source = 'unpaywall'
    GROUP BY status
    ORDER BY COUNT DESC
    LIMIT 25;

                   status                |  count
    -------------------------------------+----------
     success                             | 23661084
     no-capture                          |  3015448
     no-pdf-link                         |  2302092
     redirect-loop                       |  1542484
     terminal-bad-status                 |  1044654
     wrong-mimetype                      |   114315
     link-loop                           |    36357
     cdx-error                           |    20055
     null-body                           |    14513
     wayback-error                       |    14175
     gateway-timeout                     |     3747
     spn2-cdx-lookup-failure             |     1250
     petabox-error                       |     1171
     redirects-exceeded                  |      752
     invalid-host-resolution             |      464
     bad-redirect                        |      131
     spn2-error                          |      109
     spn2-error:job-failed               |       91
     timeout                             |       19
                                         |       13
     spn2-error:soft-time-limit-exceeded |        9
     bad-gzip-encoding                   |        6
     spn2-error:pending                  |        1
     skip-url-blocklist                  |        1
     pending                             |        1
    (25 rows)
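
To put the table in proportion, the counts can be converted to shares of all joined rows. The snippet below is only a convenience calculation over the numbers printed above, not part of the sandcrawler tooling. The `(empty)` key stands for the row with a blank status. Because the query uses `LIMIT 25`, the total covers only the statuses shown.

```python
# Counts copied from the status table in these notes.
counts = {
    "success": 23661084,
    "no-capture": 3015448,
    "no-pdf-link": 2302092,
    "redirect-loop": 1542484,
    "terminal-bad-status": 1044654,
    "wrong-mimetype": 114315,
    "link-loop": 36357,
    "cdx-error": 20055,
    "null-body": 14513,
    "wayback-error": 14175,
    "gateway-timeout": 3747,
    "spn2-cdx-lookup-failure": 1250,
    "petabox-error": 1171,
    "redirects-exceeded": 752,
    "invalid-host-resolution": 464,
    "bad-redirect": 131,
    "spn2-error": 109,
    "spn2-error:job-failed": 91,
    "timeout": 19,
    "(empty)": 13,
    "spn2-error:soft-time-limit-exceeded": 9,
    "bad-gzip-encoding": 6,
    "spn2-error:pending": 1,
    "skip-url-blocklist": 1,
    "pending": 1,
}

total = sum(counts.values())
shares = {status: n / total for status, n in counts.items()}

# Print the five largest statuses with their share of all rows.
for status, n in sorted(counts.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{status:<20} {n:>10} {shares[status]:6.1%}")
print(f"{'total':<20} {total:>10}")
```

By this count roughly three quarters of the joined rows are `success`, with `no-capture` and `no-pdf-link` as the two largest remaining groups.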

## Crawl

Re-crawl broadly (eg, all URLs that have failed before, not just `no-capture`):

    COPY (
        SELECT row_to_json(r) FROM (
            SELECT ingest_request.*, ingest_file_result.terminal_url as terminal_url
            FROM ingest_request
            LEFT JOIN ingest_file_result
                ON ingest_file_result.ingest_type = ingest_request.ingest_type
                AND ingest_file_result.base_url = ingest_request.base_url
            WHERE
                ingest_request.ingest_type = 'pdf'
                AND ingest_request.ingest_request_source = 'unpaywall'
                AND ingest_file_result.status != 'success'
        ) r
    ) TO '/grande/snapshots/oa_doi_reingest_recrawl_20201014.rows.json';
    => 8111845

Hrm. Not sure how to feel about the no-pdf-link. Guess it is fine!
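
One detail of the re-crawl query is worth spelling out. Under SQL three-valued logic, `status != 'success'` is not true for requests that have no `ingest_file_result` row at all, because the `LEFT JOIN` leaves their status NULL. This appears consistent with the numbers in these notes: the non-success rows of the status table sum to 8,111,858, and leaving out the 13 rows with an empty status gives the 8,111,845 rows dumped here. The sketch below reproduces the behavior with Python's built-in `sqlite3` as a stand-in for PostgreSQL, using a cut-down, assumed schema and made-up URLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: only the columns the join needs.
conn.executescript("""
    CREATE TABLE ingest_request (ingest_type TEXT, base_url TEXT);
    CREATE TABLE ingest_file_result (
        ingest_type TEXT, base_url TEXT, status TEXT
    );
    INSERT INTO ingest_request VALUES
        ('pdf', 'https://example.org/ok.pdf'),
        ('pdf', 'https://example.org/missing.pdf'),
        ('pdf', 'https://example.org/never-tried.pdf');
    INSERT INTO ingest_file_result VALUES
        ('pdf', 'https://example.org/ok.pdf', 'success'),
        ('pdf', 'https://example.org/missing.pdf', 'no-capture');
""")

JOIN = """
    SELECT ingest_request.base_url
    FROM ingest_request
    LEFT JOIN ingest_file_result
        ON ingest_file_result.ingest_type = ingest_request.ingest_type
        AND ingest_file_result.base_url = ingest_request.base_url
    WHERE {condition}
    ORDER BY ingest_request.base_url
"""


def urls(condition):
    """Run the join with the given WHERE condition and return the URLs."""
    return [row[0] for row in conn.execute(JOIN.format(condition=condition))]


# The never-tried request has a NULL status, so this comparison skips it.
not_success = urls("ingest_file_result.status != 'success'")
# An explicit IS NULL check brings it back.
with_null = urls(
    "ingest_file_result.status IS NULL"
    " OR ingest_file_result.status != 'success'"
)
print(not_success)
print(with_null)
```

If never-attempted requests should be part of a re-crawl dump, the `IS NULL OR ...` form used in the earlier "Dump new URLs" query is the one that includes them.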