A new snapshot was released in April 2020 (the snapshot itself is dated
2020-02-25, but was not released until more than a month later).

Primary goals are:
- generate ingest requests for only *new* URLs
- bulk ingest these new URLs
- crawl any no-capture URLs from that batch
- re-bulk-ingest the no-capture batch
- analytics on failed ingests; eg, any particular domains that are failing to crawl

This ingest pipeline was started on 2020-04-07 by bnewbold.
Ran through the first two steps again on 2020-05-03 after unpaywall had
released another dump (dated 2020-04-27).
## Transform and Load

    # in sandcrawler pipenv on aitio
    zcat /schnell/UNPAYWALL-PDF-CRAWL-2020-04/unpaywall_snapshot_2020-02-25T115244.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /grande/snapshots/unpaywall_snapshot_2020-02-25.ingest_request.json
    => 24.7M 5:17:03 [ 1.3k/s]

    cat /grande/snapshots/unpaywall_snapshot_2020-02-25.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => 24.7M
    => Worker: Counter({'total': 24712947, 'insert-requests': 4282167, 'update-requests': 0})
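
A quick spot-check of the transform output (a sketch; the exact field set
emitted by `unpaywall2ingestrequest.py` is not shown here, these are just the
fields the SQL queries below rely on):

    head -n1 /grande/snapshots/unpaywall_snapshot_2020-02-25.ingest_request.json \
        | jq '{ingest_type, link_source, base_url}'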
Second time:

    # in sandcrawler pipenv on aitio
    zcat /schnell/UNPAYWALL-PDF-CRAWL-2020-04/unpaywall_snapshot_2020-04-27T153236.jsonl.gz | ./scripts/unpaywall2ingestrequest.py - | pv -l > /grande/snapshots/unpaywall_snapshot_2020-04-27.ingest_request.json
    => 25.2M 3:16:28 [2.14k/s]

    cat /grande/snapshots/unpaywall_snapshot_2020-04-27.ingest_request.json | pv -l | ./persist_tool.py ingest-request -
    => Worker: Counter({'total': 25189390, 'insert-requests': 1408915, 'update-requests': 0})
    => JSON lines pushed: Counter({'pushed': 25189390, 'total': 25189390})

## Dump new URLs and Bulk Ingest

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status IS NULL
    ) TO '/grande/snapshots/unpaywall_noingest_2020-04-08.rows.json';
    => 3696189

WARNING: forgot to transform from rows to ingest requests.

    cat /grande/snapshots/unpaywall_noingest_2020-04-08.rows.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
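
What the corrected push for this step would have looked like, using the same
`ingestrequest_row2json.py` transform that gets applied in the re-ingest
section below (a sketch; this exact command was not run at this point):

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_noingest_2020-04-08.rows.json \
        | rg -v "\\\\" \
        | jq . -c \
        | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1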
Second time:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-05-01'
            AND ingest_file_result.status IS NULL
    ) TO '/grande/snapshots/unpaywall_noingest_2020-05-03.rows.json';
    => 1799760

WARNING: forgot to transform from rows to ingest requests.

    cat /grande/snapshots/unpaywall_noingest_2020-05-03.rows.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

## Dump no-capture, Run Crawl
Make two ingest request dumps: one with "all" URLs, which we will have heritrix
attempt to crawl, and then one with certain domains filtered out, which we may
or may not bother trying to ingest (due to expectation of failure).

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status = 'no-capture'
    ) TO '/grande/snapshots/unpaywall_nocapture_all_2020-05-04.rows.json';
    => 2734145

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status = 'no-capture'
            AND ingest_request.base_url NOT LIKE '%journals.sagepub.com%'
            AND ingest_request.base_url NOT LIKE '%pubs.acs.org%'
            AND ingest_request.base_url NOT LIKE '%ahajournals.org%'
            AND ingest_request.base_url NOT LIKE '%www.journal.csj.jp%'
            AND ingest_request.base_url NOT LIKE '%aip.scitation.org%'
            AND ingest_request.base_url NOT LIKE '%academic.oup.com%'
            AND ingest_request.base_url NOT LIKE '%tandfonline.com%'
    ) TO '/grande/snapshots/unpaywall_nocapture_2020-05-04.rows.json';
    => 2602408

NOTE: forgot here (again) to transform from "rows" to full ingest requests.

Not actually a very significant size difference between the unfiltered and
domain-filtered dumps after all (~130k URLs).
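
That difference can be double-checked directly against the excluded domains
(a sketch):

    cat /grande/snapshots/unpaywall_nocapture_all_2020-05-04.rows.json \
        | rg -v "\\\\" \
        | jq -r .base_url \
        | grep -c -E 'journals\.sagepub\.com|pubs\.acs\.org|ahajournals\.org|www\.journal\.csj\.jp|aip\.scitation\.org|academic\.oup\.com|tandfonline\.com'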
See `journal-crawls` repo for details on seedlist generation and crawling.
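
The seedlist is essentially just the unique URLs out of the "all" no-capture
dump; roughly (a sketch; output filename is illustrative, and the canonical
process lives in that repo):

    cat /grande/snapshots/unpaywall_nocapture_all_2020-05-04.rows.json \
        | rg -v "\\\\" \
        | jq -r .base_url \
        | sort -u \
        > unpaywall_nocapture_2020-05-04.seedlist.txt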
## Re-Ingest Post-Crawl
NOTE: if we *do* want to do cleanup eventually, could look for fatcat edits
between 2020-04-01 and 2020-05-25 which have limited "extra" metadata (eg, no
evidence or `oa_status`).
The earlier bulk ingests were done wrong (forgot to transform from rows to
full ingest request docs), so going to re-do those, which should be a superset
of the no-capture crawl URLs:

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_noingest_2020-04-08.rows.json | pv -l > /grande/snapshots/unpaywall_noingest_2020-04-08.json
    => 1.26M 0:00:58 [21.5k/s]
    => previously: 3,696,189

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_noingest_2020-05-03.rows.json | pv -l > /grande/snapshots/unpaywall_noingest_2020-05-03.json
    => 1.26M 0:00:56 [22.3k/s]

Crap, looks like the 2020-04-08 segment got overwritten with 2020-05 data by
accident. Hrm... need to re-ingest *all* recent unpaywall URLs:

    COPY (
        SELECT row_to_json(ingest_request.*)
        FROM ingest_request
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
    ) TO '/grande/snapshots/unpaywall_all_recent_requests_2020-05-26.rows.json';
    => COPY 5691106

    ./scripts/ingestrequest_row2json.py /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.rows.json | pv -l | shuf > /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json
    => 5.69M 0:04:26 [21.3k/s]

Start small:

    cat /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json | head -n200 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1

Looks good (whew), run the full thing:

    cat /grande/snapshots/unpaywall_all_recent_requests_2020-05-26.requests.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
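
That leaves the last item from the plan above: analytics on failed ingests. A
starting point (a sketch; the `sandcrawler` database name is a guess, adjust
the psql connection as needed, and 'success' is assumed to mark successful
ingests) is to group recent unpaywall failures by domain and status:

    psql sandcrawler -c "
        SELECT
            substring(ingest_request.base_url FROM '//([^/]+)') AS domain,
            ingest_file_result.status,
            COUNT(*)
        FROM ingest_request
        LEFT JOIN ingest_file_result
            ON ingest_file_result.ingest_type = ingest_request.ingest_type
            AND ingest_file_result.base_url = ingest_request.base_url
        WHERE
            ingest_request.ingest_type = 'pdf'
            AND ingest_request.link_source = 'unpaywall'
            AND date(ingest_request.created) > '2020-04-01'
            AND ingest_file_result.status IS NOT NULL
            AND ingest_file_result.status != 'success'
        GROUP BY domain, ingest_file_result.status
        ORDER BY COUNT(*) DESC
        LIMIT 30;
    "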