At some point, using the arabesque importer (from targeted crawling), we
accidentally imported a bunch of files with wayback URLs that have 12-digit
timestamps, instead of the full canonical 14-digit timestamps.
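As a rough illustration of the problem, a replay URL can be checked for a truncated timestamp with a small regular expression. This is only a sketch: the helper name `short_wayback_ts` and the `example.com` URLs are made up for illustration, and it assumes the timestamp is the path segment directly after `/web/`.

```python
import re
from typing import Optional

# Digits directly after "/web/"; a canonical wayback timestamp has 14 digits.
WAYBACK_TS_RE = re.compile(r"://web\.archive\.org/web/(\d+)/")


def short_wayback_ts(url: str) -> Optional[str]:
    """Return the timestamp if it has fewer than 14 digits, else None."""
    m = WAYBACK_TS_RE.search(url)
    if m and len(m.group(1)) < 14:
        return m.group(1)
    return None


# Hypothetical URLs, only to show both outcomes.
print(short_wayback_ts("https://web.archive.org/web/201801011200/http://example.com/a.pdf"))
print(short_wayback_ts("https://web.archive.org/web/20180101120000/http://example.com/a.pdf"))
```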
## Prep (2021-11-04)
Download most recent file export:

    wget https://archive.org/download/fatcat_bulk_exports_2021-10-07/file_export.json.gz

Filter to files with problem of interest:

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{12}/' \
        | gzip \
        > files_20211007_shortts.json.gz
    # 111M 0:12:35

    zcat files_20211007_shortts.json.gz | wc -l
    # 7,935,009

    zcat files_20211007_shortts.json.gz | shuf -n10000 > files_20211007_shortts.10k_sample.json

Wow, this is a lot more than I thought!
There might also be some other short URL patterns, check for those:

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{1,11}/' \
        | gzip \
        > files_20211007_veryshortts.json.gz
    # skipped, merging with below

    zcat file_export.json.gz \
        | rg 'web.archive.org/web/None/' \
        | pv -l \
        > /dev/null
    # 0.00 0:10:06 [0.00 /s]
    # whew, that pattern has been fixed it seems

    zcat file_export.json.gz | rg '/None/' | pv -l > /dev/null
    # 2.00 0:10:01 [3.33m/s]

    zcat file_export.json.gz \
        | rg 'web.archive.org/web/\d{13}/' \
        | pv -l \
        > /dev/null
    # 0.00 0:10:09 [0.00 /s]
Yes, 4-digit is a popular pattern as well, need to handle those:

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{4,12}/' \
        | gzip \
        > files_20211007_moreshortts.json.gz
    # 111M 0:13:22 [ 139k/s]

    zcat files_20211007_moreshortts.json.gz | wc -l
    # 9,958,854

    zcat files_20211007_moreshortts.json.gz | shuf -n10000 > files_20211007_moreshortts.10k_sample.json
## Fetch Complete URL
Want to export JSON like:

    file_entity
        [existing file entity]
    full_urls[]: list of Dicts[str,str]
        <short_url>: <full_url>
    status: str
Status one of:
- 'success-self': the file already has a fixed URL internally
- 'success-db': lookup URL against sandcrawler-db succeeded, and SHA1 matched
- 'success-api': CDX API lookup succeeded, and SHA1 matched
- 'fail-not-found': no matching CDX record found
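A minimal sketch of emitting one such output line. The helper name `make_record` is hypothetical, and it assumes `full_urls` is a single mapping from short URL to full URL, which is one reading of the schema above. The status names come from these notes; the sample tallies report the CDX API case as `success-api`, so that is the name used here.

```python
import json
from typing import Any, Dict

# Status names as they appear in these notes (assumption: this set is complete
# for the successful and not-found cases).
KNOWN_STATUSES = {"success-self", "success-db", "success-api", "fail-not-found"}


def make_record(file_entity: Dict[str, Any], full_urls: Dict[str, str], status: str) -> str:
    """Serialize one result as a single JSON line."""
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unknown status: {status}")
    return json.dumps(
        {"file_entity": file_entity, "full_urls": full_urls, "status": status},
        sort_keys=True,
    )


# Hypothetical entity and URLs, for illustration only.
print(make_record(
    {"ident": "example"},
    {"https://web.archive.org/web/2018/http://example.com/a.pdf":
     "https://web.archive.org/web/20180101120000/http://example.com/a.pdf"},
    "success-db",
))
```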
Ran over a sample:

    cat files_20211007_shortts.10k_sample.json | ./fetch_full_cdx_ts.py > sample_out.json

    cat sample_out.json | jq .status | sort | uniq -c
          5 "fail-not-found"
        576 "success-api"
       7212 "success-db"
       2207 "success-self"

    head -n1000 | ./fetch_full_cdx_ts.py > sample_out.json

    zcat files_20211007_veryshortts.json.gz | head -n1000 | ./fetch_full_cdx_ts.py | jq .status | sort | uniq -c
          2 "fail-not-found"
        168 "success-api"
        208 "success-db"
        622 "success-self"

Investigating the "fail-not-found" cases: they look like http/https URL
mismatches (not exact matches). Going to put off handling these for now
because they are a small fraction and more delicate to fix.
Again with the broader set:

    cat files_20211007_moreshortts.10k_sample.json | ./fetch_full_cdx_ts.py > sample_out.json

    cat sample_out.json | jq .status | sort | uniq -c
          9 "fail-not-found"
        781 "success-api"
       6175 "success-db"
       3035 "success-self"

While running a larger batch, got a CDX API error:

    requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fwww.psychologytoday.com%2Ffiles%2Fu47%2FHenry_et_al.pdf&from=2017&to=2017&matchType=exact&output=json&limit=20

    org.archive.util.io.RuntimeIOException: org.archive.wayback.exception.AdministrativeAccessControlException: Blocked Site Error

So maybe need to use credentials after all.
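For reference, the query in that error can be rebuilt from its parameters, and a 403 can be recorded as a status instead of crashing the batch. This is a sketch, not the code in `fetch_full_cdx_ts.py`: the function names are hypothetical, it uses the standard library rather than `requests`, and it does not address credentials. The `fail-cdx-403` name is borrowed from the status tallies later in these notes.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import List, Optional, Tuple

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url: str, year: str, limit: int = 20) -> str:
    """Build an exact-match CDX query restricted to a single year."""
    params = [
        ("url", url),
        ("from", year),
        ("to", year),
        ("matchType", "exact"),
        ("output", "json"),
        ("limit", str(limit)),
    ]
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_cdx_rows(url: str, year: str) -> Tuple[str, Optional[List[List[str]]]]:
    """Return (status, rows); a 403 is reported as a status, other errors raise."""
    try:
        with urllib.request.urlopen(build_cdx_query(url, year), timeout=30) as resp:
            body = resp.read().decode("utf-8")
    except urllib.error.HTTPError as e:
        if e.code == 403:
            return "fail-cdx-403", None
        raise
    # Treat an empty body as "no rows" rather than a JSON parse error.
    return "ok", json.loads(body) if body.strip() else []


print(build_cdx_query("https://www.psychologytoday.com/files/u47/Henry_et_al.pdf", "2017"))
```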
## Cleanup Process
Other possible cleanups to run at the same time, which would not require
external requests or other context:
- URL has ://archive.org/ link with rel=repository => rel=archive
- mimetype is bogus => clean mimetype
- bogus file => set some new extra field, like scope=stub or scope=partial (?)
It looks like the rel swap is already implemented in `generic_file_cleanups()`.
From sampling it seems like the mimetype issue is pretty small, so not going to
bite that off now. The "bogus file" issue requires thought, so also skipping.
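For illustration only, the rel swap from the first bullet could look roughly like the following. This is not the code in `generic_file_cleanups()`; the function name `swap_archive_rel` is hypothetical, and it assumes each URL entry is a dict with `url` and `rel` keys.

```python
from typing import Dict, List


def swap_archive_rel(urls: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return a copy of `urls` where archive.org links tagged rel=repository become rel=archive."""
    fixed = []
    for u in urls:
        if "://archive.org/" in u.get("url", "") and u.get("rel") == "repository":
            u = {**u, "rel": "archive"}  # copy, so the input list is left untouched
        fixed.append(u)
    return fixed


# Hypothetical entries, for illustration only.
print(swap_archive_rel([
    {"url": "https://archive.org/download/some-item/file.pdf", "rel": "repository"},
    {"url": "https://example.com/file.pdf", "rel": "repository"},
]))
```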
## Commands (old)
Running with 8x parallelism to not break things; expecting some errors along
the way, may need to add handlers for connection errors etc:

    # OLD SNAPSHOT
    zcat files_20211007_moreshortts.json.gz \
        | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
        | pv -l \
        | gzip \
        > files_20211007_moreshortts.fetched.json.gz

At 300 records/sec, this should take around 9-10 hours to process.
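Quick sanity check on that estimate, using the 9,958,854 line count from the prep step and taking the 300 records/sec figure as given:

```python
records = 9_958_854      # lines in files_20211007_moreshortts.json.gz
rate_per_sec = 300       # assumed aggregate throughput across the 8 workers

hours = records / rate_per_sec / 3600
print(f"{hours:.1f} hours")
```

That comes out a little over 9 hours, so the 9-10 hour range leaves some margin for slowdowns.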
## Prep Again (2021-11-09)
After fixing "sort" issue and re-dumping file entities (2021-11-05 snapshot).
Filter again:

    # note: in the future use pigz instead of gzip here
    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{4,12}/' \
        | gzip \
        > files_20211105_moreshortts.json.gz
    # 112M 0:13:27 [ 138k/s]

    zcat files_20211105_moreshortts.json.gz | wc -l
    # 9,958,854
    # good, exact same number as previous snapshot

    zcat files_20211105_moreshortts.json.gz | shuf -n10000 > files_20211105_moreshortts.10k_sample.json
    # done

    cat files_20211105_moreshortts.10k_sample.json \
        | ./fetch_full_cdx_ts.py \
        | pv -l \
        > files_20211105_moreshortts.10k_sample.fetched.json
    # 10.0k 0:03:36 [46.3 /s]

    cat files_20211105_moreshortts.10k_sample.fetched.json | jq .status | sort | uniq -c
         13 "fail-not-found"
        774 "success-api"
       6193 "success-db"
       3020 "success-self"

After tweaking `success-self` logic:

         13 "fail-not-found"
        859 "success-api"
       6229 "success-db"
       2899 "success-self"
## Testing in QA
Copied `sample_out.json` to fatcat QA instance and renamed as `files_20211007_moreshortts.10k_sample.fetched.json`

    # OLD ATTEMPT
    export FATCAT_API_AUTH_TOKEN=[...]
    head -n10 /srv/fatcat/datasets/files_20211007_moreshortts.10k_sample.fetched.json \
        | python -m fatcat_tools.cleanups.file_short_wayback_ts -

Ran into issues, iterated above.

Trying again with updated script and sample file:

    export FATCAT_AUTH_WORKER_CLEANUP=[...]
    head -n10 /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
    # Counter({'total': 10, 'update': 10, 'skip': 0, 'insert': 0, 'exists': 0})

Manually inspected and these look good. Trying some repeats and larger batches:

    head -n10 /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
    # Counter({'total': 10, 'skip-revision-changed': 10, 'skip': 0, 'insert': 0, 'update': 0, 'exists': 0})

    head -n1000 /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
    [...]
    bad replacement URL: partial_ts=201807271139 original=http://www.scielo.br/pdf/qn/v20n1/4918.pdf fix_url=https://web.archive.org/web/20170819080342/http://www.scielo.br/pdf/qn/v20n1/4918.pdf
    bad replacement URL: partial_ts=201904270207 original=https://www.matec-conferences.org/articles/matecconf/pdf/2018/62/matecconf_iccoee2018_03008.pdf fix_url=https://web.archive.org/web/20190501060839/https://www.matec-conferences.org/articles/matecconf/pdf/2018/62/matecconf_iccoee2018_03008.pdf
    bad replacement URL: partial_ts=201905011445 original=https://cdn.intechopen.com/pdfs/5886.pdf fix_url=https://web.archive.org/web/20190502203832/https://cdn.intechopen.com/pdfs/5886.pdf
    [...]
    # Counter({'total': 1000, 'update': 969, 'skip': 19, 'skip-bad-replacement': 18, 'skip-revision-changed': 10, 'skip-bad-wayback-timestamp': 2, 'skip-status': 1, 'insert': 0, 'exists': 0})

It looks like these "bad replacement URLs" are due to timestamp mismatches: the partial timestamp is not a prefix of the full timestamp in the replacement URL.
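The mismatch can be checked mechanically: a valid replacement should carry a 14-digit timestamp that starts with the partial one. A minimal sketch, with a hypothetical helper name (`replacement_matches`); this is not necessarily how the cleanup script itself performs the check:

```python
import re

# Timestamp is the run of digits directly after "/web/".
TS_RE = re.compile(r"/web/(\d+)/")


def replacement_matches(partial_ts: str, fix_url: str) -> bool:
    """True if fix_url has a full 14-digit timestamp that extends partial_ts."""
    m = TS_RE.search(fix_url)
    if m is None:
        return False
    full_ts = m.group(1)
    return len(full_ts) == 14 and full_ts.startswith(partial_ts)


# First "bad replacement URL" from the log above: 2018 partial, 2017 capture.
print(replacement_matches(
    "201807271139",
    "https://web.archive.org/web/20170819080342/http://www.scielo.br/pdf/qn/v20n1/4918.pdf",
))
```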
Tweaked fetch script and re-ran:

    # Counter({'total': 1000, 'skip-revision-changed': 979, 'update': 18, 'skip-bad-wayback-timestamp': 2, 'skip': 1, 'skip-status': 1, 'insert': 0, 'exists': 0})

Cool. Sort of curious what the deal is with those `skip-bad-wayback-timestamp`.

Run the rest through:

    cat /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
    # Counter({'total': 10000, 'update': 8976, 'skip-revision-changed': 997, 'skip-bad-wayback-timestamp': 14, 'skip': 13, 'skip-status': 13, 'insert': 0, 'exists': 0})

Should tweak batch size to 100 (vs. 50).

How to parallelize import:

    # from within pipenv
    cat /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
        | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.file_short_wayback_ts -
## Full Batch Commands
Running in bulk again:

    zcat files_20211105_moreshortts.json.gz \
        | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
        | pv -l \
        | gzip \
        > files_20211105_moreshortts.fetched.json.gz

Ran into one error: `requests.exceptions.HTTPError: 503 Server Error: Service
Temporarily Unavailable for url: [...]`. Will try again; if there are more
failures, may need to split up into smaller chunks.

Unexpected:

    Traceback (most recent call last):
      File "./fetch_full_cdx_ts.py", line 200, in <module>
        main()
      File "./fetch_full_cdx_ts.py", line 197, in main
        print(json.dumps(process_file(fe, session=session)))
      File "./fetch_full_cdx_ts.py", line 118, in process_file
        assert seg[4].isdigit()
    AssertionError
    3.96M 3:04:46 [ 357 /s]

Ugh.
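One way to avoid losing hours of work to a single odd URL is to skip it instead of asserting. The sketch below assumes `seg` is the URL split on `/`, so that `seg[4]` is the timestamp segment; that is a guess from the traceback, not confirmed by these notes, and the helper name `split_wayback_url` is hypothetical.

```python
from typing import Optional, Tuple


def split_wayback_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (timestamp, original_url), or None if the URL is not in the expected shape."""
    seg = url.split("/")
    # Expected: ['https:', '', 'web.archive.org', 'web', '<timestamp>', <original URL pieces>...]
    if len(seg) < 6 or seg[2] != "web.archive.org" or seg[3] != "web":
        return None
    if not seg[4].isdigit():
        # e.g. a "None" segment, or a timestamp with a trailing modifier
        return None
    return seg[4], "/".join(seg[5:])


print(split_wayback_url(
    "https://web.archive.org/web/20170819080342/http://www.scielo.br/pdf/qn/v20n1/4918.pdf"
))
```

A caller could count the `None` results and continue, so a full run still completes and the skipped records can be inspected afterwards.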
    zcat files_20211105_moreshortts.json.gz \
        | tac \
        | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
        | pv -l \
        | gzip \
        > files_20211105_moreshortts.fetched.json.gz
    # 9.96M 6:38:43 [ 416 /s]

Looks like the last small tweak was successful! This was with git commit
`cd09c6d6bd4deef0627de4f8a8a301725db01e14`.

    zcat files_20211105_moreshortts.fetched.json.gz | jq .status | sort | uniq -c | sort -nr
    6228307 "success-db"
    2876033 "success-self"
     846844 "success-api"
       7583 "fail-not-found"
         87 "fail-cdx-403"
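The same tally can be done in Python when percentages are wanted alongside the counts. A small sketch: `tally_status` is a hypothetical helper that mirrors the `jq .status | sort | uniq -c` pipeline, and the numbers are copied from the output above. They sum to 9,958,854, matching the input line count.

```python
import json
from collections import Counter
from typing import Iterable


def tally_status(lines: Iterable[str]) -> Counter:
    """Count the `status` field across JSON lines, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line)["status"]] += 1
    return counts


# Counts as reported in the full-batch run above.
reported = Counter({
    "success-db": 6228307,
    "success-self": 2876033,
    "success-api": 846844,
    "fail-not-found": 7583,
    "fail-cdx-403": 87,
})
total = sum(reported.values())
for status, n in reported.most_common():
    print(f"{n:>8} {n / total:7.2%} {status}")
```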
## Follow-up (2021-11-16)
Re-fetched with an updated file export, and also fixed a small one-line bug
in `fetch_full_cdx_ts.py` which caused most multi-URL file cleanups to be missed.

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{4,12}/' \
        | gzip \
        > files_20211127_moreshortts.json.gz
    # 112M 0:09:38 [ 193k/s]

    zcat files_20211127_moreshortts.json.gz | wc -l
    # 29,494

    zcat files_20211127_moreshortts.json.gz \
        | parallel -j6 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
        | pv -l \
        | gzip \
        > files_20211127_moreshortts.fetched.json.gz
    # 29.5k 0:14:33 [33.8 /s]

    zcat files_20211127_moreshortts.fetched.json.gz | jq .status | sort | uniq -c | sort -nr
      21376 "success-api"
       7576 "fail-not-found"
        439 "success-self"
         87 "fail-cdx-403"
         16 "success-db"