At some point, using the arabesque importer (from targeted crawling), we
accidentally imported a bunch of files with wayback URLs that have 12-digit
timestamps instead of the full canonical 14-digit timestamps.
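
For reference, a canonical wayback URL has a 14-digit `YYYYMMDDHHMMSS`
timestamp, while the problem URLs have only 12 digits (or fewer). A small
sketch of the distinction, with made-up example URLs; the regexes mirror the
`rg` filters used below:

    import re

    # 14-digit timestamp (YYYYMMDDHHMMSS) is the canonical wayback form;
    # anything shorter (down to a bare 4-digit year) needs to be repaired
    FULL_TS = re.compile(r"web\.archive\.org/web/(\d{14})/")
    SHORT_TS = re.compile(r"web\.archive\.org/web/(\d{4,12})/")

    good = "https://web.archive.org/web/20180304180652/https://example.com/paper.pdf"
    bad = "https://web.archive.org/web/201803041806/https://example.com/paper.pdf"

    assert FULL_TS.search(good) and not SHORT_TS.search(good)
    assert SHORT_TS.search(bad) and not FULL_TS.search(bad)
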
## Prep (2021-11-04)
Download most recent file export:

    wget https://archive.org/download/fatcat_bulk_exports_2021-10-07/file_export.json.gz

Filter to files with problem of interest:

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{12}/' \
        | gzip \
        > files_20211007_shortts.json.gz
    # 111M 0:12:35

    zcat files_20211007_shortts.json.gz | wc -l
    # 7,935,009

    zcat files_20211007_shortts.json.gz | shuf -n10000 > files_20211007_shortts.10k_sample.json

Wow, this is a lot more than I thought!
There might also be some other short URL patterns, check for those:

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{1,11}/' \
        | gzip \
        > files_20211007_veryshortts.json.gz
    # skipped, merging with below

    zcat file_export.json.gz \
        | rg 'web.archive.org/web/None/' \
        | pv -l \
        > /dev/null
    # 0.00 0:10:06 [0.00 /s]
    # whew, that pattern has been fixed it seems

    zcat file_export.json.gz | rg '/None/' | pv -l > /dev/null
    # 2.00 0:10:01 [3.33m/s]

    zcat file_export.json.gz \
        | rg 'web.archive.org/web/\d{13}/' \
        | pv -l \
        > /dev/null
    # 0.00 0:10:09 [0.00 /s]

Yes, 4-digit timestamps (year only) are a popular pattern as well; need to
handle those too:

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{4,12}/' \
        | gzip \
        > files_20211007_moreshortts.json.gz
    # 111M 0:13:22 [ 139k/s]

    zcat files_20211007_moreshortts.json.gz | wc -l
    # 9,958,854

    zcat files_20211007_moreshortts.json.gz | shuf -n10000 > files_20211007_moreshortts.10k_sample.json

## Fetch Complete URL
Want to export JSON like:

    file_entity
        [existing file entity]
    full_urls[]: list of Dicts[str,str]
        <short_url>: <full_url>
    status: str

Status one of:

- 'success-self': the file already has a fixed URL internally
- 'success-db': lookup URL against sandcrawler-db succeeded, and SHA1 matched
- 'success-cdx': CDX API lookup succeeded, and SHA1 matched (the script reports this as 'success-api' in the output below)
- 'fail-not-found': no matching CDX record found
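
The fetch script itself isn't reproduced in these notes. As a rough sketch
(not the actual `fetch_full_cdx_ts.py`; the sandcrawler-db path, rate
limiting, and error handling are omitted, and the from/to range trick is just
one way to scope the query), the CDX half of the lookup could look like:

    import base64
    from typing import Optional

    import requests

    CDX_API = "https://web.archive.org/cdx/search/cdx"

    def lookup_full_timestamp(short_ts: str, original_url: str, sha1_hex: str) -> Optional[str]:
        """Query the CDX API around a truncated timestamp and return the full
        14-digit timestamp of a capture whose SHA-1 matches the file entity."""
        # e.g. short_ts "201803041806" -> range 20180304180600..20180304180699
        params = {
            "url": original_url,
            "from": short_ts.ljust(14, "0"),
            "to": short_ts.ljust(14, "9"),
            "output": "json",
            "limit": "25",
        }
        resp = requests.get(CDX_API, params=params, timeout=30)
        resp.raise_for_status()
        rows = resp.json() if resp.text.strip() else []
        if not rows:
            return None
        header, captures = rows[0], rows[1:]
        for row in captures:
            rec = dict(zip(header, row))
            # CDX digests are base32-encoded SHA-1; file entities store hex
            if base64.b32decode(rec["digest"]).hex() == sha1_hex.lower():
                return rec["timestamp"]
        return None
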
Ran over a sample:

    cat files_20211007_shortts.10k_sample.json | ./fetch_full_cdx_ts.py > sample_out.json
    cat sample_out.json | jq .status | sort | uniq -c
         5 "fail-not-found"
       576 "success-api"
      7212 "success-db"
      2207 "success-self"

    zcat files_20211007_veryshortts.json.gz | head -n1000 | ./fetch_full_cdx_ts.py | jq .status | sort | uniq -c
         2 "fail-not-found"
       168 "success-api"
       208 "success-db"
       622 "success-self"

Investigating the "fail-not-found" cases, they look like http/https
not-exact-matches of the URL. Going to put off handling these for now because
they are a small fraction and more delicate to fix.
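
If/when these do get handled, the likely approach is just retrying the lookup
under the other scheme; a trivial sketch of generating the variants
(hypothetical helper, not in the current script):

    def scheme_variants(url: str) -> list:
        """Return the URL plus its http/https counterpart, for retrying
        lookups where only the scheme differs from the archived capture."""
        if url.startswith("http://"):
            return [url, "https://" + url[len("http://"):]]
        if url.startswith("https://"):
            return [url, "http://" + url[len("https://"):]]
        return [url]
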
Again with the broader set:

    cat files_20211007_moreshortts.10k_sample.json | ./fetch_full_cdx_ts.py > sample_out.json
    cat sample_out.json | jq .status | sort | uniq -c
         9 "fail-not-found"
       781 "success-api"
      6175 "success-db"
      3035 "success-self"

## Cleanup Process
Other possible cleanups to run at the same time, which would not require
external requests or other context:

- URL has ://archive.org/ link with rel=repository => rel=archive
- mimetype is bogus => clean mimetype
- bogus file => set some new extra field, like scope=stub or scope=partial (?)

It looks like the rel swap is already implemented in `generic_file_cleanups()`.
From sampling it seems like the mimetype issue is pretty small, so not going to
bite that off now. The "bogus file" issue requires thought, so also skipping.
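
For the main timestamp fix itself, the eventual cleanup worker's core
transform should just be a URL substitution on the file entity, using the
mapping produced by the fetch step; roughly (a sketch against the schema
above, not the actual worker):

    def rewrite_short_urls(record: dict) -> dict:
        """Given one line of the fetched output (file_entity plus full_urls),
        return the file entity with short-timestamp wayback URLs replaced.
        Assumes the schema sketched above: full_urls is a list of
        single-entry {short_url: full_url} dicts."""
        file_entity = record["file_entity"]
        mapping = {}
        for pair in record.get("full_urls") or []:
            mapping.update(pair)
        for url_obj in file_entity.get("urls", []):
            url_obj["url"] = mapping.get(url_obj["url"], url_obj["url"])
        return file_entity
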
## Commands
Running with 8x parallelism to not break things; expecting some errors along
the way, may need to add handlers for connection errors etc:

    zcat files_20211007_moreshortts.json.gz \
        | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
        | pv -l \
        | gzip \
        > files_20211007_moreshortts.fetched.json.gz

At 300 records/sec, this should take around 9-10 hours to process.
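
If connection errors do crop up during the long run, a thin retry wrapper
around the HTTP calls is probably enough; something like (hypothetical, not
currently in the script):

    import time

    import requests

    def get_with_retries(url: str, params: dict, attempts: int = 3, backoff: float = 5.0) -> requests.Response:
        """Retry transient connection/timeout/HTTP errors with linear backoff."""
        for i in range(attempts):
            try:
                resp = requests.get(url, params=params, timeout=30)
                resp.raise_for_status()
                return resp
            except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
                if i + 1 == attempts:
                    raise
                time.sleep(backoff * (i + 1))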