notes/cleanups/wayback_timestamps.md




At some point, using the arabesque importer (from targeted crawling), we
accidentally imported a bunch of files with wayback URLs that have 12-digit
timestamps instead of the full canonical 14-digit timestamps.
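
A minimal sketch of how the problem URLs can be recognized, assuming file entities carry a `urls` list of objects with a `url` field. The helper names and the example URLs are hypothetical, for illustration only:

```python
import re

# Captures the timestamp and the original URL from a wayback capture URL.
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d+)/(.+)$")


def is_short_wayback_url(url):
    """True if `url` is a wayback URL with a 12-digit timestamp."""
    m = WAYBACK_RE.match(url)
    return bool(m) and len(m.group(1)) == 12


def short_wayback_urls(file_entity):
    """List the 12-digit-timestamp wayback URLs on a file entity (a dict)."""
    return [
        u["url"]
        for u in file_entity.get("urls") or []
        if is_short_wayback_url(u.get("url", ""))
    ]


# Made-up example URLs: one truncated (YYYYMMDDhhmm), one canonical (YYYYMMDDhhmmss).
short = "https://web.archive.org/web/201801010000/http://example.com/paper.pdf"
full = "https://web.archive.org/web/20180101000000/http://example.com/paper.pdf"
entity = {"urls": [{"url": short, "rel": "webarchive"}, {"url": full, "rel": "webarchive"}]}
print(short_wayback_urls(entity))
```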


## Prep (2021-11-04)

Download most recent file export:

    wget https://archive.org/download/fatcat_bulk_exports_2021-10-07/file_export.json.gz

Filter to files with the problem of interest:

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{12}/' \
        | gzip \
        > files_20211007_shortts.json.gz
    # 111M 0:12:35

    zcat files_20211007_shortts.json.gz | wc -l
    # 7,935,009

    zcat files_20211007_shortts.json.gz | shuf -n10000 > files_20211007_shortts.10k_sample.json

Wow, this is a lot more than I thought!

## Fetch Complete URL

Want to export JSON like:

    file_entity
        [existing file entity]
    full_urls[]
        <short>: <long>
    status: str

Status is one of:

- 'success-self': the file entity already includes the corresponding full (14-digit) URL
- 'success-db': lookup URL against sandcrawler-db succeeded, and SHA1 matched
- 'success-cdx': CDX API lookup succeeded, and SHA1 matched
- 'fail-hash': found a CDX record, but wrong hash
- 'fail-not-found': no matching CDX record found
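
The status logic above could be sketched roughly as follows. This is not the actual cleanup script: `db_lookup` and `cdx_lookup` are hypothetical callables standing in for the sandcrawler-db and CDX API lookups, each taking `(original_url, short_timestamp)` and returning `(full_timestamp, sha1hex)` or `None`. The sketch also assumes that a wrong hash from either lookup counts as 'fail-hash', and that a record with several short URLs reports the first failure, if any.

```python
import re

SHORT_RE = re.compile(r"^(https?://web\.archive\.org/web/)(\d{12})/(.+)$")
FULL_RE = re.compile(r"^(https?://web\.archive\.org/web/)(\d{14})/(.+)$")


def resolve_short_url(file_entity, url, db_lookup, cdx_lookup):
    """Return (status, full_url_or_None) for one 12-digit wayback URL."""
    prefix, short_ts, orig_url = SHORT_RE.match(url).groups()

    # 1. 'success-self': the entity already lists a matching 14-digit URL.
    for other in file_entity.get("urls") or []:
        fm = FULL_RE.match(other.get("url", ""))
        if fm and fm.group(2).startswith(short_ts) and fm.group(3) == orig_url:
            return "success-self", other["url"]

    # 2. sandcrawler-db, then 3. CDX API; both are hypothetical callables.
    wrong_hash = False
    for status, lookup in (("success-db", db_lookup), ("success-cdx", cdx_lookup)):
        hit = lookup(orig_url, short_ts)
        if hit is None:
            continue
        full_ts, sha1hex = hit
        if sha1hex == file_entity.get("sha1"):
            return status, prefix + full_ts + "/" + orig_url
        wrong_hash = True
    return ("fail-hash" if wrong_hash else "fail-not-found"), None


def build_record(file_entity, db_lookup, cdx_lookup):
    """Build the {file_entity, full_urls, status} record described above."""
    full_urls = {}
    status = None
    for u in file_entity.get("urls") or []:
        url = u.get("url", "")
        if not SHORT_RE.match(url):
            continue
        url_status, full_url = resolve_short_url(file_entity, url, db_lookup, cdx_lookup)
        if full_url:
            full_urls[url] = full_url
        # Assumption: keep the first failure, otherwise the first success.
        if status is None or (status.startswith("success") and url_status.startswith("fail")):
            status = url_status
    return {"file_entity": file_entity, "full_urls": full_urls, "status": status}
```

Passing the lookups in as callables keeps the sketch testable offline; in a real run they would wrap a database query and an HTTP request respectively.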