At some point, using the arabesque importer (from targeted crawling), we
accidentally imported a bunch of files with wayback URLs that have 12-digit
timestamps, instead of the full canonical 14-digit timestamps.
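
To make the problem concrete, here is a minimal Python sketch of how a short timestamp could be detected in a single URL. The function names are hypothetical and are not taken from the importer or cleanup code; the only assumption is that the timestamp is the all-digit path segment right after `/web/`.

```python
import re
from typing import Optional

# Matches the all-digit timestamp segment of a wayback replay URL.
WAYBACK_TS_RE = re.compile(r"://web\.archive\.org/web/(\d+)/")


def wayback_timestamp(url: str) -> Optional[str]:
    """Return the timestamp segment of a wayback URL, or None if there is none."""
    m = WAYBACK_TS_RE.search(url)
    return m.group(1) if m else None


def is_short_timestamp(url: str) -> bool:
    """True if the URL carries a wayback timestamp shorter than 14 digits."""
    ts = wayback_timestamp(url)
    return ts is not None and len(ts) < 14
```

A 12-digit timestamp such as `201807271139` is flagged, while a full 14-digit one such as `20170819080342` is not.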


## Prep (2021-11-04)

Download most recent file export:

    wget https://archive.org/download/fatcat_bulk_exports_2021-10-07/file_export.json.gz

Filter to files with problem of interest:

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{12}/' \
        | gzip \
        > files_20211007_shortts.json.gz
    # 111M 0:12:35

    zcat files_20211007_shortts.json.gz | wc -l
    # 7,935,009

    zcat files_20211007_shortts.json.gz | shuf -n10000 > files_20211007_shortts.10k_sample.json

Wow, this is a lot more than I thought!

There might also be some other short URL patterns, check for those:

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{1,11}/' \
        | gzip \
        > files_20211007_veryshortts.json.gz
    # skipped, merging with below

    zcat file_export.json.gz \
        | rg 'web.archive.org/web/None/' \
        | pv -l \
        > /dev/null
    # 0.00  0:10:06 [0.00 /s]
    # whew, that pattern has been fixed it seems

    zcat file_export.json.gz | rg '/None/' | pv -l > /dev/null
    # 2.00  0:10:01 [3.33m/s]

    zcat file_export.json.gz \
        | rg 'web.archive.org/web/\d{13}/' \
        | pv -l \
        > /dev/null
    # 0.00  0:10:09 [0.00 /s]

Yes, 4-digit is a popular pattern as well, need to handle those:

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{4,12}/' \
        | gzip \
        > files_20211007_moreshortts.json.gz
    # 111M 0:13:22 [ 139k/s]

    zcat files_20211007_moreshortts.json.gz | wc -l
    # 9,958,854

    zcat files_20211007_moreshortts.json.gz | shuf -n10000 > files_20211007_moreshortts.10k_sample.json


## Fetch Complete URL

Want to export JSON like:

    file_entity
        [existing file entity]
    full_urls[]: list of Dicts[str,str]
        <short_url>: <full_url>
    status: str

Status one of:

- 'success-self': the file already has a fixed URL internally
- 'success-db': lookup URL against sandcrawler-db succeeded, and SHA1 matched
- 'success-api': CDX API lookup succeeded, and SHA1 matched
- 'fail-not-found': no matching CDX record found
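
As a rough illustration of the output shape described above, the sketch below builds one such record. The helper name `build_result` and the sample entity are hypothetical, and reading `full_urls` as a list of single-entry dicts is only one interpretation of the outline; the actual script may differ.

```python
import json
from typing import Dict


def build_result(file_entity: dict, url_fixes: Dict[str, str], status: str) -> dict:
    """Wrap a file entity with its short-to-full URL mappings and a status."""
    return {
        "file_entity": file_entity,
        # One {short_url: full_url} dict per fixed URL (assumed layout).
        "full_urls": [{short: full} for short, full in url_fixes.items()],
        "status": status,
    }


record = build_result(
    {"ident": "hypothetical-ident"},  # placeholder, not a real entity
    {
        "https://web.archive.org/web/201905011445/https://cdn.intechopen.com/pdfs/5886.pdf":
        "https://web.archive.org/web/20190501144500/https://cdn.intechopen.com/pdfs/5886.pdf",
    },
    "success-db",
)
print(json.dumps(record))
```

The full timestamp in the example is made up for illustration. Emitting one JSON object per line keeps the output compatible with `jq .status | sort | uniq -c`.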

Ran over a sample:

    cat files_20211007_shortts.10k_sample.json | ./fetch_full_cdx_ts.py > sample_out.json

    cat sample_out.json | jq .status | sort | uniq -c
          5 "fail-not-found"
        576 "success-api"
       7212 "success-db"
       2207 "success-self"

    head -n1000  | ./fetch_full_cdx_ts.py > sample_out.json

    zcat files_20211007_veryshortts.json.gz | head -n1000 | ./fetch_full_cdx_ts.py | jq .status | sort | uniq -c
          2 "fail-not-found"
        168 "success-api"
        208 "success-db"
        622 "success-self"

Investigating the "fail-not-found" cases: they look like http/https
mismatches, where the URL is not an exact match. Going to put off handling
these for now because they are a small fraction and more delicate.

Again with the broader set:

    cat files_20211007_moreshortts.10k_sample.json | ./fetch_full_cdx_ts.py > sample_out.json

    cat sample_out.json | jq .status | sort | uniq -c
          9 "fail-not-found"
        781 "success-api"
       6175 "success-db"
       3035 "success-self"

While running a larger batch, got a CDX API error:

    requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fwww.psychologytoday.com%2Ffiles%2Fu47%2FHenry_et_al.pdf&from=2017&to=2017&matchType=exact&output=json&limit=20

    org.archive.util.io.RuntimeIOException: org.archive.wayback.exception.AdministrativeAccessControlException: Blocked Site Error

So maybe need to use credentials after all.
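
For reference, the failing request can be reconstructed from its parts. This is a minimal sketch using only the parameters visible in the logged URL. The function name is hypothetical, and passing the partial timestamp as both `from` and `to` is an assumption based on that one example.

```python
from urllib.parse import urlencode

CDX_API = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url: str, partial_ts: str, limit: int = 20) -> str:
    """Build an exact-match CDX query bounded by the partial timestamp."""
    params = {
        "url": url,
        "from": partial_ts,  # assumed: the same partial timestamp bounds both ends
        "to": partial_ts,
        "matchType": "exact",
        "output": "json",
        "limit": limit,
    }
    return f"{CDX_API}?{urlencode(params)}"


print(cdx_query_url("https://www.psychologytoday.com/files/u47/Henry_et_al.pdf", "2017"))
```

A caller would probably want to catch the 403 and record it as a distinct failure status rather than crash the whole batch.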


## Cleanup Process

Other possible cleanups to run at the same time, which would not require
external requests or other context:

- URL has ://archive.org/ link with rel=repository => rel=archive
- mimetype is bogus => clean mimetype
- bogus file => set some new extra field, like scope=stub or scope=partial (?)

It looks like the rel swap is already implemented in `generic_file_cleanups()`.
From sampling it seems like the mimetype issue is pretty small, so not going to
bite that off now. The "bogus file" issue requires thought, so also skipping.
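
For context, the rel swap amounts to something like the following. This is an illustrative sketch only and is not the code in `generic_file_cleanups()`; the function name and the plain-dict representation of a URL entry are assumptions.

```python
from urllib.parse import urlparse


def fix_archive_rel(url_entry: dict) -> dict:
    """Return a copy with rel=archive if an archive.org URL is tagged rel=repository."""
    host = urlparse(url_entry["url"]).netloc
    if host == "archive.org" and url_entry.get("rel") == "repository":
        return {**url_entry, "rel": "archive"}
    return url_entry
```

Only the bare `archive.org` host is matched here, so `web.archive.org` URLs are left untouched.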


## Commands (old)

Running with 8x parallelism to not break things; expecting some errors along
the way, may need to add handlers for connection errors etc:

    # OLD SNAPSHOT
    zcat files_20211007_moreshortts.json.gz \
        | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
        | pv -l \
        | gzip \
        > files_20211007_moreshortts.fetched.json.gz

At 300 records/sec, this should take around 9-10 hours to process.
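
The estimate can be checked with quick arithmetic, using the 9,958,854 line count from the filter step and the assumed rate of 300 records/sec:

```python
records = 9_958_854  # lines in files_20211007_moreshortts.json.gz
rate_per_sec = 300   # assumed sustained throughput

hours = records / rate_per_sec / 3600
print(f"{hours:.1f} hours")  # prints "9.2 hours"
```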


## Prep Again (2021-11-09)

After fixing "sort" issue and re-dumping file entities (2021-11-05 snapshot).

Filter again:

    # note: in the future use pigz instead of gzip here
    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{4,12}/' \
        | gzip \
        > files_20211105_moreshortts.json.gz
    # 112M 0:13:27 [ 138k/s]

    zcat files_20211105_moreshortts.json.gz | wc -l
    # 9,958,854
    # good, exact same number as previous snapshot

    zcat files_20211105_moreshortts.json.gz | shuf -n10000 > files_20211105_moreshortts.10k_sample.json
    # done

    cat files_20211105_moreshortts.10k_sample.json \
        | ./fetch_full_cdx_ts.py \
        | pv -l \
        > files_20211105_moreshortts.10k_sample.fetched.json
    # 10.0k 0:03:36 [46.3 /s]

    cat files_20211105_moreshortts.10k_sample.fetched.json | jq .status | sort | uniq -c
         13 "fail-not-found"
        774 "success-api"
       6193 "success-db"
       3020 "success-self"

After tweaking `success-self` logic:

         13 "fail-not-found"
        859 "success-api"
       6229 "success-db"
       2899 "success-self"


## Testing in QA

Copied `sample_out.json` to the fatcat QA instance and renamed it to `files_20211007_moreshortts.10k_sample.fetched.json`:

    # OLD ATTEMPT
    export FATCAT_API_AUTH_TOKEN=[...]
    head -n10 /srv/fatcat/datasets/files_20211007_moreshortts.10k_sample.fetched.json \
        | python -m fatcat_tools.cleanups.file_short_wayback_ts -

Ran into issues, iterated above.

Trying again with updated script and sample file:

    export FATCAT_AUTH_WORKER_CLEANUP=[...]

    head -n10 /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
    # Counter({'total': 10, 'update': 10, 'skip': 0, 'insert': 0, 'exists': 0})

Manually inspected and these look good. Trying some repeats and larger batches:

    head -n10 /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
    # Counter({'total': 10, 'skip-revision-changed': 10, 'skip': 0, 'insert': 0, 'update': 0, 'exists': 0})

    head -n1000 /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
        | python -m fatcat_tools.cleanups.file_short_wayback_ts -

    [...]
    bad replacement URL: partial_ts=201807271139 original=http://www.scielo.br/pdf/qn/v20n1/4918.pdf fix_url=https://web.archive.org/web/20170819080342/http://www.scielo.br/pdf/qn/v20n1/4918.pdf
    bad replacement URL: partial_ts=201904270207 original=https://www.matec-conferences.org/articles/matecconf/pdf/2018/62/matecconf_iccoee2018_03008.pdf fix_url=https://web.archive.org/web/20190501060839/https://www.matec-conferences.org/articles/matecconf/pdf/2018/62/matecconf_iccoee2018_03008.pdf
    bad replacement URL: partial_ts=201905011445 original=https://cdn.intechopen.com/pdfs/5886.pdf fix_url=https://web.archive.org/web/20190502203832/https://cdn.intechopen.com/pdfs/5886.pdf
    [...]

    # Counter({'total': 1000, 'update': 969, 'skip': 19, 'skip-bad-replacement': 18, 'skip-revision-changed': 10, 'skip-bad-wayback-timestamp': 2, 'skip-status': 1, 'insert': 0, 'exists': 0})


It looks like these "bad replacement URLs" are due to timestamp mismatches. E.g., the partial timestamp is not a prefix of the final timestamp.
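
One way to read this check is as a prefix test on the timestamps. The sketch below is a hypothetical helper, not the cleanup script's actual code. It assumes the replacement URL must carry a full 14-digit timestamp that begins with the partial one.

```python
import re

FULL_TS_RE = re.compile(r"://web\.archive\.org/web/(\d{14})/")


def is_consistent_replacement(partial_ts: str, fix_url: str) -> bool:
    """True if fix_url has a 14-digit timestamp that starts with partial_ts."""
    m = FULL_TS_RE.search(fix_url)
    if m is None:
        return False
    return m.group(1).startswith(partial_ts)
```

With the first logged example, `partial_ts=201807271139` against a replacement timestamped `20170819080342` returns `False`, which matches the "bad replacement URL" message.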

Tweaked fetch script and re-ran:

    # Counter({'total': 1000, 'skip-revision-changed': 979, 'update': 18, 'skip-bad-wayback-timestamp': 2, 'skip': 1, 'skip-status': 1, 'insert': 0, 'exists': 0})

Cool. Sort of curious what the deal is with those `skip-bad-wayback-timestamp`.

Run the rest through:

    cat /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
        | python -m fatcat_tools.cleanups.file_short_wayback_ts -
    # Counter({'total': 10000, 'update': 8976, 'skip-revision-changed': 997, 'skip-bad-wayback-timestamp': 14, 'skip': 13, 'skip-status': 13, 'insert': 0, 'exists': 0})

Should tweak batch size to 100 (vs. 50).

How to parallelize import:

    # from within pipenv
    cat /srv/fatcat/datasets/files_20211105_moreshortts.10k_sample.fetched.json \
        | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_tools.cleanups.file_short_wayback_ts -


## Full Batch Commands

Running in bulk again:

    zcat files_20211105_moreshortts.json.gz \
        | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
        | pv -l \
        | gzip \
        > files_20211105_moreshortts.fetched.json.gz

Ran into one: `requests.exceptions.HTTPError: 503 Server Error: Service
Temporarily Unavailable for url: [...]`. Will try again; if there are more
failures, may need to split up into smaller chunks.

Unexpected:

    Traceback (most recent call last):
      File "./fetch_full_cdx_ts.py", line 200, in <module>
        main()
      File "./fetch_full_cdx_ts.py", line 197, in main
        print(json.dumps(process_file(fe, session=session)))
      File "./fetch_full_cdx_ts.py", line 118, in process_file
        assert seg[4].isdigit()
    AssertionError
    3.96M 3:04:46 [ 357 /s]

Ugh.

    zcat files_20211105_moreshortts.json.gz \
        | tac \
        | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
        | pv -l \
        | gzip \
        > files_20211105_moreshortts.fetched.json.gz
    # 9.96M 6:38:43 [ 416 /s]

Looks like the last small tweak was successful! This was with git commit
`cd09c6d6bd4deef0627de4f8a8a301725db01e14`.


    zcat files_20211105_moreshortts.fetched.json.gz | jq .status | sort | uniq -c | sort -nr
      6228307 "success-db"
      2876033 "success-self"
       846844 "success-api"
         7583 "fail-not-found"
           87 "fail-cdx-403"
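
As a sanity check, the status counts above can be summed and compared against the input line count. This short sketch only re-uses the numbers from the output above:

```python
from collections import Counter

status_counts = Counter({
    "success-db": 6_228_307,
    "success-self": 2_876_033,
    "success-api": 846_844,
    "fail-not-found": 7_583,
    "fail-cdx-403": 87,
})

total = sum(status_counts.values())
failed = sum(n for status, n in status_counts.items() if status.startswith("fail-"))

# The total should equal the 9,958,854 input lines counted earlier.
print(total, failed, f"{failed / total:.4%}")
```

The total matches the input count, so no records were dropped along the way, and the failures are well under 0.1% of the batch.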

## Follow-up (2021-11-16)

Re-fetching with an updated file export, after also fixing a small one-line
bug in `fetch_full_cdx_ts.py` that caused most multi-URL file cleanups to be
missed.

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{4,12}/' \
        | gzip \
        > files_20211127_moreshortts.json.gz
    # 112M 0:09:38 [ 193k/s]

    zcat files_20211127_moreshortts.json.gz | wc -l
    # 29,494

    zcat files_20211127_moreshortts.json.gz \
        | parallel -j6 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
        | pv -l \
        | gzip \
        > files_20211127_moreshortts.fetched.json.gz
    # 29.5k 0:14:33 [33.8 /s]

    zcat files_20211127_moreshortts.fetched.json.gz | jq .status | sort | uniq -c | sort -nr
      21376 "success-api"
       7576 "fail-not-found"
        439 "success-self"
         87 "fail-cdx-403"
         16 "success-db"