Wow, this is a lot more than I thought!

There might also be some other short URL patterns; check for those:

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{1,11}/' \
        | gzip \
        > files_20211007_veryshortts.json.gz
    # skipped, merging with below

    zcat file_export.json.gz \
        | rg 'web.archive.org/web/None/' \
        | pv -l \
        > /dev/null
    # 0.00  0:10:06 [0.00 /s]
    # whew, it seems that pattern has been fixed

    zcat file_export.json.gz | rg '/None/' | pv -l > /dev/null
    # 2.00  0:10:01 [3.33m/s]

    zcat file_export.json.gz \
        | rg 'web.archive.org/web/\d{13}/' \
        | pv -l \
        > /dev/null
    # 0.00  0:10:09 [0.00 /s]
Yes, 4-digit timestamps (ie, bare years) are a popular pattern as well, so
need to handle those too; broaden the filter to 4-to-12-digit timestamps:

    zcat file_export.json.gz \
        | pv -l \
        | rg 'web.archive.org/web/\d{4,12}/' \
        | gzip \
        > files_20211007_moreshortts.json.gz
    # 111M 0:13:22 [ 139k/s]

    zcat files_20211007_moreshortts.json.gz | wc -l
    # 9,958,854

    zcat files_20211007_moreshortts.json.gz | shuf -n10000 > files_20211007_moreshortts.10k_sample.json
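For reference, the matching condition expressed in Python (just a sketch; the
`rg` filters above are what was actually run). A well-formed wayback timestamp
is exactly 14 digits (`YYYYMMDDhhmmss`), so anything shorter needs to be
resolved to a full capture timestamp:

    import re

    # hypothetical helper, not part of the actual cleanup scripts
    WAYBACK_TS = re.compile(r"://web\.archive\.org/web/(\d+)/")

    def has_short_timestamp(url: str) -> bool:
        m = WAYBACK_TS.search(url)
        return bool(m) and len(m.group(1)) != 14

    assert has_short_timestamp("http://web.archive.org/web/2017/https://example.com/paper.pdf")
    assert not has_short_timestamp("http://web.archive.org/web/20170615010101/https://example.com/paper.pdf")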
## Fetch Complete URL

Want to export JSON like:

    file_entity
        [existing file entity]
    full_urls[]: list of Dict[str,str]
        <short_url>: <full_url>
    status: str

Status one of:

- 'success-self': the file already has a fixed URL internally
- 'success-db': lookup URL against sandcrawler-db succeeded, and SHA1 matched
- 'success-api': CDX API lookup succeeded, and SHA1 matched
- 'fail-not-found': no matching CDX record found
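The CDX-API path presumably works along these lines (a minimal sketch with
assumed names; `fetch_full_cdx_ts.py` itself is not reproduced here). The
wayback CDX API accepts partial 1-14 digit timestamps for its `from`/`to`
parameters, so a short timestamp can be used directly as a query range:

    import base64
    from typing import Optional

    import requests

    CDX_API = "https://web.archive.org/cdx/search/cdx"

    def hex_to_b32(sha1_hex: str) -> str:
        # the CDX 'digest' column is base32-encoded SHA-1; file entities store hex
        return base64.b32encode(bytes.fromhex(sha1_hex)).decode("ascii")

    def resolve_full_timestamp(short_ts: str, url: str, sha1_hex: str) -> Optional[str]:
        # eg, from=2017&to=2017 covers all captures in that year; scan
        # candidates for one whose digest matches the file entity's SHA-1
        resp = requests.get(CDX_API, params={
            "url": url,
            "from": short_ts,
            "to": short_ts,
            "output": "json",
            "limit": "25",
        })
        resp.raise_for_status()
        rows = resp.json() if resp.text.strip() else []
        if not rows:
            return None  # => 'fail-not-found'
        header, records = rows[0], rows[1:]
        ts_idx = header.index("timestamp")
        digest_idx = header.index("digest")
        for rec in records:
            if rec[digest_idx] == hex_to_b32(sha1_hex):
                return rec[ts_idx]  # full 14-digit timestamp => 'success-api'
        return None

On a hit, the script would emit the record with the rewritten URL in
`full_urls` and the corresponding status.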
Ran over a sample:

    cat files_20211007_shortts.10k_sample.json | ./fetch_full_cdx_ts.py > sample_out.json

    cat sample_out.json | jq .status | sort | uniq -c
          5 "fail-not-found"
        576 "success-api"
       7212 "success-db"
       2207 "success-self"

    zcat files_20211007_veryshortts.json.gz | head -n1000 | ./fetch_full_cdx_ts.py | jq .status | sort | uniq -c
          2 "fail-not-found"
        168 "success-api"
        208 "success-db"
        622 "success-self"

Investigating the "fail-not-found" cases, they look like http/https
not-exact-matches of the URL. Going to put off handling these for now, because
they are a small fraction and more delicate to fix.

Again with the broader set:

    cat files_20211007_moreshortts.10k_sample.json | ./fetch_full_cdx_ts.py > sample_out.json

    cat sample_out.json | jq .status | sort | uniq -c
          9 "fail-not-found"
        781 "success-api"
       6175 "success-db"
       3035 "success-self"

## Cleanup Process

Other possible cleanups to run at the same time, which would not require
external requests or other context:

- URL has ://archive.org/ link with rel=repository => rel=archive
- mimetype is bogus => clean mimetype
- bogus file => set some new extra field, like scope=stub or scope=partial (?)

It looks like the rel swap is already implemented in `generic_file_cleanups()`
(see the sketch below).
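The swap itself is simple; presumably something like this (an assumption about
the shape of `generic_file_cleanups()`, which is not reproduced here; file
entity `urls` entries are objects with `url` and `rel` keys):

    # hypothetical standalone version of the rel swap
    def fix_archive_org_rel(file_entity: dict) -> dict:
        for u in file_entity.get("urls", []):
            if "://archive.org/" in u.get("url", "") and u.get("rel") == "repository":
                u["rel"] = "archive"
        return file_entity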
From sampling, it seems like the mimetype issue is pretty small, so not going
to bite that off now. The "bogus file" issue requires more thought, so also
skipping it for now.

## Commands

Running with 8x parallelism so as not to break things; expecting some errors
along the way, and may need to add handlers for connection errors etc (see the
sketch below):

    zcat files_20211007_moreshortts.json.gz \
        | parallel -j8 --linebuffer --round-robin --pipe ./fetch_full_cdx_ts.py \
        | pv -l \
        | gzip \
        > files_20211007_moreshortts.fetched.json.gz

At ~300 records/sec, the roughly 10 million records should take around 9-10
hours to process.
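If connection errors do show up, one plausible shape for the handler (an
assumption, not the script's current behavior) is a shared `requests` session
with retry/backoff that all CDX API calls go through:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    def make_session() -> requests.Session:
        # retry transient failures (rate limiting, 5xx) with exponential backoff
        session = requests.Session()
        retries = Retry(
            total=5,
            backoff_factor=2.0,
            status_forcelist=[429, 500, 502, 503, 504],
        )
        session.mount("https://", HTTPAdapter(max_retries=retries))
        return session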
