Diffstat (limited to 'notes/cleanups/wayback_timestamps.md')
-rw-r--r-- | notes/cleanups/wayback_timestamps.md | 45 |
1 files changed, 45 insertions, 0 deletions
diff --git a/notes/cleanups/wayback_timestamps.md b/notes/cleanups/wayback_timestamps.md
new file mode 100644
index 00000000..c70ec5b2
--- /dev/null
+++ b/notes/cleanups/wayback_timestamps.md
@@ -0,0 +1,45 @@
+
+At some point, using the arabesque importer (from targeted crawling), we
+accidentally imported a bunch of files with wayback URLs that have 12-digit
+timestamps, instead of the full canonical 14-digit timestamps.
+
+
+## Prep (2021-11-04)
+
+Download most recent file export:
+
+    wget https://archive.org/download/fatcat_bulk_exports_2021-10-07/file_export.json.gz
+
+Filter to files with problem of interest:
+
+    zcat file_export.json.gz \
+      | pv -l \
+      | rg 'web.archive.org/web/\d{12}/' \
+      | gzip \
+      > files_20211007_shortts.json.gz
+    # 111M 0:12:35
+
+    zcat files_20211007_shortts.json.gz | wc -l
+    # 7,935,009
+
+    zcat files_20211007_shortts.json.gz | shuf -n10000 > files_20211007_shortts.10k_sample.json
+
+Wow, this is a lot more than I thought!
+
+## Fetch Complete URL
+
+Want to export JSON like:
+
+    file_entity
+        [existing file entity]
+    full_urls[]
+        <short>: <long>
+    status: str
+
+Status one of:
+
+- 'success-self': the file already has a fixed URL internally
+- 'success-db': lookup URL against sandcrawler-db succeeded, and SHA1 matched
+- 'success-cdx': CDX API lookup succeeded, and SHA1 matched
+- 'fail-hash': found a CDX record, but wrong hash
+- 'fail-not-found': no matching CDX record found
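
The status values listed in the note could be assigned per URL roughly as in the Python sketch below. This is not the script used for the cleanup. It assumes the file entity is a dict with a `sha1` field and a `urls` list of `{"url": ...}` objects. `resolve_short_url` is a hypothetical helper, and `lookup_db` / `lookup_cdx` are hypothetical callables standing in for the sandcrawler-db and CDX API lookups: each takes the 12-digit timestamp and the original URL, and returns `(14-digit timestamp, sha1)` or `None`.

```python
import re
from typing import Callable, Optional, Tuple

# Wayback URL with either a 12-digit (short) or 14-digit (full) timestamp.
WAYBACK_RE = re.compile(r"^(https?://web\.archive\.org/web/)(\d{12}|\d{14})/(.+)$")

# Hypothetical lookup interface: (short_ts, original_url) -> (full_ts, sha1) or None.
Lookup = Callable[[str, str], Optional[Tuple[str, str]]]


def resolve_short_url(
    file_entity: dict,
    short_url: str,
    lookup_db: Lookup,
    lookup_cdx: Lookup,
) -> Tuple[str, Optional[str]]:
    """Return (status, full_url) for one 12-digit-timestamp wayback URL."""
    m = WAYBACK_RE.match(short_url)
    if not m or len(m.group(2)) != 12:
        raise ValueError(f"not a short-timestamp wayback URL: {short_url}")
    prefix, short_ts, original_url = m.groups()

    # 'success-self': the entity already carries the full-timestamp variant.
    for u in file_entity.get("urls", []):
        other = WAYBACK_RE.match(u["url"])
        if (
            other
            and len(other.group(2)) == 14
            and other.group(2).startswith(short_ts)
            and other.group(3) == original_url
        ):
            return "success-self", u["url"]

    # Try sandcrawler-db first, then the CDX API; accept only a SHA1 match.
    found_wrong_hash = False
    for status, lookup in (("success-db", lookup_db), ("success-cdx", lookup_cdx)):
        hit = lookup(short_ts, original_url)
        if hit is None:
            continue
        full_ts, sha1 = hit
        if sha1 == file_entity["sha1"] and full_ts.startswith(short_ts):
            return status, f"{prefix}{full_ts}/{original_url}"
        # Assumption: a mismatch from either source counts toward 'fail-hash'.
        found_wrong_hash = True

    if found_wrong_hash:
        return "fail-hash", None
    return "fail-not-found", None
```

How the per-URL results roll up into the single file-level `status` and the `full_urls` mapping is left open here, since the note does not say how files with several short URLs are handled.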