author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-04 14:00:56 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-09 14:17:35 -0800 |
commit | 1927a7da466164010f0a6467f4df0c887ba00ad3 (patch) | |
tree | 53d6228a8cadb083942163585663acc275152830 /notes/cleanups/wayback_timestamps.md | |
parent | a6d994fbc18debcf3860e6deb12eb54234a42839 (diff) | |
start work on wayback short-timestamp cleanup
Diffstat (limited to 'notes/cleanups/wayback_timestamps.md')
-rw-r--r-- | notes/cleanups/wayback_timestamps.md | 45 |
1 files changed, 45 insertions, 0 deletions
diff --git a/notes/cleanups/wayback_timestamps.md b/notes/cleanups/wayback_timestamps.md
new file mode 100644
index 00000000..c70ec5b2
--- /dev/null
+++ b/notes/cleanups/wayback_timestamps.md
@@ -0,0 +1,45 @@
+
+At some point, using the arabesque importer (from targeted crawling), we
+accidentally imported a bunch of files with wayback URLs that have 12-digit
+timestamps, instead of the full canonical 14-digit timestamps.
+
+
+## Prep (2021-11-04)
+
+Download most recent file export:
+
+    wget https://archive.org/download/fatcat_bulk_exports_2021-10-07/file_export.json.gz
+
+Filter to files with the problem of interest:
+
+    zcat file_export.json.gz \
+      | pv -l \
+      | rg 'web.archive.org/web/\d{12}/' \
+      | gzip \
+      > files_20211007_shortts.json.gz
+    # 111M 0:12:35
+
+    zcat files_20211007_shortts.json.gz | wc -l
+    # 7,935,009
+
+    zcat files_20211007_shortts.json.gz | shuf -n10000 > files_20211007_shortts.10k_sample.json
+
+Wow, this is a lot more than I thought!
+
+## Fetch Complete URL
+
+Want to export JSON like:
+
+    file_entity
+        [existing file entity]
+    full_urls[]
+        <short>: <long>
+    status: str
+
+Status one of:
+
+- 'success-self': the file already has a fixed URL internally
+- 'success-db': lookup URL against sandcrawler-db succeeded, and SHA1 matched
+- 'success-cdx': CDX API lookup succeeded, and SHA1 matched
+- 'fail-hash': found a CDX record, but wrong hash
+- 'fail-not-found': no matching CDX record found
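
The "Fetch Complete URL" plan in the notes could be sketched roughly as below. This is a minimal sketch, not the cleanup script from the repository. It assumes file entities carry a hex `sha1` and a `urls` list of `{"url": ...}` objects. The function name `resolve_short_urls` and the two lookup callables are hypothetical stand-ins for the sandcrawler-db and CDX API queries, each assumed to return a `(full_timestamp, sha1_hex)` tuple or `None`. Reporting a single status per entity (first failure wins, otherwise the furthest lookup source used) is also an assumption, since the notes only say `status: str`.

```python
import re
from typing import Callable, Optional, Tuple

# Wayback URLs with the short 12-digit timestamp, and the canonical 14-digit form.
SHORT_TS_RE = re.compile(r"^(https?://web\.archive\.org/web/)(\d{12})/(.+)$")
FULL_TS_RE = re.compile(r"^(https?://web\.archive\.org/web/)(\d{14})/(.+)$")

# Hypothetical lookup signature (an assumption, not a real sandcrawler/CDX client):
# (original_url, short_timestamp) -> (full_timestamp, sha1_hex) or None
Lookup = Callable[[str, str], Optional[Tuple[str, str]]]

# Used to pick one overall status when several URLs are fixed from different sources.
SUCCESS_RANK = {"success-self": 0, "success-db": 1, "success-cdx": 2}


def _record(file_entity: dict, full_urls: dict, status: str) -> dict:
    """Build the output record in the shape described in the notes."""
    return {"file_entity": file_entity, "full_urls": full_urls, "status": status}


def resolve_short_urls(file_entity: dict, lookup_db: Lookup, lookup_cdx: Lookup) -> dict:
    urls = [u["url"] for u in file_entity.get("urls", [])]
    sha1 = file_entity.get("sha1")

    # Index 14-digit wayback URLs already on the entity by
    # (original URL, first 12 digits of the timestamp).
    known_full = {}
    for url in urls:
        m = FULL_TS_RE.match(url)
        if m:
            known_full[(m.group(3), m.group(2)[:12])] = url

    full_urls = {}
    status = "success-self"  # assumes the input was pre-filtered to have short URLs
    for url in urls:
        m = SHORT_TS_RE.match(url)
        if not m:
            continue
        prefix, short_ts, original = m.groups()

        # 'success-self': the entity already holds the full-timestamp URL.
        if (original, short_ts) in known_full:
            full_urls[url] = known_full[(original, short_ts)]
            continue

        hit = lookup_db(original, short_ts)
        if hit is not None and hit[1] == sha1:
            full_ts, source = hit[0], "success-db"
        else:
            hit = lookup_cdx(original, short_ts)
            if hit is None:
                return _record(file_entity, full_urls, "fail-not-found")
            if hit[1] != sha1:
                return _record(file_entity, full_urls, "fail-hash")
            full_ts, source = hit[0], "success-cdx"

        full_urls[url] = f"{prefix}{full_ts}/{original}"
        if SUCCESS_RANK[source] > SUCCESS_RANK[status]:
            status = source

    return _record(file_entity, full_urls, status)
```

Passing the lookups in as callables keeps the sketch testable without a database or network access. A real run would read one entity per line from `files_20211007_shortts.json.gz` and write one output record per line.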