summaryrefslogtreecommitdiffstats
path: root/notes/cleanups/wayback_timestamps.md
diff options
context:
space:
mode:
Diffstat (limited to 'notes/cleanups/wayback_timestamps.md')
-rw-r--r--notes/cleanups/wayback_timestamps.md45
1 files changed, 45 insertions, 0 deletions
diff --git a/notes/cleanups/wayback_timestamps.md b/notes/cleanups/wayback_timestamps.md
new file mode 100644
index 00000000..c70ec5b2
--- /dev/null
+++ b/notes/cleanups/wayback_timestamps.md
@@ -0,0 +1,45 @@
+
+At some point, using the arabesque importer (from targetted crawling), we
+accidentially imported a bunch of files with wayback URLs that have 12-digit
+timestamps, instead of the full canonical 14-digit timestamps.
+
+
+## Prep (2021-11-04)
+
+Download most recent file export:
+
+ wget https://archive.org/download/fatcat_bulk_exports_2021-10-07/file_export.json.gz
+
+Filter to files with problem of interest:
+
+ zcat file_export.json.gz \
+ | pv -l \
+ | rg 'web.archive.org/web/\d{12}/' \
+ | gzip \
+ > files_20211007_shortts.json.gz
+ # 111M 0:12:35
+
+ zcat files_20211007_shortts.json.gz | wc -l
+ # 7,935,009
+
+ zcat files_20211007_shortts.json.gz | shuf -n10000 > files_20211007_shortts.10k_sample.json
+
+Wow, this is a lot more than I thought!
+
+## Fetch Complete URL
+
+Want to export JSON like:
+
+ file_entity
+ [existing file entity]
+ full_urls[]
+ <short>: <long>
+ status: str
+
+Status one of:
+
+- 'success-self': the file already has a fixed URL internally
+- 'success-db': lookup URL against sandcrawler-db succeeded, and SHA1 matched
+- 'success-cdx': CDX API lookup succeeded, and SHA1 matched
+- 'fail-hash': found a CDX record, but wrong hash
+- 'fail-not-found': no matching CDX record found