Diffstat (limited to 'notes/crawl_cdx_merge.md')
-rw-r--r-- notes/crawl_cdx_merge.md | 15
1 files changed, 14 insertions, 1 deletions
diff --git a/notes/crawl_cdx_merge.md b/notes/crawl_cdx_merge.md
index a843a8d..1d744f5 100644
--- a/notes/crawl_cdx_merge.md
+++ b/notes/crawl_cdx_merge.md
@@ -1,6 +1,19 @@
-## Old Way
+## New Way
+
+Run script from scratch repo:
+
+    ~/scratch/bin/cdx_collection.py CRAWL-2000
+
+    zcat CRAWL-2000.cdx.gz | wc -l
+    # update crawl README/ANALYSIS/whatever
+
+Assuming we're just looking at PDFs:
+
+    zcat CRAWL-2000.cdx.gz | rg -i pdf | sort -u | gzip > CRAWL-2000.sorted.cdx.gz
+
+## Old Way
Use metamgr to export an items list.
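Not part of the commit: a minimal sketch of the counting and PDF-filtering steps from the "New Way" section, wrapped as shell functions so they can be reused across crawls. The function names `cdx_count` and `cdx_pdf_sorted` are hypothetical, and `grep -i` stands in for `rg -i` so the sketch runs without ripgrep installed. It assumes the `.cdx.gz` file has already been produced by `cdx_collection.py`.

```shell
# Hypothetical helper: print the number of lines in a gzipped CDX file.
cdx_count() {
    zcat "$1" | wc -l
}

# Hypothetical helper: keep lines containing "pdf" (case-insensitive),
# sort and deduplicate them, and write the result as a new gzipped file.
# grep -i is swapped in for the rg -i used in the notes.
cdx_pdf_sorted() {
    local in="$1" out="$2"
    zcat "$in" | grep -i pdf | sort -u | gzip > "$out"
}
```

Under these assumptions, `cdx_pdf_sorted CRAWL-2000.cdx.gz CRAWL-2000.sorted.cdx.gz` would correspond to the last command in the notes. As in the original pipeline, the match is on "pdf" anywhere in the line, so it may catch lines where the string appears in the URL as well as in the mimetype field.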