author: Bryan Newbold <bnewbold@archive.org> 2018-12-10 13:33:41 +0800
committer: Bryan Newbold <bnewbold@archive.org> 2018-12-10 13:33:41 +0800
commit: 6e8305e625f8b033d2697d40ed31ec15368678f9
tree: cec31f542750e922786a1e3bf8a6eb60529ab06e /notes
parent: 4736db1b1caca50a83bf7fb0d45e2e8d48d4e233
update notes
Diffstat (limited to 'notes')
-rw-r--r-- notes/crawl_cdx_merge.md | 15
1 files changed, 14 insertions, 1 deletions
diff --git a/notes/crawl_cdx_merge.md b/notes/crawl_cdx_merge.md
index a843a8d..1d744f5 100644
--- a/notes/crawl_cdx_merge.md
+++ b/notes/crawl_cdx_merge.md
@@ -1,6 +1,19 @@
-## Old Way
+## New Way
+
+Run script from scratch repo:
+
+    ~/scratch/bin/cdx_collection.py CRAWL-2000
+
+    zcat CRAWL-2000.cdx.gz | wc -l
+    # update crawl README/ANALYSIS/whatever
+
+Assuming we're just looking at PDFs:
+
+    zcat CRAWL-2000.cdx.gz | rg -i pdf | sort -u | gzip > CRAWL-2000.sorted.cdx.gz
+
+## Old Way
Use metamgr to export an items list.
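
Note on the PDF filter in the "New Way" section: `rg -i pdf` keeps any line that contains "pdf" anywhere, including HTML captures whose URL merely mentions it. A stricter variant might match on the mimetype column instead. The sketch below is only an illustration: it assumes the common space-separated CDX layout in which the mimetype is the 4th field, and `filter_pdf_cdx` is a hypothetical helper name, not something in the sandcrawler or scratch repos.

```shell
# Hypothetical helper (not part of the repo): keep only CDX rows whose
# mimetype column is exactly application/pdf, then sort and de-duplicate.
# ASSUMPTION: space-separated CDX lines with the mimetype in field 4.
filter_pdf_cdx() {
    in_file="$1"    # gzipped CDX input, e.g. CRAWL-2000.cdx.gz
    out_file="$2"   # gzipped, sorted, de-duplicated output

    # gzip -dc is equivalent to zcat; LC_ALL=C gives a stable byte-wise sort.
    gzip -dc "$in_file" \
        | awk '$4 == "application/pdf"' \
        | LC_ALL=C sort -u \
        | gzip > "$out_file"
}
```

Usage would mirror the one-liner in the notes: `filter_pdf_cdx CRAWL-2000.cdx.gz CRAWL-2000.sorted.cdx.gz`. Captures served with a different mimetype (for example `application/octet-stream`) would be dropped by this variant, so the looser `rg -i pdf` filter may still be preferable when recall matters more than precision.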