author | Bryan Newbold <bnewbold@archive.org> | 2018-12-10 13:33:41 +0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2018-12-10 13:33:41 +0800 |
commit | 6e8305e625f8b033d2697d40ed31ec15368678f9 (patch) | |
tree | cec31f542750e922786a1e3bf8a6eb60529ab06e /notes | |
parent | 4736db1b1caca50a83bf7fb0d45e2e8d48d4e233 (diff) | |
download | sandcrawler-6e8305e625f8b033d2697d40ed31ec15368678f9.tar.gz sandcrawler-6e8305e625f8b033d2697d40ed31ec15368678f9.zip |
update notes
Diffstat (limited to 'notes')
-rw-r--r-- | notes/crawl_cdx_merge.md | 15 |
1 files changed, 14 insertions, 1 deletions
diff --git a/notes/crawl_cdx_merge.md b/notes/crawl_cdx_merge.md
index a843a8d..1d744f5 100644
--- a/notes/crawl_cdx_merge.md
+++ b/notes/crawl_cdx_merge.md
@@ -1,6 +1,19 @@
-## Old Way
+## New Way
+
+Run script from scratch repo:
+
+    ~/scratch/bin/cdx_collection.py CRAWL-2000
+
+    zcat CRAWL-2000.cdx.gz | wc -l
+    # update crawl README/ANALYSIS/whatever
+
+Assuming we're just looking at PDFs:
+
+    zcat CRAWL-2000.cdx.gz | rg -i pdf | sort -u | gzip > CRAWL-2000.sorted.cdx.gz
+
+## Old Way
 Use metamgr to export an items list.
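
For readers without `rg` on hand, the PDF step added in this patch (`zcat ... | rg -i pdf | sort -u | gzip`) could be approximated in Python roughly as follows. This is only a sketch: `filter_pdf_cdx` is a hypothetical helper name and is not part of sandcrawler or `cdx_collection.py`. Like the shell version, it keeps any line containing "pdf" anywhere (URL, mimetype, or filename), which is a loose filter rather than a check of a specific CDX column. It works on bytes, so the output order should correspond to `LC_ALL=C sort -u`, which may differ from `sort -u` under other locales.

```python
import gzip


def filter_pdf_cdx(src_path, dst_path):
    """Hypothetical helper: keep CDX lines mentioning 'pdf', sorted and de-duplicated.

    Returns the number of lines written to dst_path.
    """
    with gzip.open(src_path, "rb") as src:
        # Loose, case-insensitive match anywhere on the line, like `rg -i pdf`.
        # A set drops exact duplicate lines, like the `-u` in `sort -u`.
        matches = {line.rstrip(b"\n") for line in src if b"pdf" in line.lower()}

    # Byte-wise ordering, comparable to sorting in the C locale.
    lines = sorted(matches)

    with gzip.open(dst_path, "wb") as dst:
        for line in lines:
            dst.write(line + b"\n")
    return len(lines)


if __name__ == "__main__":
    # File names follow the example crawl name used in the notes.
    count = filter_pdf_cdx("CRAWL-2000.cdx.gz", "CRAWL-2000.sorted.cdx.gz")
    print(f"wrote {count} unique PDF-ish CDX lines")
```

Because the whole set of matching lines is held in memory, this is only reasonable for modest CDX files; for very large crawls the streaming shell pipeline in the notes is likely the better fit.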