author | Bryan Newbold <bnewbold@archive.org> | 2019-05-09 17:47:58 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2019-05-09 17:47:58 -0700 |
commit | 9d518593633fac490b47f67544787454dc69f1bf (patch) | |
tree | 24fdaeb9086331b2020a67c3c66bf16c8212090e | |
parent | 27d149734439ee68738957df76cfb6f687b3f19b (diff) | |
clearer CDX munge notes
-rw-r--r-- | notes/crawl_cdx_merge.md | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/notes/crawl_cdx_merge.md b/notes/crawl_cdx_merge.md
index d2cffee..d330e9b 100644
--- a/notes/crawl_cdx_merge.md
+++ b/notes/crawl_cdx_merge.md
@@ -11,7 +11,7 @@ Run script from scratch repo:

 Assuming we're just looking at PDFs:

-    zcat CRAWL-2000.cdx.gz | rg -i pdf | sort -S 4G -u | gzip > CRAWL-2000.sorted.cdx.gz
+    zcat CRAWL-2000.cdx.gz | rg -i pdf | sort -S 4G -u > CRAWL-2000.sorted.cdx

 ## Old Way
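A minimal sketch of what the new form of the command does, run against a tiny made-up sample rather than a real crawl. The file name `SAMPLE.cdx.gz`, the temporary directory, and the sample lines are all hypothetical; `grep -i` stands in for `rg -i` so the sketch runs without ripgrep installed, `-S 4G` is dropped because the sample is tiny, and `LC_ALL=C` is an added assumption to make the sort order reproducible.

```shell
# Hypothetical sample data; real CDX lines carry more fields than this.
workdir="$(mktemp -d)"
printf '%s\n' \
  'org,example)/b.pdf 20190501000000 http://example.org/b.pdf application/pdf 200' \
  'org,example)/a.pdf 20190501000000 http://example.org/a.pdf application/pdf 200' \
  'org,example)/index.html 20190501000000 http://example.org/ text/html 200' \
  'org,example)/a.pdf 20190501000000 http://example.org/a.pdf application/pdf 200' \
  | gzip > "$workdir/SAMPLE.cdx.gz"

# Same shape as the "+" line in the diff: decompress, keep lines mentioning
# "pdf" (case-insensitive), sort with duplicates removed, write plain text.
# grep -i replaces rg -i here, and LC_ALL=C is added for a stable byte order.
zcat "$workdir/SAMPLE.cdx.gz" | grep -i pdf | LC_ALL=C sort -u > "$workdir/SAMPLE.sorted.cdx"

# Summarize the result: number of lines kept and the first sorted line.
line_count="$(wc -l < "$workdir/SAMPLE.sorted.cdx" | tr -d ' ')"
first_line="$(head -n 1 "$workdir/SAMPLE.sorted.cdx")"
echo "$line_count unique PDF lines"
```

The only change the commit makes is to the output: the sorted result is written uncompressed (`.sorted.cdx`) instead of being piped through `gzip`.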