## New Way
Run the script from the scratch repo:
    ~/scratch/bin/cdx_collection.py CRAWL-2000
    zcat CRAWL-2000.cdx.gz | wc -l
    # update crawl README/ANALYSIS/whatever
Assuming we're just looking at PDFs:
    zcat CRAWL-2000.cdx.gz | rg -i pdf | sort -S 4G -u > CRAWL-2000.sorted.cdx
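Note that `rg -i pdf` matches "pdf" anywhere in the CDX line, URLs included. A stricter sketch, assuming the common 11-column CDX layout where the mimetype is field 4 (that field position is an assumption; check the CDX header line for your crawl):
    # filter on the declared mimetype column instead of a substring match
    # (field $4 as mimetype is an assumption about the CDX flavor in use)
    zcat CRAWL-2000.cdx.gz \
        | awk '$4 == "application/pdf"' \
        | sort -S 4G -u > CRAWL-2000.pdf.sorted.cdx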
## Old Way
Use metamgr to export an items list.
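If metamgr isn't handy, a rough equivalent using the `ia` CLI might look like the following, assuming the crawl's items live in a collection literally named CRAWL-2000 (the collection name here is a guess):
    # hypothetical alternative to the metamgr export
    ia search 'collection:CRAWL-2000' --itemlist > CRAWL-2000.items
    wc -l CRAWL-2000.items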
Get all the CDX files and merge/sort:
    mkdir CRAWL-2000 && cd CRAWL-2000
    cat ../CRAWL-2000.items | shuf | parallel --bar -j6 ia download {} {}.cdx.gz
    ls */*.cdx.gz | parallel --bar -j1 zcat {} > CRAWL-2000.unsorted.cdx
    sort -S 4G -u CRAWL-2000.unsorted.cdx > CRAWL-2000.cdx
    wc -l CRAWL-2000.cdx
    rm CRAWL-2000.unsorted.cdx
    # gzip and upload to petabox, or send to HDFS, or whatever
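A minimal sketch of the gzip-and-upload step, assuming the merged CDX goes into its own petabox item (the item identifier and metadata below are made up for illustration):
    # compress, keeping the uncompressed copy until the upload is verified
    gzip -k CRAWL-2000.cdx
    # upload into a hypothetical item; adjust identifier/metadata to taste
    ia upload CRAWL-2000-cdx CRAWL-2000.cdx.gz --metadata="mediatype:data"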