## Old Way
Use metamgr to export an items list (here `CRAWL-2000.items`, one item per line).

Download each item's CDX file, then merge and sort them:

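    # Work in a per-crawl directory; the items list sits one level up
    mkdir CRAWL-2000 && cd CRAWL-2000
    # Fetch each item's CDX file, 6 downloads at a time, in shuffled order
    cat ../CRAWL-2000.items | shuf | parallel --bar -j6 ia download {} {}.cdx.gz
    # Decompress the files one at a time into a single unsorted CDX
    ls */*.cdx.gz | parallel --bar -j1 zcat {} > CRAWL-2000.unsorted.cdx
    # Sort and drop duplicate lines, then report the final line count
    sort -u CRAWL-2000.unsorted.cdx > CRAWL-2000.cdx
    wc -l CRAWL-2000.cdx
    rm CRAWL-2000.unsorted.cdx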
    # gzip and upload to petabox, or send to HDFS, or whatever
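
Before merging, it can be worth checking that every item in the list actually produced a CDX file, since a failed download would silently shrink the merged index. The sketch below is not part of the original workflow: `missing_cdx` is a hypothetical helper, and it assumes `ia download` wrote each file to `<item>/<item>.cdx.gz`, which is the layout the `ls */*.cdx.gz` step relies on.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original workflow): print every item
# from an items list whose CDX file is missing or empty.
# Assumed layout: <download_dir>/<item>/<item>.cdx.gz
missing_cdx() {
    local items_file="$1" download_dir="$2" item
    # The "|| [ -n ... ]" clause keeps a final line that lacks a trailing newline
    while IFS= read -r item || [ -n "$item" ]; do
        # Skip blank lines in the items list
        [ -n "$item" ] || continue
        # -s is true only when the file exists and is non-empty
        [ -s "$download_dir/$item/$item.cdx.gz" ] || printf '%s\n' "$item"
    done < "$items_file"
}
```

Run from inside `CRAWL-2000`, `missing_cdx ../CRAWL-2000.items . > missing.items` would list the items to retry. An empty `missing.items` suggests every download left a non-empty file, though it does not prove the gzip data is intact.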