## Old Way


Use metamgr to export the crawl's item list (here `CRAWL-2000.items`, one item identifier per line).

Download each item's CDX file, then concatenate, sort, and de-duplicate them:

    mkdir CRAWL-2000 && cd CRAWL-2000
    cat ../CRAWL-2000.items | shuf | parallel --bar -j6 ia download {} {}.cdx.gz
    ls */*.cdx.gz | parallel --bar -j1 zcat {} > CRAWL-2000.unsorted.cdx
    sort -u CRAWL-2000.unsorted.cdx > CRAWL-2000.cdx
    wc -l CRAWL-2000.cdx
    rm CRAWL-2000.unsorted.cdx

    # gzip and upload to petabox, or send to HDFS, or whatever
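
The merge and sort steps above can also be done in one streaming pass, which avoids writing the unsorted intermediate file. This is only a minimal sketch, assuming GNU coreutils, findutils, and gzip. The function name `merge_cdx` is hypothetical and not part of any tool mentioned in these notes. `LC_ALL=C` forces byte-order sorting, which is an assumption about what downstream CDX consumers expect; drop it to keep the locale-dependent ordering of the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original notes): merge every
# <item>/<file>.cdx.gz under a crawl directory into one sorted,
# de-duplicated CDX file, without an unsorted temporary file.
merge_cdx() {
    local crawl_dir="$1" out="$2"
    # find + xargs -0 copes with very large item counts, where a shell
    # glob like */*.cdx.gz could exceed the argument-length limit.
    # -r (GNU xargs) skips running zcat when no files are found.
    find "$crawl_dir" -mindepth 2 -maxdepth 2 -name '*.cdx.gz' -print0 \
        | xargs -0 -r zcat \
        | LC_ALL=C sort -u > "$out"
}
```

Run from inside the crawl directory after the downloads finish, this would be `merge_cdx . CRAWL-2000.cdx`, followed by the same `wc -l` sanity check.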