## Old Way
Use metamgr to export an items list.

Get all the CDX files and merge/sort:

    mkdir CRAWL-2000 && cd CRAWL-2000
    cat ../CRAWL-2000.items | shuf | parallel --bar -j6 ia download {} {}.cdx.gz
    ls */*.cdx.gz | parallel --bar -j1 zcat {} > CRAWL-2000.unsorted.cdx
    sort -u CRAWL-2000.unsorted.cdx > CRAWL-2000.cdx
    wc -l CRAWL-2000.cdx
    rm CRAWL-2000.unsorted.cdx
    # gzip and upload to petabox, or send to HDFS, or whatever
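
The merge/sort step can also be sketched as a single pipeline that skips the intermediate unsorted file. This is only an illustration: `merge_cdx` is a hypothetical helper name, not part of any existing tooling, and forcing `LC_ALL=C` (byte-order sorting) is an assumption about what downstream CDX consumers expect.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge gzipped CDX files into one sorted,
# de-duplicated file without writing an unsorted temp file.
# Usage: merge_cdx OUTPUT.cdx INPUT.cdx.gz [INPUT.cdx.gz ...]
merge_cdx() {
    local out="$1"
    shift
    # Decompress every input to stdout, then sort and drop duplicate lines.
    # LC_ALL=C is an assumption: it makes sort compare raw bytes, so the
    # result does not depend on the machine's locale.
    gzip -dc -- "$@" | LC_ALL=C sort -u > "$out"
}
```

Called as `merge_cdx CRAWL-2000.cdx */*.cdx.gz`, this should produce the same lines as the `zcat` and `sort -u` steps above, though the ordering can differ if the default locale is not `C`. With a very large number of items the glob may exceed the shell's argument-length limit, in which case the `parallel` form above is the safer choice.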