## New Way

Run the helper script from the scratch repo, then sanity-check the line count:

    ~/scratch/bin/cdx_collection.py CRAWL-2000
    zcat CRAWL-2000.cdx.gz | wc -l
    # update crawl README/ANALYSIS/whatever
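
The script wraps up the manual steps under "Old Way" below. If it isn't handy, a rough hand-rolled equivalent using the `ia` CLI might look like this (a sketch only, not the script itself; it assumes the crawl items live in a collection named after the crawl):

    # NOTE: sketch; collection name is an assumption
    ia search 'collection:CRAWL-2000' --itemlist > CRAWL-2000.items
    cat CRAWL-2000.items | parallel --bar -j6 ia download {} --glob='*.cdx.gz'
    zcat */*.cdx.gz | gzip > CRAWL-2000.cdx.gz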

Assuming we're just looking at PDFs:

    zcat CRAWL-2000.cdx.gz | rg -i pdf | sort -S 4G -u | gzip > CRAWL-2000.sorted.cdx.gz
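
Note that `rg -i pdf` is a loose substring match over the whole CDX line, so it also keeps rows whose URL merely contains "pdf". A stricter variant, assuming the classic 11-column CDX layout where the fourth field is the mimetype:

    # assumes classic 11-column CDX, mimetype in field 4
    zcat CRAWL-2000.cdx.gz | awk '$4 == "application/pdf"' | sort -S 4G -u | gzip > CRAWL-2000.sorted.cdx.gz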

## Old Way

Use metamgr to export an items list.

Download all the CDX files, then merge, sort, and de-duplicate:

    mkdir CRAWL-2000 && cd CRAWL-2000
    cat ../CRAWL-2000.items | shuf | parallel --bar -j6 ia download {} {}.cdx.gz
    ls */*.cdx.gz | parallel --bar -j1 zcat {} > CRAWL-2000.unsorted.cdx
    sort -S 4G -u CRAWL-2000.unsorted.cdx > CRAWL-2000.cdx
    wc -l CRAWL-2000.cdx
    rm CRAWL-2000.unsorted.cdx

    # gzip and upload to petabox, or send to HDFS, or whatever
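
For the petabox route, the gzip-and-upload step might look like this (the target item identifier here is hypothetical):

    # item identifier below is hypothetical
    gzip CRAWL-2000.cdx
    ia upload CRAWL-2000-cdx-merge CRAWL-2000.cdx.gz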