Wanted to include a large number of additional CDX lines selected by a regex
pattern on the URL. These are primarily .edu domains whose paths contain
something that looks like a user account *and* end in a .pdf file extension.
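
The actual regex is not recorded in these notes. As a rough sketch of the kind of heuristic described above, the following Python pattern matches URLs on a `.edu` host whose path has a user-account-like segment and ends in `.pdf`. The pattern, the `EDU_USER_PDF` name, and the choice of `~`, `user(s)/`, and `people/` as account markers are all assumptions for illustration, not what was run.

```python
import re

# Hypothetical pattern: NOT the regex used for the dump described in these notes.
EDU_USER_PDF = re.compile(
    r"^https?://[^/]*\.edu(:\d+)?/"      # host ends in .edu (optional port)
    r"(.*/)?(~|users?/|people/)[^/]+/"   # a path segment that looks like a user account
    r".*\.pdf$",                         # path ends with a .pdf extension
    re.IGNORECASE,
)


def looks_like_edu_user_pdf(url: str) -> bool:
    """Return True if the URL matches the illustrative .edu user-PDF heuristic."""
    return EDU_USER_PDF.match(url) is not None


if __name__ == "__main__":
    for url in (
        "http://www.cs.example.edu/~jsmith/papers/thesis.pdf",
        "http://example.edu/docs/report.pdf",
    ):
        print(url, looks_like_edu_user_pdf(url))
```

A pattern like this is deliberately loose. It ignores query strings and will miss account styles it does not list, so it is only a starting point.
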
## Commands

    aitio:/fast/gwb_pdfs
    pdfs/gwb-pdf-20191005172329-url-heuristics-edu
    pdfs/gwb-pdf-20191005172329-url-heuristics

To filter down to unique URL/SHA-1 combinations (the sort key spans fields 3 through 6):

    cat raw.cdx | sort -u -t' ' -k3,6 -S 4G > uniq.cdx

    cat gwb-pdf-20191005172329-url-heuristics-edu/part-r-000* | sort -u -t' ' -k3,6 -S 4G > gwb-pdf-20191005172329-url-heuristics-edu.uniq_url_sha1.cdx
    cat gwb-pdf-20191005172329-url-heuristics/part-r-000* | sort -u -t' ' -k3,6 -S 4G > gwb-pdf-20191005172329-url-heuristics.uniq_url_sha1.cdx

    7241795 gwb-pdf-20191005172329-url-heuristics-edu.uniq_url_sha1.cdx
    41137888 gwb-pdf-20191005172329-url-heuristics.uniq_url_sha1.cdx

    cut -d' ' -f6 gwb-pdf-20191005172329-url-heuristics-edu.uniq_url_sha1.cdx | sort -u -S 4G | wc -l
    7241795

    cut -d' ' -f6 gwb-pdf-20191005172329-url-heuristics.uniq_url_sha1.cdx | sort -u -S 4G | wc -l
    41137888

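
For small samples, the same filtering and counting can be sketched in Python. This assumes space-separated CDX lines in the common column order (SURT key, timestamp, URL, mimetype, status code, SHA-1, ...), which is what the `-k3,6` and `-f6` field numbers above suggest. The function names are made up for this example. Unlike `sort -u`, this version keeps the first line seen for each key and preserves input order.

```python
from typing import Iterable, Iterator


def uniq_url_sha1(lines: Iterable[str]) -> Iterator[str]:
    """Yield the first line seen for each distinct value of fields 3-6.

    Mirrors the key used by `sort -u -t' ' -k3,6` (URL through SHA-1),
    but does not sort the output.
    """
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        key = tuple(fields[2:6])  # 0-based slice covering fields 3..6
        if key not in seen:
            seen.add(key)
            yield line


def count_distinct_sha1(lines: Iterable[str]) -> int:
    """Count distinct values of field 6, like `cut -d' ' -f6 | sort -u | wc -l`."""
    return len({line.rstrip("\n").split(" ")[5] for line in lines})


if __name__ == "__main__":
    # Made-up sample lines, not taken from the real dump.
    sample = [
        "edu,example)/~a/x.pdf 20190101000000 http://example.edu/~a/x.pdf application/pdf 200 AAAA - - 100 0 f.warc.gz",
        "edu,example)/~a/x.pdf 20190202000000 http://example.edu/~a/x.pdf application/pdf 200 AAAA - - 100 9 g.warc.gz",
        "edu,example)/~b/y.pdf 20190101000000 http://example.edu/~b/y.pdf application/pdf 200 BBBB - - 100 0 f.warc.gz",
    ]
    unique = list(uniq_url_sha1(sample))
    print(len(unique), count_distinct_sha1(unique))
```

An in-memory set like this will not scale to tens of millions of lines as comfortably as `sort -S 4G`, which spills to disk, so the shell pipeline remains the practical choice for the full files.
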

    ./persist_tool.py cdx /fast/gwb_pdf/gwb-pdf-20191005172329-url-heuristics-edu.uniq_url_sha1.cdx
    Worker: Counter({'total': 7239153, 'insert-cdx': 6845283, 'update-cdx': 0})
    CDX lines pushed: Counter({'total': 7241795, 'pushed': 7239153, 'skip-parse': 2603, 'skip-mimetype': 39})

    ./persist_tool.py cdx /fast/gwb_pdf/gwb-pdf-20191005172329-url-heuristics.uniq_url_sha1.cdx
    Worker: Counter({'total': 41030360, 'insert-cdx': 22430064, 'update-cdx': 0})
    CDX lines pushed: Counter({'total': 41137888, 'pushed': 41030360, 'skip-mimetype': 87341, 'skip-parse': 20187})
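
As a sanity check on the counters above, the short script below confirms that pushed plus skipped lines add up to each file's line count, that the worker saw every pushed line, and what fraction of pushed lines ended up as `insert-cdx`. The numbers are copied from the output above. The `RUNS` and `summarize` names are just for this sketch.

```python
from collections import Counter

# Counter values copied verbatim from the persist_tool.py output above.
RUNS = {
    "url-heuristics-edu": {
        "worker": Counter({"total": 7239153, "insert-cdx": 6845283, "update-cdx": 0}),
        "pushed": Counter({"total": 7241795, "pushed": 7239153,
                           "skip-parse": 2603, "skip-mimetype": 39}),
    },
    "url-heuristics": {
        "worker": Counter({"total": 41030360, "insert-cdx": 22430064, "update-cdx": 0}),
        "pushed": Counter({"total": 41137888, "pushed": 41030360,
                           "skip-mimetype": 87341, "skip-parse": 20187}),
    },
}


def summarize(worker: Counter, pushed: Counter) -> dict:
    """Cross-check one run's counters and compute the inserted fraction."""
    skipped = sum(v for k, v in pushed.items() if k.startswith("skip-"))
    return {
        "lines_accounted_for": pushed["pushed"] + skipped == pushed["total"],
        "worker_saw_all_pushed": worker["total"] == pushed["pushed"],
        "insert_fraction": worker["insert-cdx"] / worker["total"],
    }


if __name__ == "__main__":
    for name, run in RUNS.items():
        print(name, summarize(run["worker"], run["pushed"]))
```

Both runs are internally consistent. About 94.6% of pushed lines from the `-edu` set were counted as `insert-cdx`, against about 54.7% for the larger set.
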