path: root/notes/tasks/2020-01-06_heuristic_cdx.txt

Wanted to include a large number of additional CDX lines selected by URL regex
patterns. These are primarily URLs on .edu domains whose paths look like user
accounts *and* end in a .pdf file extension.
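
As a rough illustration of that kind of URL heuristic, here is a minimal sketch. The regex and the name `looks_like_edu_user_pdf` are hypothetical; these notes do not record the actual patterns used for the job, so treat this only as an example of the shape of the filter (a .edu host, a user-account-looking path segment, and a .pdf suffix).

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern, NOT the one used for the real job: a "~user" segment,
# or a "people/<name>"-style segment, followed by a path ending in .pdf.
USER_PDF_PATH = re.compile(
    r"/(~[^/]+|(people|users|faculty|staff)/[^/]+)/.*\.pdf$",
    re.IGNORECASE,
)


def looks_like_edu_user_pdf(url: str) -> bool:
    """Return True if the URL is on a .edu host and its path matches the heuristic."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not host.endswith(".edu"):
        return False
    return USER_PDF_PATH.search(parts.path) is not None
```

Both conditions have to hold, matching the "*and*" above: a .pdf on a .edu host without a user-like path segment is rejected, as is a user-like .pdf path on a non-.edu host.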

## Commands

aitio:/fast/gwb_pdfs

  pdfs/gwb-pdf-20191005172329-url-heuristics-edu
  pdfs/gwb-pdf-20191005172329-url-heuristics


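To filter down to unique url/sha1 pairs (the sort key is fields 3 through 6,
i.e. from the URL column through the SHA-1 column):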

    cat raw.cdx | sort -u -t' ' -k3,6 -S 4G > uniq.cdx

    cat gwb-pdf-20191005172329-url-heuristics-edu/part-r-000* | sort -u -t' ' -k3,6 -S 4G > gwb-pdf-20191005172329-url-heuristics-edu.uniq_url_sha1.cdx
    cat gwb-pdf-20191005172329-url-heuristics/part-r-000* | sort -u -t' ' -k3,6 -S 4G > gwb-pdf-20191005172329-url-heuristics.uniq_url_sha1.cdx

    7241795  gwb-pdf-20191005172329-url-heuristics-edu.uniq_url_sha1.cdx
    41137888 gwb-pdf-20191005172329-url-heuristics.uniq_url_sha1.cdx
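
For reference, the `sort -u -t' ' -k3,6` step can be approximated in Python as below. This is a sketch, not what was actually run: the function name `uniq_key_fields` is made up here, the output keeps input order instead of sort order, and the set of keys lives entirely in memory, which may not be practical for the 41M-line file (the `sort -S 4G` invocation can spill to disk).

```python
from typing import Iterable, Iterator


def uniq_key_fields(lines: Iterable[str], start: int = 3, end: int = 6) -> Iterator[str]:
    """Yield the first line seen for each distinct value of fields start..end.

    Fields are space-separated and 1-indexed, mirroring `sort -t' ' -k3,6`.
    """
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        # Fields 3..6 (1-indexed) are the slice [2:6] in 0-indexed Python.
        key = tuple(fields[start - 1:end])
        if key not in seen:
            seen.add(key)
            yield line
```

Usage would be along the lines of `sys.stdout.writelines(uniq_key_fields(sys.stdin))`, with the part files piped in on stdin.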

    cut -d' ' -f6 gwb-pdf-20191005172329-url-heuristics-edu.uniq_url_sha1.cdx | sort -u -S 4G | wc -l
    7241795

    cut -d' ' -f6 gwb-pdf-20191005172329-url-heuristics.uniq_url_sha1.cdx | sort -u -S 4G | wc -l
    41137888

    ./persist_tool.py cdx /fast/gwb_pdf/gwb-pdf-20191005172329-url-heuristics-edu.uniq_url_sha1.cdx
    Worker: Counter({'total': 7239153, 'insert-cdx': 6845283, 'update-cdx': 0})
    CDX lines pushed: Counter({'total': 7241795, 'pushed': 7239153, 'skip-parse': 2603, 'skip-mimetype': 39})

    ./persist_tool.py cdx /fast/gwb_pdf/gwb-pdf-20191005172329-url-heuristics.uniq_url_sha1.cdx
    Worker: Counter({'total': 41030360, 'insert-cdx': 22430064, 'update-cdx': 0})
    CDX lines pushed: Counter({'total': 41137888, 'pushed': 41030360, 'skip-mimetype': 87341, 'skip-parse': 20187})
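
A quick consistency check on the logged counters: in each "CDX lines pushed" line, `pushed` plus the two skip counts should add up to `total`, and `total` should match the line count of the corresponding uniq file. The values below are copied from the log output above; the helper name `unaccounted` is just for this sketch.

```python
from collections import Counter

# Values copied verbatim from the two "CDX lines pushed" log lines.
edu = Counter({'total': 7241795, 'pushed': 7239153, 'skip-parse': 2603, 'skip-mimetype': 39})
full = Counter({'total': 41137888, 'pushed': 41030360, 'skip-mimetype': 87341, 'skip-parse': 20187})


def unaccounted(counts: Counter) -> int:
    """Lines in 'total' that no other counter key accounts for."""
    return counts['total'] - sum(v for k, v in counts.items() if k != 'total')


print(unaccounted(edu), unaccounted(full))  # prints "0 0"
```

Both runs balance exactly. The Worker counters show a separate gap that these notes do not explain: `total` minus `insert-cdx` is 393870 for the -edu set and 18600296 for the full set, with `update-cdx` at 0 in both.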