aboutsummaryrefslogtreecommitdiffstats
path: root/notes/match_filter_enrich.txt
blob: 0c1f7dfc9f6420279d1806abba29ed9e0481bc38 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31

This could all be a single scalding job eventually.

First, run matchcrossref and dumpfilemeta, and copy the output down to an SSD
somewhere.

    bnewbold@ia601101$ zcat 2018-09-14-0559.05-dumpfilemeta.tsv.gz | wc -l
    30728100

Reduce down the scored matches to just {sha1, dois}, sorted:

    zcat 2018-08-27-2352.17-matchcrossref.tsv.gz | ./filter_scored_matches.py | pv -l | sort -S 8G > 2018-08-27-2352.17-matchcrossref.filtered.tsv
    # 5.79M 0:18:54 [5.11k/s]

Join/merge the output:

    zcat 2018-09-14-0559.05-dumpfilemeta.tsv.gz | LC_ALL=C join -t$'\t' 2018-08-27-2352.17-matchcrossref.filtered.tsv - | pv -l | ./enrich_scored_matches.py | gzip > 2018-08-27-2352.17-matchcrossref.insertable.json.gz
    # 5.79M 0:09:09 [10.5k/s]

## Fatcat Insertable

I can't remember now what the plan was for the 'insertable' output mode, which
bundles {key, cdx, mime, and size} info along with the {slug, score, json1,
json2} columns from the regular match script. The filter_scored_matches.py
doesn't know what to do with those columns at the moment, and the output isn't
sorted by slug... need to tweak scripts to fix this.

In the meanwhile, as a work around just take the columns we want and re-sort:

    export LC_ALL=C
    zcat 2018-12-18-2237.09-matchcrossref.insertable.tsv.gz | cut -f2-5 | sort -S 8G -u | gzip > 2018-12-18-2237.09-matchcrossref.tsv.gz