This could all be a single scalding job eventually.
First, run matchcrossref and dumpfilemeta, and copy the output down to an SSD
somewhere.
bnewbold@ia601101$ zcat 2018-09-14-0559.05-dumpfilemeta.tsv.gz | wc -l
30728100
Reduce down the scored matches to just {sha1, dois}, sorted:
zcat 2018-08-27-2352.17-matchcrossref.tsv.gz | ./filter_scored_matches.py | pv -l | LC_ALL=C sort > 2018-08-27-2352.17-matchcrossref.filtered.tsv
# 5.79M 0:18:54 [5.11k/s]
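For illustration only, here is a minimal sketch of what a filter like this could look like. It is not the actual filter_scored_matches.py. The input layout (sha1, score, doi as tab-separated columns, already grouped by sha1), the integer scores, the MIN_SCORE threshold, and the comma-joined output are all assumptions made for the example.

```python
from itertools import groupby

MIN_SCORE = 50  # hypothetical cutoff; the real script's threshold is not stated here


def filter_matches(lines, min_score=MIN_SCORE):
    """Yield 'sha1<TAB>doi1,doi2' for each sha1 with at least one good match.

    Assumes rows look like 'sha1<TAB>score<TAB>doi' and that rows for the
    same sha1 are adjacent (groupby only merges consecutive rows).
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for sha1, group in groupby(rows, key=lambda row: row[0]):
        dois = []
        for row in group:
            if len(row) < 3:
                continue  # skip malformed rows
            score, doi = row[1], row[2]
            if int(score) >= min_score and doi not in dois:
                dois.append(doi)
        if dois:
            yield sha1 + "\t" + ",".join(dois)


# Small in-memory demo with made-up rows
sample = [
    "aaa\t90\t10.1/x\n",
    "aaa\t20\t10.1/y\n",
    "bbb\t10\t10.2/q\n",
]
for out_line in filter_matches(sample):
    print(out_line)
```

To use something like this in the pipeline above, feed it sys.stdin instead of the sample list and print each yielded line.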
Join/merge the output (both inputs must already be sorted on the sha1 column in C-locale byte order):
zcat 2018-09-14-0559.05-dumpfilemeta.tsv.gz | LC_ALL=C join -t$'\t' 2018-08-27-2352.17-matchcrossref.filtered.tsv - | pv -l | ./enrich_scored_matches.py | gzip > 2018-08-27-2352.17-matchcrossref.insertable.json.gz
# 5.79M 0:09:09 [10.5k/s]
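The join step is a sort-merge join on the first tab-separated field, which is why both inputs need the same byte-wise ordering. The sketch below shows the idea in Python. It is a simplification: it assumes each key appears at most once per input (join(1) also handles repeated keys), and the column contents in the demo are made up.

```python
def merge_join(left, right, sep="\t"):
    """Inner-join two iterables of lines on their first field.

    Both inputs must be sorted on that field. Python compares ASCII strings
    by code point, which matches LC_ALL=C byte order.
    Output is: key, remaining left fields, remaining right fields.
    """
    left_it, right_it = iter(left), iter(right)
    l_line = next(left_it, None)
    r_line = next(right_it, None)
    while l_line is not None and r_line is not None:
        l_key, _, l_rest = l_line.rstrip("\n").partition(sep)
        r_key, _, r_rest = r_line.rstrip("\n").partition(sep)
        if l_key == r_key:
            yield sep.join([l_key, l_rest, r_rest])
            l_line = next(left_it, None)
            r_line = next(right_it, None)
        elif l_key < r_key:
            l_line = next(left_it, None)  # left key has no partner; skip it
        else:
            r_line = next(right_it, None)  # right key has no partner; skip it


# Demo: filtered matches on the left, file metadata on the right (hypothetical columns)
matches = ["A1\tdoi-a\n", "C3\tdoi-c\n"]
meta = ["A1\t100\tapplication/pdf\n", "B2\t200\ttext/html\n", "C3\t300\tapplication/pdf\n"]
for joined in merge_join(matches, meta):
    print(joined)
```

Because the merge only ever moves forward through both inputs, a key that is out of order in either file is silently dropped rather than matched, which is the failure mode to watch for if the sort locale differs between steps.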