author | Bryan Newbold <bnewbold@archive.org> | 2018-09-14 19:18:14 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2018-09-14 19:18:14 -0700 |
commit | 710a0feab36f83eef21885ee7c23e5841cae1e87 (patch) | |
tree | 945d28de7c3c66ec9a3181fe9ce3389f1f12ff89 /notes | |
parent | 8e67baf622daa21ceca1b7cbf13f5461d9d8029a (diff) | |
match and enrich notes+script
Diffstat (limited to 'notes')
-rw-r--r-- | notes/match_filter_enrich.txt | 19 |
1 files changed, 19 insertions, 0 deletions
diff --git a/notes/match_filter_enrich.txt b/notes/match_filter_enrich.txt
new file mode 100644
index 0000000..e555d46
--- /dev/null
+++ b/notes/match_filter_enrich.txt
@@ -0,0 +1,19 @@
+
+This could all be a single scalding job eventually.
+
+First, run matchcrossref and dumpfilemeta, and copy the output down to an SSD
+somewhere.
+
+    bnewbold@ia601101$ zcat 2018-09-14-0559.05-dumpfilemeta.tsv.gz | wc -l
+    30728100
+
+Reduce down the scored matches to just {sha1, dois}, sorted:
+
+    zcat 2018-08-27-2352.17-matchcrossref.tsv.gz | ./filter_scored_matches.py | pv -l | sort > 2018-08-27-2352.17-matchcrossref.filtered.tsv
+    # 5.79M 0:18:54 [5.11k/s]
+
+Join/merge the output:
+
+    zcat 2018-09-14-0559.05-dumpfilemeta.tsv.gz | LC_ALL=C join -t$'\t' 2018-08-27-2352.17-matchcrossref.filtered.tsv - | pv -l | enrich_scored_matches.py | gzip > 2018-08-27-2352.17-matchcrossref.insertable.json.gz
+    # 5.79M 0:09:09 [10.5k/s]
+
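
The join/merge step in the notes relies on both inputs being sorted on the first tab-separated field in the same byte order that `LC_ALL=C join` expects. As a rough illustration of what that step does, here is a minimal Python sketch of a sort-merge join on the first column. It is not the code from this commit: the function names and the sample column values (a sha1-like key, a DOI, a size, a mimetype) are hypothetical, and the real column layout of the TSV files is not shown in the notes.

```python
from itertools import groupby


def key_of(line):
    """Return the first tab-separated field (the join key)."""
    return line.split("\t", 1)[0]


def rest_of(line):
    """Return everything after the first tab, or '' if there is none."""
    parts = line.split("\t", 1)
    return parts[1] if len(parts) > 1 else ""


def merge_join(left_lines, right_lines):
    """Yield joined rows for keys present in both inputs.

    Assumes both inputs are already sorted by key in code-point order,
    which for ASCII keys matches the byte order used by LC_ALL=C.
    Output rows are: key, remaining left fields, remaining right fields.
    """
    left = groupby((l.rstrip("\n") for l in left_lines), key=key_of)
    right = groupby((r.rstrip("\n") for r in right_lines), key=key_of)
    lk, lg = next(left, (None, None))
    rk, rg = next(right, (None, None))
    while lk is not None and rk is not None:
        if lk < rk:
            # Left key has no partner yet; advance the left side.
            lk, lg = next(left, (None, None))
        elif lk > rk:
            # Right key has no partner; advance the right side.
            rk, rg = next(right, (None, None))
        else:
            # Keys match: emit every left/right combination for this key.
            right_rows = [rest_of(r) for r in rg]
            for l in lg:
                for r in right_rows:
                    yield "\t".join([lk, rest_of(l), r])
            lk, lg = next(left, (None, None))
            rk, rg = next(right, (None, None))


if __name__ == "__main__":
    # Hypothetical sample rows; real files would be streamed line by line.
    matches = ["aaa\t10.1/x\n", "ccc\t10.1/z\n"]
    filemeta = ["aaa\t100\tapplication/pdf\n", "bbb\t5\tapplication/pdf\n"]
    for row in merge_join(matches, filemeta):
        print(row)
```

Because the merge only ever moves forward through each input, a key that is out of order on either side is silently skipped rather than matched, which is why the sort step and the join step need to agree on collation.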