From 75c4aa99448141ccb5f36528d3673e84f954e646 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 3 Jan 2019 14:00:14 -0800
Subject: match_filter_enrich notes

---
 notes/match_filter_enrich.txt | 12 ++++++++++++
 1 file changed, 12 insertions(+)
 (limited to 'notes')

diff --git a/notes/match_filter_enrich.txt b/notes/match_filter_enrich.txt
index 58d496b..0c9a2c3 100644
--- a/notes/match_filter_enrich.txt
+++ b/notes/match_filter_enrich.txt
@@ -17,3 +17,15 @@ Join/merge the output:
 
     zcat 2018-09-14-0559.05-dumpfilemeta.tsv.gz | LC_ALL=C join -t$'\t' 2018-08-27-2352.17-matchcrossref.filtered.tsv - | pv -l | ./enrich_scored_matches.py | gzip > 2018-08-27-2352.17-matchcrossref.insertable.json.gz
     # 5.79M 0:09:09 [10.5k/s]
+## Fatcat Insertable
+
+I can't remember now what the plan was for the 'insertable' output mode, which
+bundles {key, cdx, mime, and size} info along with the {slug, score, json1,
+json2} columns from the regular match script. The filter_scored_matches.py
+doesn't know what to do with those columns at the moment, and the output isn't
+sorted by slug... need to tweak scripts to fix this.
+
+In the meantime, as a workaround, just take the columns we want and re-sort:
+
+    export LC_ALL=C
+    zcat 2018-12-18-2237.09-matchcrossref.insertable.tsv.gz | cut -f2-5 | sort -u | gzip > 2018-12-18-2237.09-matchcrossref.tsv.gz
-- 
cgit v1.2.3
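
A rough way to sanity-check the workaround above before feeding its output to filter_scored_matches.py. This is only a sketch: the helper names `slim_matches` and `check_slim` are hypothetical, the toy rows are made up, and the column order of the insertable TSV is assumed (the notes only say that `cut -f2-5` keeps the wanted columns). It checks that the result is sorted under `LC_ALL=C` and that every row has exactly four tab-separated fields.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # byte-wise collation, same as the workaround

# Hypothetical helper: the same cut/sort steps as the workaround,
# reading TSV on stdin and writing TSV on stdout.
slim_matches() {
    cut -f2-5 | sort -u
}

# Hypothetical helper: succeeds only if stdin is sorted and every
# row has exactly 4 tab-separated fields.
check_slim() {
    local tmp rc=0
    tmp="$(mktemp)"
    cat > "$tmp"
    {
        sort -c "$tmp" &&
        awk -F'\t' 'NF != 4 { bad = 1 } END { exit bad }' "$tmp"
    } || rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Made-up 8-column rows standing in for the insertable TSV;
# rows 1 and 3 collapse into one after cut + sort -u.
toy_rows() {
    printf 'k1\tslug-b\t90\t{}\t{}\tcdx\tmime\t1\n'
    printf 'k2\tslug-a\t80\t{}\t{}\tcdx\tmime\t2\n'
    printf 'k3\tslug-b\t90\t{}\t{}\tcdx\tmime\t3\n'
}

toy_rows | slim_matches | check_slim && echo "slimmed output looks ok"
```

On the real dump the same check would be `zcat 2018-12-18-2237.09-matchcrossref.tsv.gz | check_slim`. Note that `sort -c` only confirms whole-line ordering, which matches slug ordering only if the slug really is the first kept column.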