match_filter_enrich notes

author: Bryan Newbold <bnewbold@archive.org> 2019-01-03 14:00:14 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2019-01-03 14:00:14 -0800
commit: 75c4aa99448141ccb5f36528d3673e84f954e646 (patch)
tree: 9b1ffb3b919cbe57c5ecb9aef5976b79eb903513 /notes
parent: 86fd041f106ee8efc2c68ee8792389ebb05ae9ef (diff)
download: sandcrawler-75c4aa99448141ccb5f36528d3673e84f954e646.tar.gz
sandcrawler-75c4aa99448141ccb5f36528d3673e84f954e646.zip
1 files changed, 12 insertions, 0 deletions
diff --git a/notes/match_filter_enrich.txt b/notes/match_filter_enrich.txt
index 58d496b..0c9a2c3 100644
--- a/notes/match_filter_enrich.txt
+++ b/notes/match_filter_enrich.txt
@@ -17,3 +17,15 @@ Join/merge the output:
     zcat 2018-09-14-0559.05-dumpfilemeta.tsv.gz | LC_ALL=C join -t$'\t' 2018-08-27-2352.17-matchcrossref.filtered.tsv - | pv -l | ./enrich_scored_matches.py | gzip > 2018-08-27-2352.17-matchcrossref.insertable.json.gz
     # 5.79M 0:09:09 [10.5k/s]
 
+## Fatcat Insertable
+
+I can't remember now what the plan was for the 'insertable' output mode, which
+bundles {key, cdx, mime, and size} info along with the {slug, score, json1,
+json2} columns from the regular match script. The filter_scored_matches.py
+script doesn't know what to do with those columns at the moment, and the
+output isn't sorted by slug, so the scripts need tweaking to fix this.
+
+In the meantime, as a workaround, just take the columns we want and re-sort:
+
+    export LC_ALL=C
+    zcat 2018-12-18-2237.09-matchcrossref.insertable.tsv.gz | cut -f2-5 | sort -u | gzip > 2018-12-18-2237.09-matchcrossref.tsv.gz
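
The workaround in the added note can be tried on toy data first. This is only a sketch: the column order (key first, then slug, score, json1, json2, then cdx, mime, size) is an assumption inferred from the `cut -f2-5` call, and `resort_matches` is a hypothetical helper name, not something in the sandcrawler scripts.

```shell
# Bytewise collation, so the sort order matches what `join` expects
# when it is also run under LC_ALL=C.
export LC_ALL=C

# Hypothetical helper: keep columns 2-5 (assumed to be slug, score,
# json1, json2), then sort and drop exact duplicate lines.
resort_matches() {
    cut -f2-5 | sort -u
}

# Three toy rows in the assumed 'insertable' layout; rows 1 and 3 differ
# only in the key and cdx columns, so they collapse after the cut.
printf 'k1\tslugb\t50\t{}\t{}\tcdx1\tapplication/pdf\t100\nk2\tsluga\t70\t{}\t{}\tcdx1\tapplication/pdf\t200\nk3\tslugb\t50\t{}\t{}\tcdx2\tapplication/pdf\t100\n' \
    | resort_matches
```

Because the extra columns are dropped before `sort -u`, rows that differed only in those columns become identical and are deduplicated, leaving one line per distinct {slug, score, json1, json2} tuple.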