path: root/notes/match_filter_enrich.txt
author Bryan Newbold <bnewbold@archive.org> 2019-02-01 15:13:32 -0800
committer Bryan Newbold <bnewbold@archive.org> 2019-02-01 15:13:32 -0800
commit 52967e05d2c8febdaa0426634fa987eaf5f58577
tree da12fd9c2f1ea3d517246a60dbc1467eb0ad748f /notes/match_filter_enrich.txt
parent 8901138485d1da4eb9a2512268faaa27fdf567c5
give sort way more RAM by default
Diffstat (limited to 'notes/match_filter_enrich.txt')
-rw-r--r-- notes/match_filter_enrich.txt 6
1 files changed, 3 insertions, 3 deletions
diff --git a/notes/match_filter_enrich.txt b/notes/match_filter_enrich.txt
index 0c9a2c3..0c1f7df 100644
--- a/notes/match_filter_enrich.txt
+++ b/notes/match_filter_enrich.txt
@@ -9,7 +9,7 @@ somewhere.
Reduce down the scored matches to just {sha1, dois}, sorted:
- zcat 2018-08-27-2352.17-matchcrossref.tsv.gz | ./filter_scored_matches.py | pv -l | sort > 2018-08-27-2352.17-matchcrossref.filtered.tsv
+ zcat 2018-08-27-2352.17-matchcrossref.tsv.gz | ./filter_scored_matches.py | pv -l | sort -S 8G > 2018-08-27-2352.17-matchcrossref.filtered.tsv
# 5.79M 0:18:54 [5.11k/s]
Join/merge the output:
@@ -25,7 +25,7 @@ json2} columns from the regular match script. The filter_scored_matches.py
doesn't know what to do with those columns at the moment, and the output isn't
sorted by slug... need to tweak scripts to fix this.
-In the meanwhile, as a work around just take the columns we want and resort:
+In the meanwhile, as a work around just take the columns we want and re-sort:
export LC_ALL=C
- zcat 2018-12-18-2237.09-matchcrossref.insertable.tsv.gz | cut -f2-5 | sort -u | gzip > 2018-12-18-2237.09-matchcrossref.tsv.gz
+ zcat 2018-12-18-2237.09-matchcrossref.insertable.tsv.gz | cut -f2-5 | sort -S 8G -u | gzip > 2018-12-18-2237.09-matchcrossref.tsv.gz
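A minimal sketch of what the added `-S` flag does, assuming GNU coreutils `sort` (where `-S` is the short form of `--buffer-size`). The sample rows and temporary file names are made up for illustration and are not part of the sandcrawler data. `gzip -dc` stands in for `zcat`, and 64M stands in for the 8G used in the commit.

```shell
#!/bin/sh
# Byte-order collation, matching the `export LC_ALL=C` line in the notes.
export LC_ALL=C

# Hypothetical scratch directory and sample data (illustration only).
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf 'b\t10.1/x\na\t10.1/y\nb\t10.1/x\n' | gzip > "$tmpdir/sample.tsv.gz"

# -S sets the size of sort's in-memory buffer; a larger buffer means fewer
# temporary spill files on big inputs. -u drops duplicate lines.
gzip -dc "$tmpdir/sample.tsv.gz" | sort -S 64M -u > "$tmpdir/sample.sorted.tsv"

cat "$tmpdir/sample.sorted.tsv"
```

On a multi-million-line input the buffer size mostly affects how often `sort` spills to temporary files. The sorted output should be the same whatever value is passed to `-S`.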