Diffstat (limited to 'extra/partition_dumps')
-rw-r--r-- extra/partition_dumps/README.md | 8
1 files changed, 4 insertions, 4 deletions
diff --git a/extra/partition_dumps/README.md b/extra/partition_dumps/README.md
index 5e42ff48..463bf42d 100644
--- a/extra/partition_dumps/README.md
+++ b/extra/partition_dumps/README.md
@@ -6,8 +6,8 @@ journal/container.
Example partitioning a sample by release type:
cat release_export_expanded_sample.json | jq .release_type -r > release_export_expanded_sample.release_type
- cat release_export_expanded_sample.release_type | sort | uniq -c | sort -nr > release_export_expanded_sample.release_type.counts
- cat release_export_expanded_sample.json | paste release_export_expanded_sample.release_type - | sort > out
+ cat release_export_expanded_sample.release_type | sort -S 4G | uniq -c | sort -S 500M -nr > release_export_expanded_sample.release_type.counts
+ cat release_export_expanded_sample.json | paste release_export_expanded_sample.release_type - | sort -S 4G > out
More production-y example using ISSN-L:
@@ -16,10 +16,10 @@ More production-y example using ISSN-L:
# it's a pretty huge sort, will need 300+ GB scratch space? this might not scale.
zcat release_export_expanded.json.gz | jq .container.issnl -r > release_export_expanded.issnl
- zcat release_export_expanded.json.gz | paste release_export_expanded.issnl - | sort | ./partition_script.py
+ zcat release_export_expanded.json.gz | paste release_export_expanded.issnl - | sort -S 8G | ./partition_script.py
# for verification/stats
- cat release_export_expanded.issnl | sort | uniq -c | sort -nr > release_export_expanded.issnl.counts
+ cat release_export_expanded.issnl | sort -S 1G | uniq -c | sort -S 1G -nr > release_export_expanded.issnl.counts
# cleanup
rm release_export_expanded.issnl
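
Note on the change: `-S` sets the size of the in-memory buffer `sort` uses before it spills to temporary files, so it affects speed and disk usage but not the sorted output. The sketch below runs the counting pipeline from the README on a tiny made-up sample. It assumes GNU coreutils `sort`; the sample values, the `64M` buffer size, the `SCRATCH` directory, and the use of `-T` and `LC_ALL=C` are illustrative choices, not part of this commit.

```shell
#!/bin/sh
# Sketch only: same sort | uniq -c | sort -nr shape as the README,
# with an explicit buffer size (-S) and scratch directory (-T).
set -eu

# Assumption: a throwaway temp dir stands in for a large scratch disk.
SCRATCH=$(mktemp -d)
SAMPLE="$SCRATCH/sample.release_type"

# Made-up stand-in for the output of `jq .release_type -r`.
printf '%s\n' article-journal book article-journal chapter article-journal book > "$SAMPLE"

# -S caps the in-memory buffer; -T chooses where spill files are written
# once the input no longer fits in that buffer.
# LC_ALL=C forces byte-wise comparison, independent of the user's locale.
counts=$(LC_ALL=C sort -S 64M -T "$SCRATCH" "$SAMPLE" \
    | uniq -c \
    | LC_ALL=C sort -S 64M -T "$SCRATCH" -nr)

# One line per distinct value, most frequent first.
printf '%s\n' "$counts"

rm -r "$SCRATCH"
```

For the full ISSN-L export, pointing `-T` at the volume with the 300+ GB of free space mentioned in the README comment would keep spill files off a small `/tmp`, if that turns out to be the bottleneck.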