This script "partitions" (splits up) a complete JSON dump by some key. For
example, it can split a release dump (JSON lines) into separate files, one per
journal/container.

Example: partitioning a sample by release type:

    # extract the partition key (release_type) for each line
    cat release_export_expanded_sample.json | jq .release_type -r > release_export_expanded_sample.release_type
    # count how many releases fall under each type
    cat release_export_expanded_sample.release_type | sort -S 4G | uniq -c | sort -S 500M -nr > release_export_expanded_sample.release_type.counts
    # prefix each JSON line with its key, then sort to group lines by type
    cat release_export_expanded_sample.json | paste release_export_expanded_sample.release_type - | sort -S 4G > out
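If jq and coreutils aren't handy, the counting step can be approximated in a
few lines of Python (a sketch; `count_key` is a hypothetical helper, not part
of this repo):

```python
import json
from collections import Counter

def count_key(stream, key="release_type"):
    """Tally how many JSON lines fall under each value of `key`.

    Roughly equivalent to the jq | sort | uniq -c pipeline above,
    but single-process and unsorted.
    """
    counts = Counter()
    for line in stream:
        doc = json.loads(line)
        counts[str(doc.get(key))] += 1
    return counts
```

`counts.most_common()` then gives the same "most frequent first" ordering as
the final `sort -nr`.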

A more production-y example, partitioning by container ISSN-L:

    # remove any previous output; the partition script appends to existing files
    rm -rf ./partitioned

    # this is a very large sort and may need 300+ GB of scratch space; it might not scale
    zcat release_export_expanded.json.gz | jq .container.issnl -r > release_export_expanded.issnl
    zcat release_export_expanded.json.gz | paste release_export_expanded.issnl - | sort -S 8G | ./partition_script.py

    # for verification/stats
    cat release_export_expanded.issnl | sort -S 1G | uniq -c | sort -S 1G -nr > release_export_expanded.issnl.counts
    
    # cleanup
    rm release_export_expanded.issnl