
Running through a full end-to-end re-indexing.


## Fatcat Metadata Dumps

Generate the dumps by following the fatcat notes (documented elsewhere).

Download to working machine:

    export JOBDIR=/kubwa/fatcat/2022-11-24
    mkdir -p $JOBDIR
    cd $JOBDIR
    wget -c https://archive.org/download/fatcat_bulk_exports_2022-11-24/release_export_expanded.json.gz
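
Since `wget -c` resumes partial downloads, a quick integrity check before starting the long jobs below can catch a truncated file. A minimal sketch; `check_gzip` is a hypothetical helper name, not part of fatcat or fatcat-scholar:

```shell
# Hypothetical helper: test that a gzip file decompresses cleanly.
# Prints a status line and returns non-zero on a corrupt or truncated file.
check_gzip() {
    if gzip -t "$1" 2>/dev/null; then
        echo "ok: $1"
    else
        echo "corrupt or truncated: $1" >&2
        return 1
    fi
}
```

Usage would be something like `check_gzip "$JOBDIR/release_export_expanded.json.gz"`. Note that this reads the whole file, so it takes a while on a large dump.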

## Microfilm

Working directory: `aitio:/fast/fatcat-scholar`. 

Pulled the latest git commit (`00d80752b7d83ae5a165540fbad641ddfc78b5f3`) and
ran `make dep`.

Run:

    TODAY=2022-12-08 make issue-db

Then, the SIM dump job, in parallel:

    export JOBDIR=/kubwa/scholar/2022-12-08
    mkdir -p $JOBDIR
    pipenv shell
    python -m fatcat_scholar.sim_pipeline run_print_issues \
        | shuf \
        | parallel -j16 --colsep "\t" python -m fatcat_scholar.sim_pipeline run_fetch_issue {1} {2} \
        | pv -l \
        | pigz \
        > $JOBDIR/sim_intermediate.2022-12-08.json.gz
    => 45.4M 42:09:42 [ 298 /s]
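
The `pv -l` summary above reads as line count, elapsed time, and rate: roughly 45.4 million lines in a bit over 42 hours. To cross-check the count against the finished file, something like the following should work (`count_lines_gz` is a hypothetical helper, not part of the pipeline):

```shell
# Hypothetical helper: count lines (one JSON document per line) in a gzip file.
count_lines_gz() {
    gzip -dc "$1" | wc -l
}
```

For example, `count_lines_gz "$JOBDIR/sim_intermediate.2022-12-08.json.gz"` should land close to the number `pv` reported.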

TODO: some old publications were included that should not be (gazetteers? registers?), for example:
    "Daily Gazetteer" (sim_daily-gazetteer)

## Works Bulk Fetch

First split up the release dump into chunks:

    export JOBDIR=/kubwa/scholar/2022-12-08
    mkdir -p $JOBDIR
    cd $JOBDIR
    zcat /kubwa/fatcat/2022-11-24/release_export_expanded.json.gz | split --lines 8000000 - release_export_expanded.split_ -d --additional-suffix .json
    => done

Note: more shards this time around (up to 23, not 21).
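
Because the shard count changes between runs, it may be worth checking the highest suffix before hardcoding the loop range in the fetch step. A rough sketch, assuming the `release_export_expanded.split_NN.json` naming produced by the `split` command above; `last_shard` is a hypothetical helper:

```shell
# Hypothetical helper: print the highest numeric shard suffix found in a directory.
# Prints nothing if no shard files are present.
last_shard() {
    ls "$1"/release_export_expanded.split_*.json 2>/dev/null \
        | sed -e 's/.*\.split_\([0-9]*\)\.json$/\1/' \
        | sort -n \
        | tail -n 1
}
```

Running `last_shard "$JOBDIR"` after the split gives the upper bound to use for `SHARD`.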

Starting the below commands on 2022-12-21.

    export JOBDIR=/kubwa/scholar/2022-12-08
    cd /fast/fatcat-scholar
    pipenv shell
    export TMPDIR=/sandcrawler-db/tmp
    # possibly re-export JOBDIR from above?

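    # fetch; with pipefail, a failure anywhere in the pipeline skips the final `mv`,
    # so an incomplete shard keeps its .WIP suffix
    set -u -o pipefail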
    for SHARD in {00..23}; do
        cat $JOBDIR/release_export_expanded.split_$SHARD.json \
            | parallel -j8 --line-buffer --compress --tmpdir $TMPDIR --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
            | pv -l \
            | pigz \
            > $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz.WIP \
            && mv $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz.WIP $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz
    done
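
Since the loop only renames a shard's output once its pipeline succeeds, leftover `.WIP` files or missing outputs mark shards that need a re-run. A sketch for listing them, assuming the file naming used in the loop above; `unfinished_shards` is a hypothetical helper:

```shell
# Hypothetical helper: print the suffix of every input shard whose output
# is missing or still has a .WIP file next to it.
unfinished_shards() {
    dir="$1"
    for input in "$dir"/release_export_expanded.split_*.json; do
        [ -e "$input" ] || continue          # glob matched nothing
        shard="${input##*.split_}"           # strip everything up to ".split_"
        shard="${shard%.json}"               # strip the ".json" extension
        out="$dir/fatcat_scholar_work_fulltext.split_$shard.json.gz"
        if [ ! -e "$out" ] || [ -e "$out.WIP" ]; then
            echo "$shard"
        fi
    done
}
```

`unfinished_shards "$JOBDIR"` prints one shard suffix per line. While the fetch loop is still running, the shard in progress and any not yet started will show up too.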