notes/2021-12_works.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189


## Bulk Fetch

Implemented HTTP sessions for postgrest updates, which could be a significant
performance improvement.

    export JOBDIR=/kubwa/scholar/2021-12-08
    mkdir -p $JOBDIR
    cd $JOBDIR
    zcat /kubwa/fatcat/2021-12-01/release_export_expanded.json.gz | split --lines 8000000 - release_export_expanded.split_ -d --additional-suffix .json

    cd /fast/fatcat-scholar
    git pull
    pipenv shell
    export TMPDIR=/sandcrawler-db/tmp
    # possibly re-export JOBDIR from above?

    # fetch
    set -u -o pipefail
    for SHARD in {00..21}; do
        cat $JOBDIR/release_export_expanded.split_$SHARD.json \
            | parallel -j8 --line-buffer --compress --tmpdir $TMPDIR --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
            | pv -l \
            | pigz \
            > $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz.WIP \
            && mv $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz.WIP $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz
    done

    # dump refs
    set -u -o pipefail
    for SHARD in {00..21}; do
        zcat $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz \
            | pv -l \
            | parallel -j12 --linebuffer --compress --tmpdir $TMPDIR --round-robin --pipe python -m fatcat_scholar.transform run_refs \
            | pigz \
            > $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.refs.json.gz.WIP \
            && mv $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.refs.json.gz.WIP $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.refs.json.gz
    done

This entire progress took almost weeks, from 2021-12-09 to 2021-12-28. There
were some delays (datacenter power outage) which cost a couple days. Reference
dumping took about two days, the fetching probably could have completed in 14
days if run smoothly (no interruptions). Overall not bad though. This size of
shard seems to work well.

## Upload to archive.org

    export JOBDIR=/kubwa/scholar/2021-12-08
    export BASENAME=scholar_corpus_bundle_2021-12-08
    for SHARD in {00..21}; do
        ia upload ${BASENAME}_split-${SHARD} $JOBDIR/fatcat_scholar_work_fulltext.split_${SHARD}.json.gz -m collection:"scholarly-tdm" --checksum
    done

    ia upload scholar_corpus_refs_2021-12-08 fatcat_scholar_work_fulltext.split_*.refs.json.gz -m collection:"scholarly-tdm" --checksum

## Indexing (including SIM pages)

Where and how are we going to index? Total size of new scholar index is estimated
to be 2.5 TByte (current index is 2.3 TByte). Remember that we split scholar
across `svc500` and `svc097`. `svc500` has the scholar primary shards and
`svc097` has the replica shards.

One proposal is to drop the replicas from `svc097` and start indexing there;
the machine would have no other indices so disruption would be minimal. Might
also point load balancer to `svc500` as the primary.

Steps:

- stop `scholar-index-docs-worker@*` on `svc097`
- update haproxy config to have `svc500` and scholar primary, `svc097` as backup
- create scholar indexes

Running these commands on `wbgrp-svc097`:

    http put ":9200/scholar_fulltext_v01_20211208?include_type_name=true" < schema/scholar_fulltext.v01.json

    http put ":9200/scholar_fulltext_v01_20211208/_settings" index.routing.allocation.include._name=wbgrp-svc097

    # first SIM pages
    ssh aitio.us.archive.org cat /kubwa/scholar/2021-12-01/sim_intermediate.2021-12-01.json.gz \
      | gunzip \
      | sudo -u fatcat parallel -j8 --compress --tmpdir /srv/tmp/ --line-buffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
      | pv -l \
      | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01_20211208 -type _doc \
      2> /tmp/error.txt 1> /tmp/output.txt
    => 41.8M 16:37:13 [ 698 /s]

    # then works
    ssh aitio.us.archive.org cat /kubwa/scholar/2021-12-08/fatcat_scholar_work_fulltext.split_{00..21}.json.gz \
        | gunzip \
        | sudo -u fatcat parallel -j8 --compress --tmpdir /srv/tmp/ --line-buffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | pv -l \
        | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01_20211208 -type _doc \
        2> /tmp/error.txt 1> /tmp/output.txt

Part way through there was a power outage, and had to continue.

    # 2022/01/14 17:07:59 indexing failed with 503 Service Unavailable: {"error":{"root_cause":[{"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE/2/no master];"}],"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE/2/no master];"},"status":503}
    # 112M 72:11:17 [ 433 /s]
    # this means 00..13 finished successfully

    ssh aitio.us.archive.org cat /kubwa/scholar/2021-12-08/fatcat_scholar_work_fulltext.split_{14..21}.json.gz \
        | gunzip \
        | sudo -u fatcat parallel -j8 --compress --tmpdir /srv/tmp/ --line-buffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
        | pv -l \
        | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01_20211208 -type _doc \
        2> /tmp/error2.txt 1> /tmp/output2.txt
    # ... following the above ...
    # 61.0M 40:51:18 [ 414 /s]

Changes while indexing:

- SIM indexing command failed at just 70k docs the first time, because of an
  issue with multiple publishers in item metadata. updated transform, ran a
  million pages through (to /dev/null) as testing, then restarted
- power outage happened, as noted above

Index size at this point (bulk indexing complete):

    http get :9200/_cat/indices | rg scholar
    green open scholar_fulltext_v01_20210128       OGyck2ppQhaTh6N-u87xSg 12 0  183923045 19144541   2.3tb   2.3tb
    green open scholar_fulltext_v01_20211208       _u2PE-oTRcSktI5mxDrQPg 12 0  212298463   553388     2tb     2tb

## Before/After Stats

Brainstorming:

- total, works, and `sim_pages` in index
- for works, break down of access types (SIM, web/archive.org)
- total public domain and public domain with access (new pre-1927 wall)
- sitemap size

Note: added commas to the below output, and summarized as "old" / "new" for the
two indices. remember that "old" index at this point had a couple months of
additional daily index results.

    http get :9200/scholar_fulltext_v01_20211208/_count | jq .count
    old: 183,923,045
    new: 212,298,463
         +28,375,418

    http get :9200/scholar_fulltext_v01_20210128/_count q=="doc_type:sim_page" | jq .count
    old: 10,448,586
    new: 40,559,249

    http get :9200/scholar_fulltext_v01_20210128/_count q=="doc_type:work" | jq .count
    old: 173,474,459
    new: 171,739,214

    http get :9200/scholar_fulltext_v01_20210128/_count q=="fulltext.access_type:*" | jq .count
    old: 44,357,232
    new: 74,840,284

    http get :9200/scholar_fulltext_v01_20210128/_count q=="fulltext.access_type:wayback" | jq .count
    old: 31,693,599
    new: 30,926,112
    new (final): 31,832,082

    http get :9200/scholar_fulltext_v01_20210128/_count q=="fulltext.access_type:ia_sim AND doc_type:work" | jq .count
    old:    51,118
    new: 1,189,974

    http get :9200/scholar_fulltext_v01_20210128/_count q=="fulltext.access_type:* AND year:<=1925" | jq .count
    old:  3,707,248
    new: 18,361,450

    http get :9200/scholar_fulltext_v01_20210128/_count q=="fulltext.access_type:* AND year:<=1927" | jq .count
    old:  3,837,502
    new: 18,850,927

    http get :9200/scholar_fulltext_v01_20211208/_count q=="fulltext.access_type:* AND doc_type:work AND year:<=1925" | jq .count
    old: 2,261,426
    new: 2,222,627

    http get :9200/scholar_fulltext_v01_20211208/_count q=="fulltext.access_type:* AND doc_type:work AND year:<=1927" | jq .count
    old: 2,288,760
    new: 2,268,425

    http get :9200/scholar_fulltext_v01_20210128/_count q=="fulltext.access_type:* AND doc_type:work" | jq .count
    old: 33,908,646
    new (final): 35,190,311

Sitemap size:

    cat sitemap-access-00*.txt | wc -l
    2021-06-23: 17,900,935
    2022-01-20: 23.9M 6:27:27 [1.03k/s]

    works 2022-01-20: 23.9M 6:28:50 [1.03k/s]