notes on re-indexing

author: Bryan Newbold <bnewbold@archive.org> 2022-01-20 15:02:18 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2022-01-20 15:02:18 -0800
commit: 6b8d310de42edab4d239de64de4007429a2b454e (patch)
tree: f5a2c1f0fa51d5830d42658d4e75fd9787623d1c /notes
parent: 694ecdf6ecf34f999bb4d862faef26a77aa970a1 (diff)
download: fatcat-scholar-6b8d310de42edab4d239de64de4007429a2b454e.tar.gz
fatcat-scholar-6b8d310de42edab4d239de64de4007429a2b454e.zip
1 files changed, 189 insertions, 0 deletions
diff --git a/notes/2021-12_works.md b/notes/2021-12_works.md
new file mode 100644
index 0000000..1361387
--- /dev/null
+++ b/notes/2021-12_works.md
@@ -0,0 +1,189 @@
+
+## Bulk Fetch
+
+Implemented HTTP sessions for postgrest updates, which could be a significant
+performance improvement.
+
+    export JOBDIR=/kubwa/scholar/2021-12-08
+    mkdir -p $JOBDIR
+    cd $JOBDIR
+    zcat /kubwa/fatcat/2021-12-01/release_export_expanded.json.gz | split --lines 8000000 - release_export_expanded.split_ -d --additional-suffix .json
+
+    cd /fast/fatcat-scholar
+    git pull
+    pipenv shell
+    export TMPDIR=/sandcrawler-db/tmp
+    # possibly re-export JOBDIR from above?
+
+    # fetch
+    set -u -o pipefail
+    for SHARD in {00..21}; do
+        cat $JOBDIR/release_export_expanded.split_$SHARD.json \
+            | parallel -j8 --line-buffer --compress --tmpdir $TMPDIR --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
+            | pv -l \
+            | pigz \
+            > $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz.WIP \
+            && mv $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz.WIP $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz
+    done
+
+    # dump refs
+    set -u -o pipefail
+    for SHARD in {00..21}; do
+        zcat $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.json.gz \
+            | pv -l \
+            | parallel -j12 --linebuffer --compress --tmpdir $TMPDIR --round-robin --pipe python -m fatcat_scholar.transform run_refs \
+            | pigz \
+            > $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.refs.json.gz.WIP \
+            && mv $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.refs.json.gz.WIP $JOBDIR/fatcat_scholar_work_fulltext.split_$SHARD.refs.json.gz
+    done
+
+This entire progress took almost weeks, from 2021-12-09 to 2021-12-28. There
+were some delays (datacenter power outage) which cost a couple days. Reference
+dumping took about two days, the fetching probably could have completed in 14
+days if run smoothly (no interruptions). Overall not bad though. This size of
+shard seems to work well.
+
+## Upload to archive.org
+
+    export JOBDIR=/kubwa/scholar/2021-12-08
+    export BASENAME=scholar_corpus_bundle_2021-12-08
+    for SHARD in {00..21}; do
+        ia upload ${BASENAME}_split-${SHARD} $JOBDIR/fatcat_scholar_work_fulltext.split_${SHARD}.json.gz -m collection:"scholarly-tdm" --checksum
+    done
+
+    ia upload scholar_corpus_refs_2021-12-08 fatcat_scholar_work_fulltext.split_*.refs.json.gz -m collection:"scholarly-tdm" --checksum
+
+## Indexing (including SIM pages)
+
+Where and how are we going to index? Total size of new scholar index is estimated
+to be 2.5 TByte (current index is 2.3 TByte). Remember that we split scholar
+across `svc500` and `svc097`. `svc500` has the scholar primary shards and
+`svc097` has the replica shards.
+
+One proposal is to drop the replicas from `svc097` and start indexing there;
+the machine would have no other indices so disruption would be minimal. Might
+also point load balancer to `svc500` as the primary.
+
+Steps:
+
+- stop `scholar-index-docs-worker@*` on `svc097`
+- update haproxy config to have `svc500` and scholar primary, `svc097` as backup
+- create scholar indexes
+
+Running these commands on `wbgrp-svc097`:
+
+    http put ":9200/scholar_fulltext_v01_20211208?include_type_name=true" < schema/scholar_fulltext.v01.json
+
+    http put ":9200/scholar_fulltext_v01_20211208/_settings" index.routing.allocation.include._name=wbgrp-svc097
+
+    # first SIM pages
+    ssh aitio.us.archive.org cat /kubwa/scholar/2021-12-01/sim_intermediate.2021-12-01.json.gz \
+      | gunzip \
+      | sudo -u fatcat parallel -j8 --compress --tmpdir /srv/tmp/ --line-buffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
+      | pv -l \
+      | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01_20211208 -type _doc \
+      2> /tmp/error.txt 1> /tmp/output.txt
+    => 41.8M 16:37:13 [ 698 /s]
+
+    # then works
+    ssh aitio.us.archive.org cat /kubwa/scholar/2021-12-08/fatcat_scholar_work_fulltext.split_{00..21}.json.gz \
+        | gunzip \
+        | sudo -u fatcat parallel -j8 --compress --tmpdir /srv/tmp/ --line-buffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
+        | pv -l \
+        | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01_20211208 -type _doc \
+        2> /tmp/error.txt 1> /tmp/output.txt
+
+Part way through there was a power outage, and had to continue.
+
+    # 2022/01/14 17:07:59 indexing failed with 503 Service Unavailable: {"error":{"root_cause":[{"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE/2/no master];"}],"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE/2/no master];"},"status":503}
+    # 112M 72:11:17 [ 433 /s]
+    # this means 00..13 finished successfully
+
+    ssh aitio.us.archive.org cat /kubwa/scholar/2021-12-08/fatcat_scholar_work_fulltext.split_{14..21}.json.gz \
+        | gunzip \
+        | sudo -u fatcat parallel -j8 --compress --tmpdir /srv/tmp/ --line-buffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
+        | pv -l \
+        | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01_20211208 -type _doc \
+        2> /tmp/error2.txt 1> /tmp/output2.txt
+    # ... following the above ...
+    # 61.0M 40:51:18 [ 414 /s]
+
+Changes while indexing:
+
+- SIM indexing command failed at just 70k docs the first time, because of an
+  issue with multiple publishers in item metadata. updated transform, ran a
+  million pages through (to /dev/null) as testing, then restarted
+- power outage happened, as noted above
+
+Index size at this point (bulk indexing complete):
+
+    http get :9200/_cat/indices | rg scholar
+    green open scholar_fulltext_v01_20210128       OGyck2ppQhaTh6N-u87xSg 12 0  183923045 19144541   2.3tb   2.3tb
+    green open scholar_fulltext_v01_20211208       _u2PE-oTRcSktI5mxDrQPg 12 0  212298463   553388     2tb     2tb
+
+## Before/After Stats
+
+Brainstorming:
+
+- total, works, and `sim_pages` in index
+- for works, break down of access types (SIM, web/archive.org)
+- total public domain and public domain with access (new pre-1927 wall)
+- sitemap size
+
+Note: added commas to the below output, and summarized as "old" / "new" for the
+two indices. remember that "old" index at this point had a couple months of
+additional daily index results.
+
+    http get :9200/scholar_fulltext_v01_20211208/_count | jq .count
+    old: 183,923,045
+    new: 212,298,463
+         +28,375,418
+
+    http get :9200/scholar_fulltext_v01_20210128/_count q=="doc_type:sim_page" | jq .count
+    old: 10,448,586
+    new: 40,559,249
+
+    http get :9200/scholar_fulltext_v01_20210128/_count q=="doc_type:work" | jq .count
+    old: 173,474,459
+    new: 171,739,214
+
+    http get :9200/scholar_fulltext_v01_20210128/_count q=="fulltext.access_type:*" | jq .count
+    old: 44,357,232
+    new: 74,840,284
+
+    http get :9200/scholar_fulltext_v01_20210128/_count q=="fulltext.access_type:wayback" | jq .count
+    old: 31,693,599
+    new: 30,926,112
+    new (final): 31,832,082
+
+    http get :9200/scholar_fulltext_v01_20210128/_count q=="fulltext.access_type:ia_sim AND doc_type:work" | jq .count
+    old:    51,118
+    new: 1,189,974
+
+    http get :9200/scholar_fulltext_v01_20210128/_count q=="fulltext.access_type:* AND year:<=1925" | jq .count
+    old:  3,707,248
+    new: 18,361,450
+
+    http get :9200/scholar_fulltext_v01_20210128/_count q=="fulltext.access_type:* AND year:<=1927" | jq .count
+    old:  3,837,502
+    new: 18,850,927
+
+    http get :9200/scholar_fulltext_v01_20211208/_count q=="fulltext.access_type:* AND doc_type:work AND year:<=1925" | jq .count
+    old: 2,261,426
+    new: 2,222,627
+
+    http get :9200/scholar_fulltext_v01_20211208/_count q=="fulltext.access_type:* AND doc_type:work AND year:<=1927" | jq .count
+    old: 2,288,760
+    new: 2,268,425
+
+    http get :9200/scholar_fulltext_v01_20210128/_count q=="fulltext.access_type:* AND doc_type:work" | jq .count
+    old: 33,908,646
+    new (final): 35,190,311
+
+Sitemap size:
+
+    cat sitemap-access-00*.txt | wc -l
+    2021-06-23: 17,900,935
+    2022-01-20: 23.9M 6:27:27 [1.03k/s]
+
+    works 2022-01-20: 23.9M 6:28:50 [1.03k/s]
author	Bryan Newbold <bnewbold@archive.org>	2022-01-20 15:02:18 -0800
committer	Bryan Newbold <bnewbold@archive.org>	2022-01-20 15:02:18 -0800
commit	6b8d310de42edab4d239de64de4007429a2b454e (patch)
tree	f5a2c1f0fa51d5830d42658d4e75fd9787623d1c /notes
parent	694ecdf6ecf34f999bb4d862faef26a77aa970a1 (diff)
download	fatcat-scholar-6b8d310de42edab4d239de64de4007429a2b454e.tar.gz fatcat-scholar-6b8d310de42edab4d239de64de4007429a2b454e.zip