author    Bryan Newbold <bnewbold@archive.org>  2020-08-12 14:50:00 -0700
committer Bryan Newbold <bnewbold@archive.org>  2020-08-12 14:50:00 -0700
commit    10c0de41d0e8be11527124f5ae7ec8d498ad85ac (patch)
tree      2458a7739c1ec9e180c20f53bb4d6de73a68ec84 /notes/scaling_works.md
parent    7603dd0ade23e22197acd1fd1d35962c314cf797 (diff)
more scaling notes
Diffstat (limited to 'notes/scaling_works.md')
-rw-r--r--  notes/scaling_works.md  97
1 file changed, 97 insertions(+), 0 deletions(-)
diff --git a/notes/scaling_works.md b/notes/scaling_works.md
index 82fd457..ac8645d 100644
--- a/notes/scaling_works.md
+++ b/notes/scaling_works.md
@@ -460,3 +460,100 @@ Performance seems to have diverged between the two instances, not sure why.
Maybe some query terms just randomly are faster on one instance or the other?
Eg, "wood"
+## 2020-08-07 Test Phrase Indexing
+
+Indexing 1 million papers twice, once with the old schema and once with the
+new one, to check the impact of phrase indexing in ES 7.x.
+
+ release_export.2019-07-07.5mil_fulltext.json.gz
+
+ git checkout 0c7a2ace5d7c5b357dd4afa708a07e3fa85849fd
+ http put ":9200/qa_scholar_fulltext_0c7a2ace?include_type_name=true" < schema/scholar_fulltext.v01.json
+ ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.20200723_two.json.gz \
+ | gunzip \
+ | head -n1000000 \
+ | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
+ | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_0c7a2ace -type _doc
+
+ # master branch, phrase indexing
+ git checkout 2c681e32756538c84b292cc95b623ee9758846a6
+ http put ":9200/qa_scholar_fulltext_2c681e327?include_type_name=true" < schema/scholar_fulltext.v01.json
+ ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.20200723_two.json.gz \
+ | gunzip \
+ | head -n1000000 \
+ | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
+ | esbulk -verbose -size 100 -id key -w 4 -index qa_scholar_fulltext_2c681e327 -type _doc
+
+ http get :9200/_cat/indices
+ [...]
+ green open qa_scholar_fulltext_0c7a2ace BQ9tH5OZT0evFCXiIJMdUQ 12 0 1000000 0 6.7gb 6.7gb
+ green open qa_scholar_fulltext_2c681e327 PgRMn5v-ReWzGlCTiP7b6g 12 0 1000000 0 9.5gb 9.5gb
+ [...]
+
+So phrase indexing means a roughly 42% larger index on disk, even with other
+changes to reduce size. We will probably approach 2 TB total index size.
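+
+As a quick sanity check on those numbers (a sketch only; the ~150M total work
+count is an estimate based on the shard record counts in the next section, not
+a measured value):
+
+    # observed above: 6.7 GB (old schema) vs 9.5 GB (phrase indexing) per 1M docs
+    python3 -c "print(9.5 / 6.7)"          # ~1.42, i.e. ~42% larger on disk
+    python3 -c "print(9.5 * 150 / 1000)"   # ~1.4 TB for ~150 million works at the same rate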
+
+ "to be or not to be"
+ => qa_scholar_fulltext_0c7a2ace: 65 Hits in 0.2sec (after repetitions)
+ => qa_scholar_fulltext_2c681e327: 65 Hits in 0.065sec
+
+ to be or not to be
+ => qa_scholar_fulltext_0c7a2ace: 87,586 Hits in 0.16sec
+ => qa_scholar_fulltext_2c681e327: 87,590 Hits in 0.16sec
+
+ "Besides all beneficial properties studied for various LAB, a special attention need to be pay on the possible cytotoxicity levels of the expressed bacteriocins"
+ => qa_scholar_fulltext_0c7a2ace: 1 Hits in 0.076sec
+ => qa_scholar_fulltext_2c681e327: 1 Hits in 0.055sec
+
+ "insect swarm"
+ => qa_scholar_fulltext_0c7a2ace: 4 Hits in 0.032sec
+ => qa_scholar_fulltext_2c681e327: 4 Hits in 0.024sec
+
+ "how to"
+ => qa_scholar_fulltext_0c7a2ace: 15,761 Hits in 0.11sec
+ => qa_scholar_fulltext_2c681e327: 15,763 Hits in 0.054sec
+
+Sort of splitting hairs at this scale, but it does seem like phrase indexing
+helps with some queries. Seems worth at least trying with the large/full index.
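+
+For context, the "phrase indexing" schema change is presumably Elasticsearch's
+`index_phrases` option on text fields, which indexes two-term shingles into a
+side field so exact phrase queries can avoid most position lookups, at the cost
+of a larger index (consistent with the ~42% size increase above). A minimal
+mapping sketch, with an illustrative field name rather than the exact layout of
+schema/scholar_fulltext.v01.json:
+
+    {
+      "mappings": {
+        "properties": {
+          "body": {
+            "type": "text",
+            "index_phrases": true
+          }
+        }
+      }
+    }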
+
+## 2020-08-07 Iterated Release Batch
+
+Sharded indexing:
+
+ zcat /fast/download/release_export_expanded.2020-08-05.json.gz | split --lines 25000000 - release_export_expanded.split_ -d --additional-suffix .json
+
+ export TMPDIR=/sandcrawler-db/tmp
+ for SHARD in {00..06}; do
+ cat /bigger/scholar/release_export_expanded.split_$SHARD.json \
+ | parallel -j8 --line-buffer --compress --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
+ | pv -l \
+ | pigz > /grande/snapshots/fatcat_scholar_work_fulltext.split_$SHARD.json.gz
+ done
+
+Record counts:
+
+ 24.7M 15:09:08 [ 452 /s]
+ 24.7M 16:11:22 [ 423 /s]
+ 24.7M 16:38:19 [ 412 /s]
+ 24.7M 17:29:46 [ 392 /s]
+ 24.7M 14:55:53 [ 459 /s]
+ 24.7M 15:02:49 [ 456 /s]
+ 2M 1:10:36 [ 472 /s]
+
+Have made transform code changes, now at git rev 7603dd0ade23e22197acd1fd1d35962c314cf797.
+
+Transform and index, on svc097 machine:
+
+ ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_*.json.gz \
+ | gunzip \
+ | head -n2000000 \
+ | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
+ | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc
+
+Derp, got a batch-size error. Let's try even smaller for the full batch:
+
+ ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_*.json.gz \
+ | gunzip \
+ | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
+ | esbulk -verbose -size 50 -id key -w 4 -index scholar_fulltext_v01 -type _doc
+
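+
+Once the full run finishes, doc counts and on-disk size can be spot-checked the
+same way as the QA indices above:
+
+    http get :9200/_cat/indices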