author:    Bryan Newbold <bnewbold@archive.org>    2020-08-17 14:44:23 -0700
committer: Bryan Newbold <bnewbold@archive.org>    2020-08-17 14:44:23 -0700
commit:    8ef87c02693a51b319249632e3219d8414ce8c13
tree:      0fffb2d44a2c4853d104f6c11dd1c493b32f80fa
parent:    a1c3d468cdc1ca9141a2cd16222644d4831324a4
more scaling notes
Diffstat (limited to 'notes/scaling_works.md')
-rw-r--r--    notes/scaling_works.md    55
1 file changed, 53 insertions(+), 2 deletions(-)
diff --git a/notes/scaling_works.md b/notes/scaling_works.md
index ac8645d..76acaa1 100644
--- a/notes/scaling_works.md
+++ b/notes/scaling_works.md
@@ -550,10 +550,61 @@ Transform and index, on svc097 machine:
| sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
| esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc
-Derp, got a batch-size error. Let's try even smaller for the full batch:
+Derp, got a batch-size error. But maybe it was just a single huge doc? Added a
+hack to skip transforming very large docs for now. In the future, should
+truncate specific fields (probably fulltext) instead.
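+
+The skip hack lives in the Python transform itself; a rough stream-level
+equivalent (untested) would be to drop oversized JSON lines before the
+transform stage, e.g.:
+
+    awk 'length($0) < 16384'   # pass through only lines smaller than 16 KiB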
+
+Ahah, the actual error was:
+
+ 2020/08/12 23:19:15 {"mapper_parsing_exception" "failed to parse field [biblio.issue_int] of type [short] in document with id 'work_aezuqrgnnfcezkkeoyonr6ll54'. Preview of field's value: '48844'" "" "" ""}
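+
+An Elasticsearch `short` is a signed 16-bit integer (max 32767), so an issue
+number of 48844 overflows the `biblio.issue_int` mapping. Untested sketch of
+scrubbing such values from the stream with jq, between the transform and
+esbulk stages (the real fix probably belongs in the transform or the mapping):
+
+    jq -c 'if (.biblio.issue_int // 0) > 32767 then del(.biblio.issue_int) else . end'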
+
+Full indexing:
ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_*.json.gz \
| gunzip \
| sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
- | esbulk -verbose -size 50 -id key -w 4 -index scholar_fulltext_v01 -type _doc
+ | pv -l \
+ | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
+ 2> /tmp/error.txt 1> /tmp/output.txt
+
+Started: 2020-08-12 14:24
+
+ 6.71M 2:46:56 [ 590 /s]
+
+Yikes, is this going to take 60 hours to index? CPU and disk seem to be
+basically maxed out, so don't think tweaking batch size or parallelism would
+help much.
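+
+Back-of-envelope with the final doc count (150232272, below), it's closer to 70:
+
+    echo $((150232272 / 590 / 3600))   # => 70 (hours at the observed rate)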
+
+NOTE: to resume a partial run, skip already-processed lines: tail -n +700000
+NOTE: could filter out oversized lines: awk 'length($0) < 16384'
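+
+Untested sketch splicing both of those notes into the front of the full
+pipeline (offset and size threshold as above):
+
+    ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_*.json.gz \
+        | gunzip \
+        | tail -n +700000 \
+        | awk 'length($0) < 16384' \
+        | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
+        | pv -l \
+        | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
+        2> /tmp/error.txt 1> /tmp/output.txt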
+
+Had some hardware (?) issue and had to restart.
+
+ ssh aitio.us.archive.org cat /grande/snapshots/fatcat_scholar_work_fulltext.split_{00..06}.json.gz \
+ | gunzip \
+ | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
+ | pv -l \
+ | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
+ 2> /tmp/error.txt 1> /tmp/output.txt
+
+ => 150M 69:00:35 [ 604 /s]
+
+ => green open scholar_fulltext_v01 2KrkdhuhRDa6SdNC36XR0A 12 0 150232272 130 1.3tb 1.3tb
+ => Filesystem Size Used Avail Use% Mounted on
+ => /dev/vda1 3.5T 1.4T 2.0T 42% /
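+
+Those status lines are `_cat/indices` and `df -h` output; to re-check later,
+something like:
+
+    curl -s 'localhost:9200/_cat/indices/scholar_fulltext_v01?v'
+    df -h /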
+
+ ssh aitio.us.archive.org cat /bigger/scholar_old/sim_intermediate.2020-07-23.json.gz \
+ | gunzip \
+ | sudo -u fatcat parallel -j8 --linebuffer --round-robin --pipe pipenv run python -m fatcat_scholar.transform run_transform \
+ | esbulk -verbose -size 100 -id key -w 4 -index scholar_fulltext_v01 -type _doc \
+ 2> /tmp/error.txt 1> /tmp/output.txt
+
+ => 2020/08/16 21:51:14 1895778 docs in 2h22m55.61416094s at 221.066 docs/s with 4 workers
+
+ => green open scholar_fulltext_v01 2KrkdhuhRDa6SdNC36XR0A 12 0 152090351 26071 1.3tb 1.3tb
+ => Filesystem Size Used Avail Use% Mounted on
+ => /dev/vda1 3.5T 1.4T 2.0T 42% /
+
+Stop Elasticsearch, `sync`, and restart, to ensure the index is fully flushed
+to disk. Some warm-up queries: "*", "blood", "to be or not to be".
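+
+Roughly (untested; assuming the service runs under systemd):
+
+    sudo systemctl stop elasticsearch
+    sync
+    sudo systemctl start elasticsearch
+
+    # warm-up queries against the new index
+    curl -s -G 'localhost:9200/scholar_fulltext_v01/_search' --data-urlencode 'q=*'
+    curl -s -G 'localhost:9200/scholar_fulltext_v01/_search' --data-urlencode 'q=blood'
+    curl -s -G 'localhost:9200/scholar_fulltext_v01/_search' --data-urlencode 'q="to be or not to be"'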