| author | Bryan Newbold <bnewbold@archive.org> | 2020-07-21 13:48:26 -0700 |
|---|---|---|
| committer | Bryan Newbold <bnewbold@archive.org> | 2020-07-21 13:48:26 -0700 |
| commit | b1d881fd752acbc9661c5924c0e9886545b05c6a (patch) | |
| tree | f755bbd276d587812b760165b194ae6ab40258f8 /notes | |
| parent | c183aff1713e02bddf28f591896872f143ace3f9 (diff) | |
scale-up notes
Diffstat (limited to 'notes')
| -rw-r--r-- | notes/scaling_works.md | 26 |
|---|---|---|

1 file changed, 26 insertions, 0 deletions
diff --git a/notes/scaling_works.md b/notes/scaling_works.md
new file mode 100644
index 0000000..814e46f
--- /dev/null
+++ b/notes/scaling_works.md
@@ -0,0 +1,26 @@
+
+Run a partial ~5 million paper batch through:
+
+    zcat /srv/fatcat_scholar/release_export.2019-07-07.5mil_fulltext.json.gz \
+        | parallel -j8 --linebuffer --round-robin --pipe python -m fatcat_scholar.work_pipeline run_releases \
+        | pv -l \
+        | gzip > data/work_intermediate.5mil.json.gz
+    => 5M 21:36:14 [64.3 /s]
+
+    # runs about 70 works/sec with this parallelism => 1mil in ~4hr, 5mil in ~20hr
+    # looks like seaweedfs is the bottleneck?
+    # tried stopping persist workers on seaweedfs and saw basically no change
+
+Indexing to ES seems to take roughly an hour per million; can check index
+monitoring to get a better number.
+
+## Work Grouping
+
+Plan for work-grouped expanded release dumps:
+
+Have the release identifier dump script include, and sort by, `work_id`. This will
+definitely slow down that stage; unclear if by too much. `work_id` is indexed.
+
+The bulk dump script iterates over this and makes per-work batches of releases to
+dump, passing a Vec to worker threads. Worker threads pass back a Vec of entities,
+which are then all printed (same work) sequentially.
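
The grouping step in the plan above can be illustrated with a short sketch. This is
a hypothetical illustration, not code from the fatcat repo: it assumes the
identifier dump arrives as a tab-separated stream of `release_ident` / `work_id`
pairs already sorted by `work_id`, and shows how contiguous rows with the same
`work_id` fold into one batch.

    // Hypothetical sketch (not from the fatcat codebase): fold a
    // work_id-sorted identifier stream into per-work batches. Assumes
    // tab-separated "<release_ident>\t<work_id>" lines on stdin.
    use std::io::{self, BufRead};

    fn main() -> io::Result<()> {
        let stdin = io::stdin();
        let mut current_work: Option<String> = None;
        let mut batch: Vec<String> = Vec::new();

        for line in stdin.lock().lines() {
            let line = line?;
            let mut cols = line.splitn(2, '\t');
            let ident = cols.next().unwrap_or("").to_string();
            let work_id = cols.next().unwrap_or("").to_string();

            // a new work_id closes out the previous batch; this only
            // works because the input is sorted by work_id
            if current_work.as_deref() != Some(work_id.as_str()) {
                if !batch.is_empty() {
                    handle_work_batch(&batch);
                    batch.clear();
                }
                current_work = Some(work_id);
            }
            batch.push(ident);
        }
        if !batch.is_empty() {
            handle_work_batch(&batch);
        }
        Ok(())
    }

    // stand-in: the real dump tool would hand the batch off to a worker
    // thread (see the channel sketch below)
    fn handle_work_batch(batch: &[String]) {
        println!("work batch of {} releases", batch.len());
    }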
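
For the worker fan-out itself, here is a minimal sketch of one way the Vec-passing
could look, using `std::thread` and `mpsc` channels. Everything in it is an
assumption for illustration: `fetch_release_json` is a stand-in for the real
expanded-entity lookup, and batches are fed inline rather than from the grouping
step above. Because a single consumer prints each returned Vec whole, entities
from the same work stay contiguous even though work-to-work order may interleave
across workers.

    // Hypothetical sketch: per-work batches of release idents fan out to
    // worker threads over a channel; workers send back a Vec of
    // serialized entities; one consumer prints each work contiguously.
    use std::sync::{mpsc, Arc, Mutex};
    use std::thread;

    // stand-in for the real expanded-release lookup/serialization
    fn fetch_release_json(ident: &str) -> String {
        format!("{{\"ident\": \"{}\"}}", ident)
    }

    fn main() {
        let (batch_tx, batch_rx) = mpsc::channel::<Vec<String>>();
        let (out_tx, out_rx) = mpsc::channel::<Vec<String>>();
        // mpsc receivers are single-consumer, so workers share one via a Mutex
        let batch_rx = Arc::new(Mutex::new(batch_rx));

        let mut workers = Vec::new();
        for _ in 0..8 {
            let rx = Arc::clone(&batch_rx);
            let tx = out_tx.clone();
            workers.push(thread::spawn(move || loop {
                // hold the lock just long enough to pull one batch
                let msg = rx.lock().unwrap().recv();
                let batch = match msg {
                    Ok(b) => b,
                    Err(_) => break, // sender dropped: no more batches
                };
                let entities: Vec<String> =
                    batch.iter().map(|i| fetch_release_json(i)).collect();
                if tx.send(entities).is_err() {
                    break;
                }
            }));
        }
        // drop main's copy so out_rx closes once all workers finish
        drop(out_tx);

        // feed a couple of fake work batches, then close the channel
        batch_tx.send(vec!["aaaa".into(), "bbbb".into()]).unwrap();
        batch_tx.send(vec!["cccc".into()]).unwrap();
        drop(batch_tx);

        // single consumer: a whole work's entities print sequentially
        for entities in out_rx {
            for line in entities {
                println!("{}", line);
            }
        }

        for w in workers {
            w.join().unwrap();
        }
    }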