diff options
author | Bryan Newbold <bnewbold@robocracy.org> | 2023-01-04 19:55:30 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2023-01-04 20:18:25 -0800 |
commit | 276ac2aa24166660bc6ffe7601cee44b5d848dae (patch) | |
tree | 8a35ce06e7ab9e6755b24abc41dee1115cf62788 /proposals/20191018_bigger_db.md | |
parent | ee46c33544941a5104182a2e221e841a32cbbf78 (diff) | |
download | fatcat-276ac2aa24166660bc6ffe7601cee44b5d848dae.tar.gz fatcat-276ac2aa24166660bc6ffe7601cee44b5d848dae.zip |
proposals: update status; add some old ones; consistent file names
Diffstat (limited to 'proposals/20191018_bigger_db.md')
-rw-r--r-- | proposals/20191018_bigger_db.md | 85 |
1 files changed, 0 insertions, 85 deletions
diff --git a/proposals/20191018_bigger_db.md b/proposals/20191018_bigger_db.md deleted file mode 100644 index 7a5216d0..00000000 --- a/proposals/20191018_bigger_db.md +++ /dev/null @@ -1,85 +0,0 @@ - -Status: brainstorm - -## Catalog Database Scaling - -How can we scale the fatcat backend to support: - -- one billion release entities -- 5 files, 1 webcapture, 1 fileset per release (average) -- 2 abstracts per release (average) -- 100 revisions per release -- average of 10 creators and 50 linked references per release revision - -Motivated by: -- 200 million paper works; 300 million releases -- 200 million books; 300 million editions -- 100 million greylit -- 100 million blog posts -- 100 million other web/platform things -=> 900 million releases, round to 100 million - -Want "abundance" for release edits, not concern about over-editing, thus the -100 reversion number. Break that down as: - -- 5 publisher metadata updates -- 3 updates of container/publisher -- 3 updates to merge under works -- 5 updates to fix release type, stage, license -- 10 other general metadata fixes (title, abstract, language, etc) -- 10 updates to add/fix external identifiers -- 20-50 = update per reference (linking) -- 10-20 = updates per contrib (linking) -=> 66-106 edits; round to 100 -=> almost no updates touch both reference and contribs -=> 1/3 to 1/2 of edits don't update either - -this would mean: - -- 1 billion release idents (10x current) -- 100 billion release revisions and edits (1000x current) -- 2 billion changelog entries (1000x current) -- 1 trillion creator rows (vastly larger) -- 5 trillion reference rows (vastly larger) - -based on current row sizes: -- release_ident: 77 GByte data, 140+ GByte index => 220+ GByte -- release_rev: 44 => 44 TByte -- contribs: 32 G => 32 TByte -- release_edit: 11 Gbyte => 11 TByte -- refs_blob: 77 G => 77 TByte (and maybe larger?) - -No table/index over 1 TByte? - -That's crazy for reference and contribs, unsustainable. Need to assume those -only get updated when actually updated, thus more like 10x per release: 3.2 and -7.7 TByte. - -Another way to estimate is from crossref dump size, which I think is now like -300 GBytes JSON uncompressed for ~100 million works with many references and -other metadata included. 1 billion would be about 3 TBytes. 100 edits would -mean 300 TBytes; 10 edits would mean 30 TBytes. - -What wants to be on an SSD? Just the most recent version. That would mean -closer to the 3 TByte size. Let's double that for other entities and hot -tables, then double again for indexes: 12 TBytes. Pretty big but doable. - -Roughly, 12 TBytes SSD, 30-100 TBytes nearline (spinning disk). Both need -replication. - -Curious to look at FoundationDB as overall solution; can different -tables/namespaces be on different storage backends? - -Cassandra probably an option for revision storage. And indexing? - -Merging edits and revisions into a single table/index could greatly reduce -index size (needed for, eg, history lookups). - -One plan would be: -- only index most recent versions of entities (contrib, refs, extids, etc), not all revs -- turn either (refs, contribs, abstracts) or entire release entities into - -TODO short term: -- try mass updates in QA: one pass to add release `ext_id` for all releases, - one pass to add release ref links to all releases. see what DB size looks - like. can be dummy data. |