Status: brainstorm

## Catalog Database Scaling

How can we scale the fatcat backend to support:

- one billion release entities
- 5 files, 1 webcapture, 1 fileset per release (average)
- 2 abstracts per release (average)
- 100 revisions per release
- average of 10 creators and 50 linked references per release revision

Motivated by:
- 200 million paper works; 300 million releases
- 200 million books; 300 million editions
- 100 million greylit
- 100 million blog posts
- 100 million other web/platform things
=> 900 million releases; round up to 1 billion
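
A quick back-of-envelope sum of those categories, sketched in Python (all the
per-category counts are the rough guesses from the list above, not measured
numbers):

```python
# Rough release-count motivation; per-category counts are in millions and are
# the guesses from the list above.
papers, books, greylit, blogs, other = 300, 300, 100, 100, 100
total_million = papers + books + greylit + blogs + other
print(total_million)  # 900 (million) -> round up to 1 billion release entities
```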

We want "abundance" for release edits, without worrying about over-editing,
hence the 100-revision number. Break that down as:

- 5 publisher metadata updates
- 3 updates to container/publisher
- 3 updates to merge under works
- 5 updates to fix release type, stage, and license
- 10 other general metadata fixes (title, abstract, language, etc)
- 10 updates to add/fix external identifiers
- 20-50 updates for reference linking
- 10-20 updates for contrib linking
=> 66-106 edits; round to 100
=> almost no updates touch both references and contribs
=> 1/3 to 1/2 of edits don't update either
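
The 66-106 range from that breakdown, checked in a small sketch (the per-type
counts are the rough guesses above):

```python
# Edits-per-release breakdown as (low, high) ranges, matching the list above.
edit_counts = {
    "publisher metadata":     (5, 5),
    "container/publisher":    (3, 3),
    "merge under works":      (3, 3),
    "type/stage/license":     (5, 5),
    "general metadata fixes": (10, 10),
    "external identifiers":   (10, 10),
    "reference linking":      (20, 50),
    "contrib linking":        (10, 20),
}
low = sum(lo for lo, _ in edit_counts.values())
high = sum(hi for _, hi in edit_counts.values())
print(f"{low}-{high} edits per release")  # 66-106 -> round to 100
```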

This would mean:

- 1 billion release idents (10x current)
- 100 billion release revisions and edits (1000x current)
- 2 billion changelog entries (1000x current)
- 1 trillion creator rows (vastly larger)
- 5 trillion reference rows (vastly larger)
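
Those row counts follow directly from the per-release assumptions at the top of
this document; a minimal derivation:

```python
# Projected row counts, derived from the target assumptions at the top.
releases         = 1_000_000_000   # 1 billion release idents
revs_per_release = 100             # the "abundance" editing budget
creators_per_rev = 10
refs_per_rev     = 50

release_revs = releases * revs_per_release      # 100 billion revisions (and edits)
creator_rows = release_revs * creators_per_rev  # 1 trillion contrib rows
ref_rows     = release_revs * refs_per_rev      # 5 trillion reference rows
print(f"{release_revs:,} revs; {creator_rows:,} contribs; {ref_rows:,} refs")
```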

Based on current row sizes:
- release_ident: 77 GByte data, 140+ GByte index => 220+ GByte
- release_rev: 44 GByte => 44 TByte
- contribs: 32 GByte => 32 TByte
- release_edit: 11 GByte => 11 TByte
- refs_blob: 77 GByte => 77 TByte (and maybe larger?)

No single table/index over 1 TByte?

That's crazy for references and contribs, and unsustainable. We need to assume
those are only stored when they actually change, which works out to more like 10
revisions per release: roughly 3.2 and 7.7 TByte.
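
A sketch of those size projections, scaling the rough current table sizes quoted
above by the assumed growth factors (the split of 1000x into 10x more releases
times 100 revisions each is an assumption of this sketch):

```python
# Current table sizes (rough figures quoted above), in GByte.
current_gbyte = {"release_rev": 44, "contribs": 32, "release_edit": 11, "refs_blob": 77}

# Naive growth: 10x more releases * 100 revisions each = 1000x, so GByte -> TByte.
naive_tbyte = {name: gb * 1000 / 1000 for name, gb in current_gbyte.items()}

# Revised: refs/contribs only re-stored on the ~10 revisions that actually
# change them, so 10x releases * 10 rewrites = 100x growth for those tables.
revised_tbyte = {name: current_gbyte[name] * 100 / 1000 for name in ("contribs", "refs_blob")}

print(naive_tbyte)    # release_rev -> 44 TByte, contribs -> 32, refs_blob -> 77, ...
print(revised_tbyte)  # contribs -> 3.2 TByte, refs_blob -> 7.7 TByte
```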

Another way to estimate is from the Crossref dump size, which I think is now
about 300 GBytes of JSON uncompressed for ~100 million works, with many
references and other metadata included. 1 billion releases would be about 3
TBytes. 100 edits would mean 300 TBytes; 10 edits would mean 30 TBytes.

What wants to be on an SSD? Just the most recent version of each entity. That
would mean closer to the 3 TByte size. Let's double that for other entities and
hot tables, then double again for indexes: 12 TBytes. Pretty big but doable.
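
Putting the Crossref-based extrapolation and the SSD sizing together (the 300
GByte dump size and the two doubling factors are the rough figures from the two
paragraphs above):

```python
# Crossref-dump extrapolation plus hot/SSD sizing, from the figures above.
crossref_dump_gbyte = 300                              # ~300 GBytes JSON for ~100 million works
one_billion_tbyte   = crossref_dump_gbyte * 10 / 1000  # ~3 TByte for 1 billion releases

all_100_edits_tbyte = one_billion_tbyte * 100          # ~300 TByte keeping 100 revisions
all_10_edits_tbyte  = one_billion_tbyte * 10           # ~30 TByte keeping 10 revisions

# SSD only holds the latest revision: start at ~3 TByte, double for other
# entities and hot tables, double again for indexes.
ssd_tbyte = one_billion_tbyte * 2 * 2
print(ssd_tbyte, all_10_edits_tbyte, all_100_edits_tbyte)  # 12.0 30.0 300.0
```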

Roughly: 12 TBytes of SSD and 30-100 TBytes of nearline (spinning disk)
storage. Both need replication.

Curious to look at FoundationDB as an overall solution; can different
tables/namespaces be put on different storage backends?

Cassandra is probably an option for revision storage. What about indexing?

Merging edits and revisions into a single table/index could greatly reduce
index size (needed for, e.g., history lookups).
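
As an illustration of that merging idea, a hypothetical combined row; the field
names are made up for this sketch and are not the existing fatcat schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReleaseEditRev:
    """Hypothetical merged edit+revision row: the edit bookkeeping and the
    revision content pointer live in one record, so a single index on
    (ident, changelog_index) can answer history lookups."""
    ident: str                        # stable release identifier
    changelog_index: int              # position in this entity's edit history
    editgroup_id: str                 # which editgroup made the change
    rev_id: str                       # points at (or inlines) the revision content
    prev_rev_id: Optional[str] = None # previous revision, if any
```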

One plan would be:
- only index the most recent versions of entities (contrib, refs, extids, etc), not all revs
- turn either (refs, contribs, abstracts) or entire release entities into

TODO, short term:
- try mass updates in QA: one pass to add a release `ext_id` for all releases,
  one pass to add release ref links to all releases. See what the DB size looks
  like; this can be dummy data. A rough sketch follows below.
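
A minimal sketch of that experiment, assuming a Postgres QA instance; the DSN,
table, and column names are placeholders (the real fatcat schema and identifier
format may differ), and the data is throwaway:

```python
import psycopg2  # assumes a Postgres QA database; the DSN below is a placeholder

# One mass pass that stamps a dummy external identifier onto every release
# revision, then reports the resulting table size. Table/column names are
# illustrative guesses, not the exact fatcat schema.
conn = psycopg2.connect("dbname=fatcat_qa")
with conn, conn.cursor() as cur:
    cur.execute(
        "UPDATE release_rev "
        "SET doi = '10.0000/dummy.' || id "
        "WHERE doi IS NULL"
    )
    cur.execute("SELECT pg_size_pretty(pg_total_relation_size('release_rev'))")
    print("release_rev total size:", cur.fetchone()[0])
```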