Status: notes

## Catalog Database Scaling

How can we scale the fatcat backend to support:

- one billion release entities
- 5 files, 1 webcapture, 1 fileset per release (average)
- 2 abstracts per release (average)
- 100 revisions per release
- average of 10 creators and 50 linked references per release revision

Motivated by:

- 200 million paper works; 300 million releases
- 200 million books; 300 million editions
- 100 million greylit
- 100 million blog posts
- 100 million other web/platform things

=> 900 million releases, round to 1 billion

Want "abundance" for release edits, not concern about over-editing, thus the
100 revision number. Break that down as:

- 5 publisher metadata updates
- 3 updates of container/publisher
- 3 updates to merge under works
- 5 updates to fix release type, stage, license
- 10 other general metadata fixes (title, abstract, language, etc)
- 10 updates to add/fix external identifiers
- 20-50 updates: one per reference (linking)
- 10-20 updates: one per contrib (linking)

=> 66-106 edits; round to 100
=> almost no updates touch both references and contribs
=> 1/3 to 1/2 of edits don't update either

This would mean:

- 1 billion release idents (10x current)
- 100 billion release revisions and edits (1000x current)
- 2 billion changelog entries (1000x current)
- 1 trillion creator rows (vastly larger)
- 5 trillion reference rows (vastly larger)

Based on current row sizes:

- release_ident: 77 GByte data, 140+ GByte index => 220+ GByte
- release_rev: 44 GByte => 44 TByte
- contribs: 32 GByte => 32 TByte
- release_edit: 11 GByte => 11 TByte
- refs_blob: 77 GByte => 77 TByte (and maybe larger?)

No table/index over 1 TByte? Those reference and contrib projections are
crazy, unsustainable. Need to assume those rows only get written when actually
updated, thus more like 10x per release: 3.2 and 7.7 TByte.

Another way to estimate is from the Crossref dump size, which I think is now
around 300 GBytes of uncompressed JSON for ~100 million works, with many
references and other metadata included. 1 billion works would be about
3 TBytes; 100 edits each would mean 300 TBytes, 10 edits would mean 30 TBytes.

What wants to be on an SSD? Just the most recent version. That would mean
closer to the 3 TByte size. Let's double that for other entities and hot
tables, then double again for indexes: 12 TBytes. Pretty big but doable.

Roughly: 12 TBytes SSD, 30-100 TBytes nearline (spinning disk). Both need
replication.

Curious to look at FoundationDB as an overall solution; can different
tables/namespaces be on different storage backends? Cassandra is probably an
option for revision storage. And indexing? Merging edits and revisions into a
single table/index could greatly reduce index size (needed for, eg, history
lookups).

One plan would be:

- only index the most recent versions of entities (contribs, refs, extids,
  etc), not all revs
- turn either (refs, contribs, abstracts) or entire release entities into

TODO short term:

- try mass updates in QA: one pass to add release `ext_id` for all releases,
  one pass to add release ref links to all releases. See what DB size looks
  like; can be dummy data (size-report helper sketched below).
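
Re-doing the sizing arithmetic above as a quick script, to keep the numbers
honest; everything here is copied from the rough estimates in these notes,
nothing is measured:

```python
# Back-of-envelope sizing, re-using the rough numbers from the notes above.
# Nothing here is measured; it just re-does the arithmetic.

TBYTE = 1000  # working in GBytes; 1 TByte = 1000 GBytes

# current table sizes (data + index) in GBytes, as listed above
current_gb = {
    "release_rev": 44,
    "contribs": 32,
    "release_edit": 11,
    "refs_blob": 77,
}

# naive projection: 10x the releases, 100 revisions each => ~1000x current rows
for table, gb in current_gb.items():
    print(f"{table}: {gb} GByte now -> ~{gb * 1000 / TBYTE:.0f} TByte at 1000x")

# assume refs/contribs rows are only rewritten when actually changed:
# ~10 touching edits per release, on 10x the releases => ~100x current size
print("contribs at 100x: ", current_gb["contribs"] * 100 / TBYTE, "TByte")   # 3.2
print("refs_blob at 100x:", current_gb["refs_blob"] * 100 / TBYTE, "TByte")  # 7.7

# Crossref-style estimate: ~300 GBytes of JSON for ~100 million works
latest_only = 300 * (1_000_000_000 / 100_000_000)  # ~3 TBytes of latest versions
print("latest versions only:", latest_only / TBYTE, "TByte")
print("with 100 revisions:  ", latest_only * 100 / TBYTE, "TByte")
print("with 10 revisions:   ", latest_only * 10 / TBYTE, "TByte")

# SSD budget: latest versions, doubled for other entities and hot tables,
# then doubled again for indexes
print("SSD budget:", latest_only * 2 * 2 / TBYTE, "TByte")
```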
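
For the "see what DB size looks like" part of the TODO, a minimal sketch of a
size-report helper that could be run before and after each QA pass. It only
uses standard PostgreSQL size functions; the DSN and the table list (taken
from the names used above) are placeholders to adjust against the real schema:

```python
# Print data/index/total sizes for the hot tables, to compare before and after
# a mass-update pass in QA. Uses standard PostgreSQL size functions only.
# The DSN and table names are placeholders; adjust to the actual schema.
import psycopg2

TABLES = ["release_ident", "release_rev", "release_edit", "refs_blob"]

def report_sizes(dsn="dbname=fatcat_qa"):
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            for table in TABLES:
                cur.execute(
                    "SELECT pg_size_pretty(pg_relation_size(%s)), "
                    "       pg_size_pretty(pg_indexes_size(%s)), "
                    "       pg_size_pretty(pg_total_relation_size(%s))",
                    (table, table, table),
                )
                data, index, total = cur.fetchone()
                print(f"{table}: data={data} index={index} total={total}")
    finally:
        conn.close()

if __name__ == "__main__":
    report_sizes()
```

Run once before the mass-update pass and once after; the delta is the number
we actually care about.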