Diffstat (limited to 'proposals')
-rw-r--r--  proposals/20190911_v04_schema_tweaks.md | 38
-rw-r--r--  proposals/20191018_bigger_db.md         | 81
2 files changed, 119 insertions, 0 deletions
diff --git a/proposals/20190911_v04_schema_tweaks.md b/proposals/20190911_v04_schema_tweaks.md
new file mode 100644
index 00000000..3d1e04c1
--- /dev/null
+++ b/proposals/20190911_v04_schema_tweaks.md
@@ -0,0 +1,38 @@

status: work-in-progress

Proposed schema changes for the next fatcat iteration (v0.4? v0.5?).

SQL (and API, and elasticsearch):

- container: `container_status` as a string enum: eg, "stub",
  "out-of-print"/"ended" (?), "active", "new"/"small" (?). Particularly to
  deal with disambiguation of multiple containers that share a title but have
  separate ISSN-Ls. For example, "The Lancet".
- release: `release_month` (to complement `release_date` and `release_year`)
- file: `file_scope` as a string enum indicating how much content this file
  includes. Eg, `book`, `chapter`, `article`/`work`, `issue`, `volume`,
  `abstract`, `component`. Unclear how to initialize this field; default to
  `article`/`work`?
- TODO: release: switch how pages work? Separate first/last fields?

API tweaks:

- add regex restrictions on more `ext_ids`, especially `wikidata_qid`
- add explicit enums for more keyword fields

API endpoints:

- `GET /auth/token/<editor_id>` endpoint to generate a new API token for a
  given editor. Used by the web interface, or by bot wranglers.
- a create-editor endpoint, to allow bot account creation
- `GET /editor/<ident>/bots` (?) endpoint to enumerate the bots wrangled by a
  specific editor

Elasticsearch schema:

- releases *may* need an "_all" field (or `biblio`?) containing most fields to
  make some search experiences work
- releases should include volume, issue, and pages
- releases *could* include reference and creator lists, as a faster/cheaper
  mechanism for doing reverse lookups

diff --git a/proposals/20191018_bigger_db.md b/proposals/20191018_bigger_db.md
new file mode 100644
index 00000000..cd5f6e7b
--- /dev/null
+++ b/proposals/20191018_bigger_db.md
@@ -0,0 +1,81 @@

How can we scale the fatcat backend to support:

- one billion release entities
- 5 files, 1 webcapture, and 1 fileset per release (average)
- 2 abstracts per release (average)
- 100 revisions per release
- an average of 10 creators and 50 linked references per release revision

Motivated by:

- 200 million paper works; 300 million releases
- 200 million books; 300 million editions
- 100 million greylit items
- 100 million blog posts
- 100 million other web/platform things
=> 900 million releases; round up to 1 billion

We want "abundance" for release edits, without worrying about over-editing;
hence the 100-revisions-per-release number. Break that down as:

- 5 publisher metadata updates
- 3 updates of container/publisher
- 3 updates to merge under works
- 5 updates to fix release type, stage, license
- 10 other general metadata fixes (title, abstract, language, etc)
- 10 updates to add/fix external identifiers
- 20-50 updates: one per reference (linking)
- 10-20 updates: one per contrib (linking)
=> 66-106 edits; round to 100
=> almost no updates touch both references and contribs
=> 1/3 to 1/2 of edits don't touch either

This would mean:

- 1 billion release idents (10x current)
- 100 billion release revisions and edits (1000x current)
- 2 billion changelog entries (1000x current)
- 1 trillion creator rows (vastly larger)
- 5 trillion reference rows (vastly larger)

Based on current row sizes (rough arithmetic sketched below):

- release_ident: 77 GByte data, 140+ GByte index => 220+ GByte
- release_rev: 44 GByte => 44 TByte
- contribs: 32 GByte => 32 TByte
- release_edit: 11 GByte => 11 TByte
- refs_blob: 77 GByte => 77 TByte (and maybe larger?)
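The projected sizes above are just current table sizes multiplied by row-count growth factors. For reference, a minimal Python sketch of that arithmetic; the current sizes are the ones listed above, while the 10x/100x/1000x multipliers are assumptions from this proposal, not measurements:

```python
# Back-of-envelope projection of fatcat table sizes at ~1 billion releases.
# Current sizes and multipliers come from the lists above; these are rough
# planning numbers, not measurements of a real deployment.

GBYTE = 10**9

# current on-disk footprint (data + indexes), in bytes
current = {
    "release_ident": 220 * GBYTE,  # 77 GByte data + 140+ GByte index
    "release_rev":    44 * GBYTE,
    "release_edit":   11 * GBYTE,
    "contribs":       32 * GBYTE,
    "refs_blob":      77 * GBYTE,
}

# naive growth: 10x more releases, 100 revisions each => 1000x rows for
# anything stored per revision; idents themselves only grow 10x
naive = {
    "release_ident": 10,
    "release_rev": 1000,
    "release_edit": 1000,
    "contribs": 1000,
    "refs_blob": 1000,
}

# refined assumption: refs/contribs only get copied when actually edited
# (roughly 10 of the 100 revisions), so 10x releases * 10 copies = 100x
refined = dict(naive, contribs=100, refs_blob=100)

for label, multipliers in [("naive (copy on every revision)", naive),
                           ("copy-on-update refs/contribs", refined)]:
    print(label)
    for table, size in current.items():
        tbytes = size * multipliers[table] / 10**12
        print(f"  {table:15s} ~{tbytes:5.1f} TByte")
```

Under the copy-on-update assumption for refs and contribs, this reproduces the 3.2 and 7.7 TByte figures used below.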
No single table or index over 1 TByte?

Those reference and contrib projections are crazy and unsustainable. Need to
assume that refs and contribs only get copied when they are actually edited,
which would be more like 10 copies per release instead of 100: 3.2 and 7.7
TByte.

Another way to estimate is from the Crossref dump size, which I think is now
around 300 GBytes of uncompressed JSON for ~100 million works, with many
references and other metadata included. 1 billion releases would then be about
3 TBytes; 100 edits each would mean 300 TBytes, and 10 edits would mean 30
TBytes.

What wants to be on an SSD? Just the most recent version of each entity. That
would mean something closer to the 3 TByte size. Let's double that for other
entities and hot tables, then double again for indexes: 12 TBytes. Pretty big
but doable.

Roughly: 12 TBytes of SSD, plus 30-100 TBytes of nearline (spinning disk)
storage. Both need replication.

Curious to look at FoundationDB as an overall solution; can different
tables/namespaces be put on different storage backends?

Cassandra is probably an option for revision storage. What about indexing?

Merging edits and revisions into a single table/index could greatly reduce
index size (needed for, eg, history lookups).

One plan would be:

- only index the most recent versions of entities (contribs, refs, extids,
  etc), not all revs
- turn either (refs, contribs, abstracts) or entire release entities into

TODO, short term:

- try mass updates in QA: one pass to add a release `ext_id` for all releases,
  one pass to add release ref links to all releases. See what the DB size
  looks like. This can be dummy data.
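For that short-term QA TODO, one way to "see what the DB size looks like" before and after the dummy mass-update passes would be to ask PostgreSQL directly for per-table footprints. A minimal sketch, assuming the backend stays on PostgreSQL; the table names follow this proposal's shorthand (so the real schema names may differ) and the connection string is a placeholder:

```python
# Sketch: report the on-disk size of the largest fatcat tables, e.g. before
# and after a QA mass-update pass. Assumes a PostgreSQL backend; table names
# follow this proposal's shorthand and the DSN is a placeholder, so both may
# need adjusting for the real schema.
import psycopg2

TABLES = [
    "release_ident",
    "release_rev",
    "release_edit",
    "release_contrib",  # assumed name for the "contribs" rows discussed above
    "refs_blob",
]

def report_table_sizes(dsn="dbname=fatcat_qa"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for table in TABLES:
            # pg_total_relation_size includes indexes and TOAST storage
            cur.execute(
                "SELECT pg_size_pretty(pg_total_relation_size(%s))",
                (table,),
            )
            print(f"{table:20s} {cur.fetchone()[0]}")

if __name__ == "__main__":
    report_table_sizes()
```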