Diffstat (limited to 'proposals')
 proposals/20190911_v04_schema_tweaks.md | 38
 proposals/20191018_bigger_db.md         | 81
 2 files changed, 119 insertions(+), 0 deletions(-)
diff --git a/proposals/20190911_v04_schema_tweaks.md b/proposals/20190911_v04_schema_tweaks.md
new file mode 100644
index 00000000..3d1e04c1
--- /dev/null
+++ b/proposals/20190911_v04_schema_tweaks.md
@@ -0,0 +1,38 @@
+
+status: work-in-progress
+
+Proposed schema changes for next fatcat iteration (v0.4? v0.5?).
+
+SQL (and API, and elasticsearch):
+
+- container: `container_status` as a string enum: eg, "stub",
+  "out-of-print"/"ended" (?), "active", "new"/"small" (?). Particularly for
+  disambiguating multiple containers with the same title but separate
+  ISSN-Ls; for example, "The Lancet". (Candidate values are sketched after
+  this list.)
+- release: `release_month` (to complement `release_date` and `release_year`)
+- file: `file_scope` as a string enum indicating how much content this file
+ includes. Eg, `book`, `chapter`, `article`/`work`, `issue`, `volume`,
+ `abstract`, `component`. Unclear how to initialize this field; default to
+ `article`/`work`?
+- TODO: release: switch how pages work? first/last?
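+
+A minimal sketch of the proposed string enums as validation code; the value
+lists come from the bullets above, but the constant and helper names are
+illustrative assumptions, not existing fatcat code:
+
+```python
+# Candidate enum values from this proposal; names here are illustrative only.
+CONTAINER_STATUS_VALUES = {"stub", "ended", "active", "new"}
+FILE_SCOPE_VALUES = {
+    "book", "chapter", "article", "issue", "volume", "abstract", "component",
+}
+
+def validate_enum(value, allowed, default=None):
+    """Return value if it is an allowed enum member, else fall back to default."""
+    if value is None:
+        return default
+    if value not in allowed:
+        raise ValueError(f"unexpected enum value: {value!r}")
+    return value
+
+# eg, defaulting file_scope to "article", as the proposal tentatively suggests:
+assert validate_enum(None, FILE_SCOPE_VALUES, default="article") == "article"
+```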
+
+API tweaks:
+
+- add regex restrictions on more `ext_ids`, especially `wikidata_qid` (see
+  the example after this list)
+- add explicit enums for more keyword fields
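+
+As an example of the kind of restriction intended: a Wikidata QID is "Q"
+followed by a non-zero-padded integer. The exact pattern fatcat would enforce
+is an assumption here:
+
+```python
+import re
+
+# Assumed pattern for Wikidata QIDs; not (yet) part of the fatcat schema.
+WIKIDATA_QID_RE = re.compile(r"^Q[1-9][0-9]*$")
+
+def check_wikidata_qid(qid: str) -> bool:
+    return bool(WIKIDATA_QID_RE.match(qid))
+
+assert check_wikidata_qid("Q42")
+assert not check_wikidata_qid("q42")
+assert not check_wikidata_qid("Q042")
+```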
+
+API endpoints:
+
+- `GET /auth/token/<editor_id>` endpoint to generate a new API token for the
+  given editor. Used by the web interface or by bot wranglers. (Usage
+  sketched after this list.)
+- create editor endpoint, to allow bot account creation
+- `GET /editor/<ident>/bots` (?) endpoint to enumerate bots wrangled by a
+ specific editor
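+
+None of these endpoints exist yet; a rough sketch of how they might be called,
+where the base URL, auth header, and response shape are assumptions and only
+the paths come from the list above:
+
+```python
+import requests
+
+API = "https://api.fatcat.wiki/v0"  # hypothetical placement of the new endpoints
+session = requests.Session()
+session.headers["Authorization"] = "Bearer <admin-or-wrangler-token>"
+
+# generate a fresh API token for an editor (web interface or bot wrangler use)
+resp = session.get(f"{API}/auth/token/<editor_id>")
+resp.raise_for_status()
+new_token = resp.json().get("token")  # response field name is an assumption
+
+# enumerate bots wrangled by a specific editor
+bots = session.get(f"{API}/editor/<ident>/bots").json()
+```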
+
+Elasticsearch schema:
+
+- releases *may* need an "_all" field (or `biblio`?) containing most fields,
+  to make some search experiences work (one possible mapping is sketched
+  below)
+- releases should include volume, issue, pages
+- releases *could* include reference and creator lists, as a faster/cheaper
+ mechanism for doing reverse lookups
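+
+One way to get an "_all"-style field without duplicating documents is
+Elasticsearch's `copy_to`; the `biblio` name is from the note above, while the
+exact field list and types are assumptions:
+
+```python
+# Fragment of a hypothetical release mapping (expressed as a Python dict).
+release_mapping_fragment = {
+    "properties": {
+        "biblio": {"type": "text"},
+        "title": {"type": "text", "copy_to": "biblio"},
+        "container_name": {"type": "text", "copy_to": "biblio"},
+        "contrib_names": {"type": "text", "copy_to": "biblio"},
+        "volume": {"type": "keyword", "copy_to": "biblio"},
+        "issue": {"type": "keyword", "copy_to": "biblio"},
+        "pages": {"type": "keyword", "copy_to": "biblio"},
+    }
+}
+```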
diff --git a/proposals/20191018_bigger_db.md b/proposals/20191018_bigger_db.md
new file mode 100644
index 00000000..cd5f6e7b
--- /dev/null
+++ b/proposals/20191018_bigger_db.md
@@ -0,0 +1,81 @@
+
+How can we scale the fatcat backend to support:
+
+- one billion release entities
+- 5 files, 1 webcapture, 1 fileset per release (average)
+- 2 abstracts per release (average)
+- 100 revisions per release
+- average of 10 creators and 50 linked references per release revision
+
+Motivated by:
+- 200 million paper works; 300 million releases
+- 200 million books; 300 million editions
+- 100 million greylit
+- 100 million blog posts
+- 100 million other web/platform things
+=> 900 million releases, round to 1 billion
+
+Want "abundance" for release edits, not concern about over-editing, thus the
+100 reversion number. Break that down as:
+
+- 5 publisher metadata updates
+- 3 updates of container/publisher
+- 3 updates to merge under works
+- 5 updates to fix release type, stage, license
+- 10 other general metadata fixes (title, abstract, language, etc)
+- 10 updates to add/fix external identifiers
+- 20-50 updates for reference linking (roughly one per linked reference)
+- 10-20 updates for contrib linking (roughly one per contrib)
+=> 66-106 edits; round to 100
+=> almost no updates touch both references and contribs
+=> 1/3 to 1/2 of edits don't update either
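+
+A quick sanity check of the two breakdowns above, using only the numbers
+already listed:
+
+```python
+# release count motivation: paper releases + book editions + greylit + blogs + other
+releases = 300e6 + 300e6 + 100e6 + 100e6 + 100e6
+assert releases == 900e6  # rounds up to the 1 billion target
+
+# edits per release: fixed categories plus the linking ranges
+fixed_edits = 5 + 3 + 3 + 5 + 10 + 10   # = 36
+low = fixed_edits + 20 + 10             # = 66
+high = fixed_edits + 50 + 20            # = 106
+assert (low, high) == (66, 106)         # rounded to ~100 revisions per release
+```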
+
+this would mean:
+
+- 1 billion release idents (10x current)
+- 100 billion release revisions and edits (1000x current)
+- 2 billion changelog entries (1000x current)
+- 1 trillion creator rows (vastly larger)
+- 5 trillion reference rows (vastly larger)
+
+based on current row sizes:
+- release_ident: 77 GByte data, 140+ GByte index => 220+ GByte
+- release_rev: 44 GByte => 44 TByte
+- contribs: 32 GByte => 32 TByte
+- release_edit: 11 GByte => 11 TByte
+- refs_blob: 77 GByte => 77 TByte (and maybe larger?)
+
+Ideally no single table or index would be over 1 TByte?
+
+Those reference and contrib sizes are unsustainable. Need to assume those rows
+only get rewritten when actually updated, thus more like 10 copies per release
+(100x current): 3.2 and 7.7 TByte.
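+
+Back-of-the-envelope arithmetic behind those table sizes, using the current
+sizes listed above and the naive multipliers:
+
+```python
+GB, TB = 1e9, 1e12
+
+# current table sizes from the list above
+release_rev_now, contribs_now, refs_now = 44 * GB, 32 * GB, 77 * GB
+
+# 10x more releases with 100 revisions each => ~1000x for revision-shaped tables
+print(release_rev_now * 1000 / TB)  # 44.0 TByte
+print(contribs_now * 1000 / TB)     # 32.0 TByte
+print(refs_now * 1000 / TB)         # 77.0 TByte
+
+# if refs/contribs are only copied on the ~10 edits that touch them (100x current):
+print(contribs_now * 100 / TB)      # 3.2 TByte
+print(refs_now * 100 / TB)          # 7.7 TByte
+```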
+
+Another way to estimate is from crossref dump size, which I think is now like
+300 GBytes JSON uncompressed for ~100 million works with many references and
+other metadata included. 1 billion would be about 3 TBytes. 100 edits would
+mean 300 TBytes; 10 edits would mean 30 TBytes.
+
+What wants to be on an SSD? Just the most recent version. That would mean
+closer to the 3 TByte size. Let's double that for other entities and hot
+tables, then double again for indexes: 12 TBytes. Pretty big but doable.
+
+Roughly, 12 TBytes SSD, 30-100 TBytes nearline (spinning disk). Both need
+replication.
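+
+The same estimate as code, taking the ~3 TByte "latest versions only" figure
+and the doubling factors from the text as given:
+
+```python
+TB = 1e12
+
+latest_only = 3 * TB                    # ~300 GByte JSON per 100M works, x10 releases
+history_100_edits = latest_only * 100   # 300 TByte
+history_10_edits = latest_only * 10     # 30 TByte
+
+# SSD tier: latest versions, doubled for other entities/hot tables, doubled
+# again for indexes
+ssd = latest_only * 2 * 2               # 12 TByte
+print(ssd / TB, history_10_edits / TB, history_100_edits / TB)  # 12.0 30.0 300.0
+```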
+
+Curious to look at FoundationDB as overall solution; can different
+tables/namespaces be on different storage backends?
+
+Cassandra probably an option for revision storage. And indexing?
+
+Merging edits and revisions into a single table/index could greatly reduce
+index size (needed for, eg, history lookups).
+
+One plan would be:
+- only index most recent versions of entities (contrib, refs, extids, etc), not all revs
+- turn either (refs, contribs, abstracts) or entire release entities into
+
+TODO short term:
+- try mass updates in QA: one pass to add release `ext_id` for all releases,
+ one pass to add release ref links to all releases. see what DB size looks
+ like. can be dummy data.