proposals/20191018_bigger_db.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85


Status: brainstorm

## Catalog Database Scaling

How can we scale the fatcat backend to support:

- one billion release entities
- 5 files, 1 webcapture, 1 fileset per release (average)
- 2 abstracts per release (average)
- 100 revisions per release
- average of 10 creators and 50 linked references per release revision

Motivated by:
- 200 million paper works; 300 million releases
- 200 million books; 300 million editions
- 100 million greylit
- 100 million blog posts
- 100 million other web/platform things
=> 900 million releases, round to 100 million

Want "abundance" for release edits, not concern about over-editing, thus the
100 reversion number. Break that down as:

- 5 publisher metadata updates
- 3 updates of container/publisher
- 3 updates to merge under works
- 5 updates to fix release type, stage, license
- 10 other general metadata fixes (title, abstract, language, etc)
- 10 updates to add/fix external identifiers
- 20-50 = update per reference (linking)
- 10-20 = updates per contrib (linking)
=> 66-106 edits; round to 100
=> almost no updates touch both reference and contribs
=> 1/3 to 1/2 of edits don't update either

this would mean:

- 1 billion release idents (10x current)
- 100 billion release revisions and edits (1000x current)
- 2 billion changelog entries (1000x current)
- 1 trillion creator rows (vastly larger)
- 5 trillion reference rows (vastly larger)

based on current row sizes:
- release_ident: 77 GByte data, 140+ GByte index => 220+ GByte
- release_rev: 44 => 44 TByte
- contribs: 32 G => 32 TByte
- release_edit: 11 Gbyte => 11 TByte
- refs_blob: 77 G => 77 TByte (and maybe larger?)

No table/index over 1 TByte?

That's crazy for reference and contribs, unsustainable. Need to assume those
only get updated when actually updated, thus more like 10x per release: 3.2 and
7.7 TByte.

Another way to estimate is from crossref dump size, which I think is now like
300 GBytes JSON uncompressed for ~100 million works with many references and
other metadata included. 1 billion would be about 3 TBytes. 100 edits would
mean 300 TBytes; 10 edits would mean 30 TBytes.

What wants to be on an SSD? Just the most recent version. That would mean
closer to the 3 TByte size. Let's double that for other entities and hot
tables, then double again for indexes: 12 TBytes. Pretty big but doable.

Roughly, 12 TBytes SSD, 30-100 TBytes nearline (spinning disk). Both need
replication.

Curious to look at FoundationDB as overall solution; can different
tables/namespaces be on different storage backends?

Cassandra probably an option for revision storage. And indexing?

Merging edits and revisions into a single table/index could greatly reduce
index size (needed for, eg, history lookups).

One plan would be:
- only index most recent versions of entities (contrib, refs, extids, etc), not all revs
- turn either (refs, contribs, abstracts) or entire release entities into

TODO short term:
- try mass updates in QA: one pass to add release `ext_id` for all releases,
  one pass to add release ref links to all releases. see what DB size looks
  like. can be dummy data.