diff options
| author | Bryan Newbold <bnewbold@robocracy.org> | 2020-01-21 18:56:43 -0800 | 
|---|---|---|
| committer | Bryan Newbold <bnewbold@robocracy.org> | 2020-01-21 18:56:43 -0800 | 
| commit | 2a1f93f0a0fa3cbded45bd14a38452b49bd6578c (patch) | |
| tree | c28967db0f3685b68b1542054ba141f0c0c895ab /proposals | |
| parent | 9d3abb010249576ddc6c86b4c7c4c5bbb6561ecb (diff) | |
| download | fatcat-2a1f93f0a0fa3cbded45bd14a38452b49bd6578c.tar.gz fatcat-2a1f93f0a0fa3cbded45bd14a38452b49bd6578c.zip | |
proposal of ideas for reducing database size
Diffstat (limited to 'proposals')
| -rw-r--r-- | proposals/2020_sql_size_reduction.md | 154 | 
1 files changed, 154 insertions, 0 deletions
| diff --git a/proposals/2020_sql_size_reduction.md b/proposals/2020_sql_size_reduction.md new file mode 100644 index 00000000..8648c19a --- /dev/null +++ b/proposals/2020_sql_size_reduction.md @@ -0,0 +1,154 @@ + +status: brainstorming + +This document is tracking ideas to reduce fatcat catalog database size (postgresql). + +As of January 2020, the prod database is over 470 GBytes, on DB hosts with 2 +TByte disks. We are probably fine doubling (up to ~1000 GBytes) with no serious +issues, and there are lots of ideas beyond that (4 TByte SSDs disks, multiple +SSDs using RAID or postgres table spaces, some spinning disk usage, etc), but +in the meanwhile it might be worth trying to slim down the existing schema. + +See also `20191018_bigger_db.md` about far future database size. Scale-out +(horizontal) and other larger refactors are out of scope for this document. + + +## Current Database Usage and Growth + +`extra/stats/2020-01-19-prod-table-sizes.txt` shows table data and index sizes. + +To categorize these in a few different ways: + +- release entities (incl. edits, ident, revs) +    - 72%   337 GB total, 215 GB data, 122 GB index +- file entities (incl. edits, ident, revs): +    -  7%    34 GB total,  18 GB data,  16 GB index +- release entity revs (no edits, ident) +    - 61%   287 GB total, 200 GB data,  87 GB index +- entity edits only +    - 15%    71 GB total,  30 GB data,  40 GB index + +`refs_blob` is large, but is "content addressed", so as long as we don't +*update* the *non-linkage* reference metadata, will not grow. We are also +planning on holding off on dumping all references (citation graph) directly +into the catalog at this time. + +`release_contrib`, `release_rev`, and `release_edit` are more concerning, as +these are expected to grow linearly (or faster). Just those three tables are +46% of the database. + +We expect to both update almost all files and quadruple the number of files in +the near future (roughly 25m currently; expect 80m+20m edits soon, so 5x +database size). Will also be filling in additional hashes and other metadata. +All that together would be only some 80 GB additional disk size; maybe less if +we de-dupe URLs. + +Releases we are likely to want to do something like 20 million entity updates (as +part of cleanups and merging) and add another 20 million entities (from CORE, +MAG, longtail OA, etc). That would be about a 30% increase in rev count, or +another 100 GB of data+index. + +Other growth is expected to be much smaller, let's say a few GB of disk. + +This works out to a bit over 600 GByte total disk size. + + +## Idea: finish `ext_id` migration and drop columns+index from `release_rev` + +Every `release_rev` table has DOI, CORE, PMID, PMCID, and Wikidata columns, +*and* indices on all of these. For all but DOI, the majority of release revs +have these columns NULL. This doesn't waste much disk, but it does waste index: + +    fatcat_prod=# \di+ release_rev* +                                                  List of relations +     Schema |               Name               | Type  |  Owner   |        Table         |  Size   | Description  +    --------+----------------------------------+-------+----------+----------------------+---------+------------- +     public | release_rev_abstract_pkey        | index | postgres | release_rev_abstract | 469 MB  |  +     public | release_rev_abstract_rev_idx     | index | postgres | release_rev_abstract | 658 MB  |  +     public | release_rev_abstract_sha1_idx    | index | postgres | release_rev_abstract | 1416 MB |  +     public | release_rev_core_idx             | index | postgres | release_rev          | 3156 MB |  +     public | release_rev_doi_idx              | index | postgres | release_rev          | 6124 MB |  +     public | release_rev_extid_pkey           | index | fatcat   | release_rev_extid    | 119 MB  |  +     public | release_rev_extid_type_value_idx | index | fatcat   | release_rev_extid    | 117 MB  |  +     public | release_rev_pkey                 | index | postgres | release_rev          | 4344 MB |  +     public | release_rev_pmcid_idx            | index | postgres | release_rev          | 3146 MB |  +     public | release_rev_pmid_idx             | index | postgres | release_rev          | 3271 MB |  +     public | release_rev_wikidata_idx         | index | postgres | release_rev          | 3223 MB |  +     public | release_rev_work_idx             | index | postgres | release_rev          | 4344 MB |  +    (12 rows) + +That's 3+ GByte for indices with relatively few values. Potentially 12-15 GByte +of savings, even accounting for the fact that the extid table would grow by +several GB. + +To do this, would need to do something like: + +- have fatcatd to write into both rows and extid table on insert/update +- iterate over all all `release_rev` rows, and for each of these columns insert +  into the `ext_id` table if the value is set +- update fatcatd to read from extid table, and stop inserting these columns in +  rev table +- update schema to drop old indices and columns + +This is non-trivial development and operational work; probably a good week for +bnewbold if all goes well? + + +## Idea: drop contrib and reference entity indices + +The contribs table has a foreign key to the creator table. The release table as +a foreign key to the release table. Both of these have "reverse" indices, +allowing things like "all papers for creator" and "all releases referencing +this one". Neither of these are *really* necessary; they could be offloaded to +the search index (would, of course, increase that index size). + +    fatcat_prod=# \di+ release_contrib* +                                        List of relations +     Schema |            Name             | Type  |  Owner   |      Table      | Size  | Description  +    --------+-----------------------------+-------+----------+-----------------+-------+------------- +     public | release_contrib_creator_idx | index | postgres | release_contrib | 11 GB |  +     public | release_contrib_pkey        | index | postgres | release_contrib | 11 GB |  +     public | release_contrib_rev_idx     | index | postgres | release_contrib | 15 GB |  +    (3 rows) + + +    fatcat_prod=# \di+ release_ref* +                                            List of relations +     Schema |              Name              | Type  |  Owner   |    Table    |  Size   | Description  +    --------+--------------------------------+-------+----------+-------------+---------+------------- +     public | release_ref_pkey               | index | postgres | release_ref | 2373 MB |  +     public | release_ref_target_release_idx | index | postgres | release_ref | 1843 MB |  +    (2 rows) + +Looks like about 13 GByte could be saved here. + +## Idea: re-order table columns + +I've read that there can be some data storage savings by moving fields which +are often null to the end of table schemas. + +Need to research/calculate what savings might be. + +Not sure how implementation would work... dump data values (plain or custom +`pg_dump`?) then reload, probably. + +## Idea: `release_rev` and `file_rev` BIGINT for sub-tables + +`_rev` table primary keys are UUIDs (16 bytes). All child tables (eg, +`release_contrib`, `file_rev_url`) have both foreign keys and indices on those +foreign keys to this table. + +We could add new BIGINT secondary identifiers to the `_rev` tables (no index +needed), and switch all the child tables to use that. Because the child tables +are usually much larger than the `_rev` tables, this could save a lot of data +and index space. + +It is a complex change, especially if it was only for `release`/`file`. + +## Idea: Refactor `_edit` tables + +Feels like there might be a bunch of waste here. Could make `{editgroup, +ident}` the primary key, which would remove need for a separate identifier and +index (we already have a UNIQ constraint on this). + +TODO: investigate further. | 
