diff options
author | Bryan Newbold <bnewbold@robocracy.org> | 2020-01-21 18:56:43 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2020-01-21 18:56:43 -0800 |
commit | 2a1f93f0a0fa3cbded45bd14a38452b49bd6578c (patch) | |
tree | c28967db0f3685b68b1542054ba141f0c0c895ab | |
parent | 9d3abb010249576ddc6c86b4c7c4c5bbb6561ecb (diff) | |
download | fatcat-2a1f93f0a0fa3cbded45bd14a38452b49bd6578c.tar.gz fatcat-2a1f93f0a0fa3cbded45bd14a38452b49bd6578c.zip |
proposal of ideas for reducing database size
-rw-r--r-- | proposals/2020_sql_size_reduction.md | 154 |
1 files changed, 154 insertions, 0 deletions
diff --git a/proposals/2020_sql_size_reduction.md b/proposals/2020_sql_size_reduction.md new file mode 100644 index 00000000..8648c19a --- /dev/null +++ b/proposals/2020_sql_size_reduction.md @@ -0,0 +1,154 @@ + +status: brainstorming + +This document is tracking ideas to reduce fatcat catalog database size (postgresql). + +As of January 2020, the prod database is over 470 GBytes, on DB hosts with 2 +TByte disks. We are probably fine doubling (up to ~1000 GBytes) with no serious +issues, and there are lots of ideas beyond that (4 TByte SSDs disks, multiple +SSDs using RAID or postgres table spaces, some spinning disk usage, etc), but +in the meanwhile it might be worth trying to slim down the existing schema. + +See also `20191018_bigger_db.md` about far future database size. Scale-out +(horizontal) and other larger refactors are out of scope for this document. + + +## Current Database Usage and Growth + +`extra/stats/2020-01-19-prod-table-sizes.txt` shows table data and index sizes. + +To categorize these in a few different ways: + +- release entities (incl. edits, ident, revs) + - 72% 337 GB total, 215 GB data, 122 GB index +- file entities (incl. edits, ident, revs): + - 7% 34 GB total, 18 GB data, 16 GB index +- release entity revs (no edits, ident) + - 61% 287 GB total, 200 GB data, 87 GB index +- entity edits only + - 15% 71 GB total, 30 GB data, 40 GB index + +`refs_blob` is large, but is "content addressed", so as long as we don't +*update* the *non-linkage* reference metadata, will not grow. We are also +planning on holding off on dumping all references (citation graph) directly +into the catalog at this time. + +`release_contrib`, `release_rev`, and `release_edit` are more concerning, as +these are expected to grow linearly (or faster). Just those three tables are +46% of the database. + +We expect to both update almost all files and quadruple the number of files in +the near future (roughly 25m currently; expect 80m+20m edits soon, so 5x +database size). Will also be filling in additional hashes and other metadata. +All that together would be only some 80 GB additional disk size; maybe less if +we de-dupe URLs. + +Releases we are likely to want to do something like 20 million entity updates (as +part of cleanups and merging) and add another 20 million entities (from CORE, +MAG, longtail OA, etc). That would be about a 30% increase in rev count, or +another 100 GB of data+index. + +Other growth is expected to be much smaller, let's say a few GB of disk. + +This works out to a bit over 600 GByte total disk size. + + +## Idea: finish `ext_id` migration and drop columns+index from `release_rev` + +Every `release_rev` table has DOI, CORE, PMID, PMCID, and Wikidata columns, +*and* indices on all of these. For all but DOI, the majority of release revs +have these columns NULL. This doesn't waste much disk, but it does waste index: + + fatcat_prod=# \di+ release_rev* + List of relations + Schema | Name | Type | Owner | Table | Size | Description + --------+----------------------------------+-------+----------+----------------------+---------+------------- + public | release_rev_abstract_pkey | index | postgres | release_rev_abstract | 469 MB | + public | release_rev_abstract_rev_idx | index | postgres | release_rev_abstract | 658 MB | + public | release_rev_abstract_sha1_idx | index | postgres | release_rev_abstract | 1416 MB | + public | release_rev_core_idx | index | postgres | release_rev | 3156 MB | + public | release_rev_doi_idx | index | postgres | release_rev | 6124 MB | + public | release_rev_extid_pkey | index | fatcat | release_rev_extid | 119 MB | + public | release_rev_extid_type_value_idx | index | fatcat | release_rev_extid | 117 MB | + public | release_rev_pkey | index | postgres | release_rev | 4344 MB | + public | release_rev_pmcid_idx | index | postgres | release_rev | 3146 MB | + public | release_rev_pmid_idx | index | postgres | release_rev | 3271 MB | + public | release_rev_wikidata_idx | index | postgres | release_rev | 3223 MB | + public | release_rev_work_idx | index | postgres | release_rev | 4344 MB | + (12 rows) + +That's 3+ GByte for indices with relatively few values. Potentially 12-15 GByte +of savings, even accounting for the fact that the extid table would grow by +several GB. + +To do this, would need to do something like: + +- have fatcatd to write into both rows and extid table on insert/update +- iterate over all all `release_rev` rows, and for each of these columns insert + into the `ext_id` table if the value is set +- update fatcatd to read from extid table, and stop inserting these columns in + rev table +- update schema to drop old indices and columns + +This is non-trivial development and operational work; probably a good week for +bnewbold if all goes well? + + +## Idea: drop contrib and reference entity indices + +The contribs table has a foreign key to the creator table. The release table as +a foreign key to the release table. Both of these have "reverse" indices, +allowing things like "all papers for creator" and "all releases referencing +this one". Neither of these are *really* necessary; they could be offloaded to +the search index (would, of course, increase that index size). + + fatcat_prod=# \di+ release_contrib* + List of relations + Schema | Name | Type | Owner | Table | Size | Description + --------+-----------------------------+-------+----------+-----------------+-------+------------- + public | release_contrib_creator_idx | index | postgres | release_contrib | 11 GB | + public | release_contrib_pkey | index | postgres | release_contrib | 11 GB | + public | release_contrib_rev_idx | index | postgres | release_contrib | 15 GB | + (3 rows) + + + fatcat_prod=# \di+ release_ref* + List of relations + Schema | Name | Type | Owner | Table | Size | Description + --------+--------------------------------+-------+----------+-------------+---------+------------- + public | release_ref_pkey | index | postgres | release_ref | 2373 MB | + public | release_ref_target_release_idx | index | postgres | release_ref | 1843 MB | + (2 rows) + +Looks like about 13 GByte could be saved here. + +## Idea: re-order table columns + +I've read that there can be some data storage savings by moving fields which +are often null to the end of table schemas. + +Need to research/calculate what savings might be. + +Not sure how implementation would work... dump data values (plain or custom +`pg_dump`?) then reload, probably. + +## Idea: `release_rev` and `file_rev` BIGINT for sub-tables + +`_rev` table primary keys are UUIDs (16 bytes). All child tables (eg, +`release_contrib`, `file_rev_url`) have both foreign keys and indices on those +foreign keys to this table. + +We could add new BIGINT secondary identifiers to the `_rev` tables (no index +needed), and switch all the child tables to use that. Because the child tables +are usually much larger than the `_rev` tables, this could save a lot of data +and index space. + +It is a complex change, especially if it was only for `release`/`file`. + +## Idea: Refactor `_edit` tables + +Feels like there might be a bunch of waste here. Could make `{editgroup, +ident}` the primary key, which would remove need for a separate identifier and +index (we already have a UNIQ constraint on this). + +TODO: investigate further. |