summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2020-01-21 18:56:43 -0800
committerBryan Newbold <bnewbold@robocracy.org>2020-01-21 18:56:43 -0800
commit2a1f93f0a0fa3cbded45bd14a38452b49bd6578c (patch)
treec28967db0f3685b68b1542054ba141f0c0c895ab
parent9d3abb010249576ddc6c86b4c7c4c5bbb6561ecb (diff)
downloadfatcat-2a1f93f0a0fa3cbded45bd14a38452b49bd6578c.tar.gz
fatcat-2a1f93f0a0fa3cbded45bd14a38452b49bd6578c.zip
proposal of ideas for reducing database size
-rw-r--r--proposals/2020_sql_size_reduction.md154
1 files changed, 154 insertions, 0 deletions
diff --git a/proposals/2020_sql_size_reduction.md b/proposals/2020_sql_size_reduction.md
new file mode 100644
index 00000000..8648c19a
--- /dev/null
+++ b/proposals/2020_sql_size_reduction.md
@@ -0,0 +1,154 @@
+
+status: brainstorming
+
+This document is tracking ideas to reduce fatcat catalog database size (postgresql).
+
+As of January 2020, the prod database is over 470 GBytes, on DB hosts with 2
+TByte disks. We are probably fine doubling (up to ~1000 GBytes) with no serious
+issues, and there are lots of ideas beyond that (4 TByte SSDs disks, multiple
+SSDs using RAID or postgres table spaces, some spinning disk usage, etc), but
+in the meanwhile it might be worth trying to slim down the existing schema.
+
+See also `20191018_bigger_db.md` about far future database size. Scale-out
+(horizontal) and other larger refactors are out of scope for this document.
+
+
+## Current Database Usage and Growth
+
+`extra/stats/2020-01-19-prod-table-sizes.txt` shows table data and index sizes.
+
+To categorize these in a few different ways:
+
+- release entities (incl. edits, ident, revs)
+ - 72% 337 GB total, 215 GB data, 122 GB index
+- file entities (incl. edits, ident, revs):
+ - 7% 34 GB total, 18 GB data, 16 GB index
+- release entity revs (no edits, ident)
+ - 61% 287 GB total, 200 GB data, 87 GB index
+- entity edits only
+ - 15% 71 GB total, 30 GB data, 40 GB index
+
+`refs_blob` is large, but is "content addressed", so as long as we don't
+*update* the *non-linkage* reference metadata, will not grow. We are also
+planning on holding off on dumping all references (citation graph) directly
+into the catalog at this time.
+
+`release_contrib`, `release_rev`, and `release_edit` are more concerning, as
+these are expected to grow linearly (or faster). Just those three tables are
+46% of the database.
+
+We expect to both update almost all files and quadruple the number of files in
+the near future (roughly 25m currently; expect 80m+20m edits soon, so 5x
+database size). Will also be filling in additional hashes and other metadata.
+All that together would be only some 80 GB additional disk size; maybe less if
+we de-dupe URLs.
+
+Releases we are likely to want to do something like 20 million entity updates (as
+part of cleanups and merging) and add another 20 million entities (from CORE,
+MAG, longtail OA, etc). That would be about a 30% increase in rev count, or
+another 100 GB of data+index.
+
+Other growth is expected to be much smaller, let's say a few GB of disk.
+
+This works out to a bit over 600 GByte total disk size.
+
+
+## Idea: finish `ext_id` migration and drop columns+index from `release_rev`
+
+Every `release_rev` table has DOI, CORE, PMID, PMCID, and Wikidata columns,
+*and* indices on all of these. For all but DOI, the majority of release revs
+have these columns NULL. This doesn't waste much disk, but it does waste index:
+
+ fatcat_prod=# \di+ release_rev*
+ List of relations
+ Schema | Name | Type | Owner | Table | Size | Description
+ --------+----------------------------------+-------+----------+----------------------+---------+-------------
+ public | release_rev_abstract_pkey | index | postgres | release_rev_abstract | 469 MB |
+ public | release_rev_abstract_rev_idx | index | postgres | release_rev_abstract | 658 MB |
+ public | release_rev_abstract_sha1_idx | index | postgres | release_rev_abstract | 1416 MB |
+ public | release_rev_core_idx | index | postgres | release_rev | 3156 MB |
+ public | release_rev_doi_idx | index | postgres | release_rev | 6124 MB |
+ public | release_rev_extid_pkey | index | fatcat | release_rev_extid | 119 MB |
+ public | release_rev_extid_type_value_idx | index | fatcat | release_rev_extid | 117 MB |
+ public | release_rev_pkey | index | postgres | release_rev | 4344 MB |
+ public | release_rev_pmcid_idx | index | postgres | release_rev | 3146 MB |
+ public | release_rev_pmid_idx | index | postgres | release_rev | 3271 MB |
+ public | release_rev_wikidata_idx | index | postgres | release_rev | 3223 MB |
+ public | release_rev_work_idx | index | postgres | release_rev | 4344 MB |
+ (12 rows)
+
+That's 3+ GByte for indices with relatively few values. Potentially 12-15 GByte
+of savings, even accounting for the fact that the extid table would grow by
+several GB.
+
+To do this, would need to do something like:
+
+- have fatcatd to write into both rows and extid table on insert/update
+- iterate over all all `release_rev` rows, and for each of these columns insert
+ into the `ext_id` table if the value is set
+- update fatcatd to read from extid table, and stop inserting these columns in
+ rev table
+- update schema to drop old indices and columns
+
+This is non-trivial development and operational work; probably a good week for
+bnewbold if all goes well?
+
+
+## Idea: drop contrib and reference entity indices
+
+The contribs table has a foreign key to the creator table. The release table as
+a foreign key to the release table. Both of these have "reverse" indices,
+allowing things like "all papers for creator" and "all releases referencing
+this one". Neither of these are *really* necessary; they could be offloaded to
+the search index (would, of course, increase that index size).
+
+ fatcat_prod=# \di+ release_contrib*
+ List of relations
+ Schema | Name | Type | Owner | Table | Size | Description
+ --------+-----------------------------+-------+----------+-----------------+-------+-------------
+ public | release_contrib_creator_idx | index | postgres | release_contrib | 11 GB |
+ public | release_contrib_pkey | index | postgres | release_contrib | 11 GB |
+ public | release_contrib_rev_idx | index | postgres | release_contrib | 15 GB |
+ (3 rows)
+
+
+ fatcat_prod=# \di+ release_ref*
+ List of relations
+ Schema | Name | Type | Owner | Table | Size | Description
+ --------+--------------------------------+-------+----------+-------------+---------+-------------
+ public | release_ref_pkey | index | postgres | release_ref | 2373 MB |
+ public | release_ref_target_release_idx | index | postgres | release_ref | 1843 MB |
+ (2 rows)
+
+Looks like about 13 GByte could be saved here.
+
+## Idea: re-order table columns
+
+I've read that there can be some data storage savings by moving fields which
+are often null to the end of table schemas.
+
+Need to research/calculate what savings might be.
+
+Not sure how implementation would work... dump data values (plain or custom
+`pg_dump`?) then reload, probably.
+
+## Idea: `release_rev` and `file_rev` BIGINT for sub-tables
+
+`_rev` table primary keys are UUIDs (16 bytes). All child tables (eg,
+`release_contrib`, `file_rev_url`) have both foreign keys and indices on those
+foreign keys to this table.
+
+We could add new BIGINT secondary identifiers to the `_rev` tables (no index
+needed), and switch all the child tables to use that. Because the child tables
+are usually much larger than the `_rev` tables, this could save a lot of data
+and index space.
+
+It is a complex change, especially if it was only for `release`/`file`.
+
+## Idea: Refactor `_edit` tables
+
+Feels like there might be a bunch of waste here. Could make `{editgroup,
+ident}` the primary key, which would remove need for a separate identifier and
+index (we already have a UNIQ constraint on this).
+
+TODO: investigate further.