diff options
| author | Bryan Newbold <bnewbold@robocracy.org> | 2020-02-19 17:31:52 -0800 | 
|---|---|---|
| committer | Bryan Newbold <bnewbold@robocracy.org> | 2020-02-19 17:31:52 -0800 | 
| commit | 64d3c5475921a8024b083558ca96bb15e27f48c1 (patch) | |
| tree | 683d2ad123978675865401f2f348dd1b99d0e88c | |
| parent | 9cc369e19d7ba82d07be3d6b24c2526339135a0a (diff) | |
| download | fatcat-64d3c5475921a8024b083558ca96bb15e27f48c1.tar.gz fatcat-64d3c5475921a8024b083558ca96bb15e27f48c1.zip | |
more SQL db size reduction notes
| -rw-r--r-- | proposals/2020_sql_size_reduction.md | 16 | 
1 files changed, 16 insertions, 0 deletions
| diff --git a/proposals/2020_sql_size_reduction.md b/proposals/2020_sql_size_reduction.md index f421e455..2fa39873 100644 --- a/proposals/2020_sql_size_reduction.md +++ b/proposals/2020_sql_size_reduction.md @@ -52,6 +52,8 @@ Other growth is expected to be much smaller, let's say a few GB of disk.  This works out to a bit over 600 GByte total disk size. +NOTE: math was wrong? 470 + 80 + 100 -> 650 GByte, call it 700 GByte +  ## Idea: finish `ext_id` migration and drop columns+index from `release_rev` @@ -172,3 +174,17 @@ would drop ~20% of data size and ~20% of index size.  Would it make more sense to use {ident, editgroup} as the primary key and UNIQ,  then have a separate index on `editgroup`? On the assumption that `editgroup`  cardinality is much smaller, thus the index disk usage would be smaller. + +## Idea: use binary for hashes + +We currently store file hashes (SHA-1, SHA-256, MD5) and abstracts/`ref_blobs` +keys as TEXT in lower-case hex encoding. Using binary instead could be as much +as a 50% size savings for both column and index storage. The difference becomes +more apparent when all files have all hashes populated. + +base32 encoded strings would be smaller (but non-negligable) savings. + +This change has a reasonable migration path, is entirely internal to postgres +and fatcatd, and would be no change to API schema. Postgres also allows `hex` +encoding on `bytea` data type, which can make reading/debugging reasonable. + | 
