From 64d3c5475921a8024b083558ca96bb15e27f48c1 Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@robocracy.org>
Date: Wed, 19 Feb 2020 17:31:52 -0800
Subject: more SQL db size reduction notes

---
 proposals/2020_sql_size_reduction.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

(limited to 'proposals')

diff --git a/proposals/2020_sql_size_reduction.md b/proposals/2020_sql_size_reduction.md
index f421e455..2fa39873 100644
--- a/proposals/2020_sql_size_reduction.md
+++ b/proposals/2020_sql_size_reduction.md
@@ -52,6 +52,8 @@ Other growth is expected to be much smaller, let's say a few GB of disk.
 
 This works out to a bit over 600 GByte total disk size.
 
+NOTE: math was wrong? 470 + 80 + 100 -> 650 GByte, call it 700 GByte
+
 
 ## Idea: finish `ext_id` migration and drop columns+index from `release_rev`
 
@@ -172,3 +174,17 @@ would drop ~20% of data size and ~20% of index size.
 Would it make more sense to use {ident, editgroup} as the primary key and UNIQ,
 then have a separate index on `editgroup`? On the assumption that `editgroup`
 cardinality is much smaller, thus the index disk usage would be smaller.
+
+## Idea: use binary for hashes
+
+We currently store file hashes (SHA-1, SHA-256, MD5) and abstracts/`ref_blobs`
+keys as TEXT in lower-case hex encoding. Using binary instead could be as much
+as a 50% size savings for both column and index storage. The difference becomes
+more apparent when all files have all hashes populated.
+
+base32 encoded strings would be smaller (but non-negligable) savings.
+
+This change has a reasonable migration path, is entirely internal to postgres
+and fatcatd, and would be no change to API schema. Postgres also allows `hex`
+encoding on `bytea` data type, which can make reading/debugging reasonable.
+
-- 
cgit v1.2.3