Diffstat (limited to 'proposals')
-rw-r--r-- proposals/2020_elasticsearch_schemas.md 12
-rw-r--r-- proposals/2020_sql_size_reduction.md 16
2 files changed, 22 insertions, 6 deletions
diff --git a/proposals/2020_elasticsearch_schemas.md b/proposals/2020_elasticsearch_schemas.md
index 83db884f..c3e79073 100644
--- a/proposals/2020_elasticsearch_schemas.md
+++ b/proposals/2020_elasticsearch_schemas.md
@@ -14,8 +14,6 @@ Simple additions:
- pages
- `first_page` (parsed from pages) (?)
- number
-- `in_shadow`
-- OA license slug (?)
- `doi_prefix`
- `doi_registrar` (based on extra)
- `first_author` (surname; for matching)
@@ -25,6 +23,8 @@ Simple additions:
- referenced releases idents
- contrib creator idents
+Add affiliations, both as raw strings and ROR identifiers.
+
## Preservation Summary Field
@@ -33,8 +33,8 @@ status (from `in_kbart`, `in_ia`, etc) to a `preservation_status` flag which
is:
- `bright`
-- `dark_only`
-- `shadow_only`
+- `dark`
+- `shadows_only`
- `none`
Note that these don't align with OA color or work-level preservation (aka, no
@@ -128,8 +128,8 @@ hit does not}").
## Container Fields
-- `all_issns`
-- `release_count`
+- `issn` (all issns)
+- `original_name`
The `release_count` would not be indexed (left null) by default, and would be
"patched" in to entities by a separate script (periodically?).
diff --git a/proposals/2020_sql_size_reduction.md b/proposals/2020_sql_size_reduction.md
index f421e455..2fa39873 100644
--- a/proposals/2020_sql_size_reduction.md
+++ b/proposals/2020_sql_size_reduction.md
@@ -52,6 +52,8 @@ Other growth is expected to be much smaller, let's say a few GB of disk.
This works out to a bit over 600 GByte total disk size.
+NOTE: math was wrong? 470 + 80 + 100 -> 650 GByte, call it 700 GByte
+
## Idea: finish `ext_id` migration and drop columns+index from `release_rev`
@@ -172,3 +174,17 @@ would drop ~20% of data size and ~20% of index size.
Would it make more sense to use {ident, editgroup} as the primary key and UNIQ,
then have a separate index on `editgroup`? On the assumption that `editgroup`
cardinality is much smaller, thus the index disk usage would be smaller.
+
+## Idea: use binary for hashes
+
+We currently store file hashes (SHA-1, SHA-256, MD5) and abstracts/`ref_blobs`
+keys as TEXT in lower-case hex encoding. Using binary instead could be as much
+as a 50% size savings for both column and index storage. The difference becomes
+more apparent when all files have all hashes populated.
+
+base32-encoded strings would be a smaller (but non-negligible) savings.
+
+This change has a reasonable migration path, is entirely internal to postgres
+and fatcatd, and would be no change to API schema. Postgres also allows `hex`
+encoding on `bytea` data type, which can make reading/debugging reasonable.
+
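
The size claims in the "use binary for hashes" idea can be checked with a short sketch. This is an illustration only, not code from fatcatd: the helper name `encoded_sizes` and the sample payload are made up here, and the numbers are raw value lengths, so they ignore Postgres per-value and per-row overhead (which applies to both TEXT and `bytea`).

```python
import base64
import hashlib


def encoded_sizes(digest: bytes) -> dict:
    """Return the length in bytes of one digest under each encoding.

    Hypothetical helper for illustration; not part of fatcat.
    """
    return {
        # lower-case hex, as currently stored in TEXT columns
        "hex": len(digest.hex()),
        # base32 with '=' padding stripped
        "base32": len(base64.b32encode(digest).rstrip(b"=")),
        # raw bytes, as they would be stored in a bytea column
        "binary": len(digest),
    }


# Arbitrary sample payload; any input gives the same digest lengths.
payload = b"example file contents"

sizes = {
    "sha1": encoded_sizes(hashlib.sha1(payload).digest()),
    "sha256": encoded_sizes(hashlib.sha256(payload).digest()),
    "md5": encoded_sizes(hashlib.md5(payload).digest()),
}

for name, row in sizes.items():
    print(name, row)

# Hex text converts to binary and back without loss, which is what
# keeps the migration path simple.
sha1_digest = hashlib.sha1(payload).digest()
round_trip_ok = bytes.fromhex(sha1_digest.hex()) == sha1_digest
```

Binary is exactly half the length of hex for every hash type, which is where the "as much as 50%" figure comes from. base32 saves 20% on SHA-1 (32 vs 40 characters) and a little under 20% on SHA-256 and MD5. On the Postgres side, `encode(col, 'hex')` and `decode(text, 'hex')` convert between `bytea` and hex text, so values stay readable when debugging.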