Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/2020_elasticsearch_schemas.md | 12 |
-rw-r--r-- | proposals/2020_sql_size_reduction.md | 16 |
2 files changed, 22 insertions, 6 deletions
diff --git a/proposals/2020_elasticsearch_schemas.md b/proposals/2020_elasticsearch_schemas.md
index 83db884f..c3e79073 100644
--- a/proposals/2020_elasticsearch_schemas.md
+++ b/proposals/2020_elasticsearch_schemas.md
@@ -14,8 +14,6 @@ Simple additions:
 - pages
 - `first_page` (parsed from pages) (?)
 - number
-- `in_shadow`
-- OA license slug (?)
 - `doi_prefix`
 - `doi_registrar` (based on extra)
 - `first_author` (surname; for matching)
@@ -25,6 +23,8 @@ Simple additions:
 - referenced releases idents
 - contrib creator idents
 
+Add affiliations, both as raw strings and ROR identifiers.
+
 
 ## Preservation Summary Field
 
@@ -33,8 +33,8 @@ status (from `in_kbart`, `in_ia`, etc) to a `preservation_status` flag which
 is:
 
 - `bright`
-- `dark_only`
-- `shadow_only`
+- `dark`
+- `shadows_only`
 - `none`
 
 Note that these don't align with OA color or work-level preservation (aka, no
@@ -128,8 +128,8 @@ hit does not}").
 
 ## Container Fields
 
-- `all_issns`
-- `release_count`
+- `issn` (all issns)
+- `original_name`
 
 The `release_count` would not be indexed (left null) by default, and would be
 "patched" in to entities by a separate script (periodically?).
diff --git a/proposals/2020_sql_size_reduction.md b/proposals/2020_sql_size_reduction.md
index f421e455..2fa39873 100644
--- a/proposals/2020_sql_size_reduction.md
+++ b/proposals/2020_sql_size_reduction.md
@@ -52,6 +52,8 @@ Other growth is expected to be much smaller, let's say a few GB of disk.
 
 This works out to a bit over 600 GByte total disk size.
 
+NOTE: math was wrong? 470 + 80 + 100 -> 650 GByte, call it 700 GByte
+
 
 ## Idea: finish `ext_id` migration and drop columns+index from `release_rev`
 
@@ -172,3 +174,17 @@ would drop ~20% of data size and ~20% of index size.
 Would it make more sense to use {ident, editgroup} as the primary key and UNIQ,
 then have a separate index on `editgroup`? On the assumption that `editgroup`
 cardinality is much smaller, thus the index disk usage would be smaller.
+
+## Idea: use binary for hashes
+
+We currently store file hashes (SHA-1, SHA-256, MD5) and abstracts/`ref_blobs`
+keys as TEXT in lower-case hex encoding. Using binary instead could be as much
+as a 50% size savings for both column and index storage. The difference becomes
+more apparent when all files have all hashes populated.
+
+base32 encoded strings would be smaller (but non-negligable) savings.
+
+This change has a reasonable migration path, is entirely internal to postgres
+and fatcatd, and would be no change to API schema. Postgres also allows `hex`
+encoding on `bytea` data type, which can make reading/debugging reasonable.
+
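The "as much as 50%" figure in the binary-hashes idea follows from hex using two characters per byte. As a rough illustration (not part of the patch), the Python sketch below compares the payload length of each digest in raw binary, unpadded base32, and lower-case hex. The helper name `encoded_sizes` is hypothetical, and the numbers ignore Postgres per-value and index overhead, so actual on-disk savings would likely be somewhat smaller.

```python
import base64
import hashlib


def encoded_sizes(data: bytes) -> dict:
    """Return payload lengths (in bytes) of each digest per encoding.

    Hypothetical helper for illustration only; it measures the encoded
    value itself, not Postgres row or index overhead.
    """
    sizes = {}
    for name in ("md5", "sha1", "sha256"):
        raw = hashlib.new(name, data).digest()
        sizes[name] = {
            # Raw digest bytes, as they would be stored in a binary column.
            "binary": len(raw),
            # Base32 without "=" padding: 5 bits per character.
            "base32": len(base64.b32encode(raw).rstrip(b"=")),
            # Lower-case hex: 2 characters per byte.
            "hex": len(raw.hex()),
        }
    return sizes


if __name__ == "__main__":
    for algo, row in encoded_sizes(b"example file contents").items():
        print(algo, row)
```

For all three hash types the binary form is exactly half the hex length, while base32 lands in between, which matches the "smaller savings" remark in the proposal.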