From 8da0e3aeb6cb90a489c6a27b50e7f79e539e5a38 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Tue, 14 May 2019 12:03:24 -0700 Subject: commit some proposals (impl in progress) --- proposals/20190510_release_ext_ids.md | 87 +++++++++++++++++++++++++++++++++++ 1 file changed, 87 insertions(+) create mode 100644 proposals/20190510_release_ext_ids.md (limited to 'proposals/20190510_release_ext_ids.md') diff --git a/proposals/20190510_release_ext_ids.md b/proposals/20190510_release_ext_ids.md new file mode 100644 index 00000000..1d2b912a --- /dev/null +++ b/proposals/20190510_release_ext_ids.md @@ -0,0 +1,87 @@ + +# Release External ID Refactor + +Goal is to make the external identifier "namespace" (number of external +identifiers supported) less scarce, while still allowing fast lookups. Adding +new identifiers would still require API schema and rust code changes. + +This change would also bring the number of columns on `release_rev` back under +32, which makes diesel happier. + +Unclear whether existing extids (like DOI) should get merged and the old +columns dropped. This could be done consistently by, eg, a rust worker process +that re-writes tables, probably in multiple stages (eg, first copy, then update +code to read from new location, then drop old columns and indices). Perhaps for +very hot/popular columns (like DOI, and maybe ISBN13?) it is better to have +separate columns/indices. + +It would be possible to roll this out as a SQL change first, with identifiers +still at the top level of the API schema, then later switch the API schema. Not +sure this is worth it though. + +## New API + +All identifers as text + + release_entity + ext_ids (required) + doi + pmid + pmcid + wikidata_qid + core + isbn13 + arxiv + jstor + [...] + +## New SQL Schema + +Something like: + + release_rev_extid ( + release_rev_id UUID foreign key to release_rev.id + extid_type TEXT + extid_value TEXT + ) + PRIMARY KEY (release_rev_id, extid_type) + INDEX (extid_type, extid_value) + +### Existing prod Column Use + +Queries like: + + fatcat_prod=# SELECT COUNT(*) FROM release_rev WHERE wikidata_qid IS NOT NULL; + 13460723 + +Results: + + wikidata_qid: 13460723 + isbn13: 1 + core_id: 7160477 + arxiv_id: 3 + jstor_id: 0 + pmcid: 3688945 + +Keep in SQL: + +- `doi` +- `pmid` +- `pmcid` +- `wikidata_qid` +- `core_id` + +Drop columns: + +- `isbn13` +- `arxiv_id` +- `jstor_id` + +In new table: + +- isbn13 +- arxiv +- jstor +- mag +- ark +- dblp -- cgit v1.2.3