From 8da0e3aeb6cb90a489c6a27b50e7f79e539e5a38 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Tue, 14 May 2019 12:03:24 -0700 Subject: commit some proposals (impl in progress) --- proposals/20190509_schema_tweaks.md | 142 ++++++++++++++++++++++++++++++++++ proposals/20190510_release_ext_ids.md | 87 +++++++++++++++++++++ 2 files changed, 229 insertions(+) create mode 100644 proposals/20190509_schema_tweaks.md create mode 100644 proposals/20190510_release_ext_ids.md (limited to 'proposals') diff --git a/proposals/20190509_schema_tweaks.md b/proposals/20190509_schema_tweaks.md new file mode 100644 index 00000000..7e372959 --- /dev/null +++ b/proposals/20190509_schema_tweaks.md @@ -0,0 +1,142 @@ + +# SQL (and API) schema changes + +Intend to make these changes at the same time as bumping OpenAPI schema from +0.2 to 0.3, along with `20190510_editgroup_endpoint_prefix` and +`20190510_release_ext_ids`. + +Also adding some indices to speed up entity edit history views, but those are +just a performance change, not visible in API schema. + +### Structured Contrib Names + +`creator` entities already have "structured" names: in addition to +`display_name`, there are `given_name` and `surname` fields. This change is to +add these two fields to release contribs as well (to join `raw_name`). + +The two main motivations are: + +1. make various representations (eg, citation formats) of release entities + easier. CSL and many display formats require given/surname distinctions +2. improve algorithmic matching between release entities, raw metadata (eg, + from GROBID), and citation strings. Eg, biblio-glutton wants "first author + surname"; we can't provide this from existing `raw_name` field + +The status quo is that many large metadata sources often include structured +names, and we munge them into a single name. + +Some arguments against this change are: + +1. should be "normalizing" this structure into creator entities. However, + display/representation of a contributor might change between publications +2. structure isn't always deterministic from what is visible in published + documents. AKA, raw name is unambiguous (it's what is "printed" on the + document), but given/sur decomposition can be ambiguous (for individauls, or + entire locales/cultures) +3. could just stash in contrib `extra_json`. However, seems common enough to + include as full fields + +Questions/Decisions: + +- should contrib `raw_name` be changed to `display_name` for consistency with + `creator`? `raw_name` should probably always be what is in/on the document + itself, thus no. +- should we still munge a `raw_name` at insert time (we we only have structured + names), or push this on to client code to always create something for + display? + +### Rename `release_status` to `release_stage` + +Describes the field better. I think this is uncontroversial and not too +disruptive at this point. + +### New release fields: subtitle, number, version + +`subtitle`: mostly for books. could have a flat-out style guide policy against +use for articles? Already frequently add subtitle metadata as an `extra_json` +field. + +`number`: intended to represent, eg, a report number ("RFC ..."). Not to be +confused with `container-number`, `chapter`, `edition` + +`version`: intended to be a short string ("v3", "2", "third", "3.9") to +disambiguate which among multiple versions. CSL has a separate `edition` field. + +These are somewhat hard to justify as dedicated fields vs. `extra_json`. + +`subtitle` is a pretty core field for book metadata, but raises ambiguity for +other release types. + +Excited to include many reports and memos (as grey lit), for which the number +is a pretty major field, and we probably want to include in elasticsearch but +not as part of the title field, and someday perhaps an index on `number`, so +that's easier to justify. + +TODO: + +- `version` maybe should be dropped. arXiv is one possible justification, as is + sorting by this field in display. + +### Withdrawn fields + +As part of a plan to represent retractions and other "unpublishing", decided to +track when and whether a release has been "withdrawn", distinct from the +`release_stage`. + +To motivate this, consider a work that has been retracted. There are multiple +releases of different stages; should not set the `release_stage` for all to +`withdrawn` or `retracted`, because then hard to disambiguate between the +release entities. Also maybe the pre-print hasn't been formally withdrawn and +is still in the pre-print server, or maybe only the pre-print was withdrawn +(for being partial/incorrect?) while the final version is still "active". + +As with `release_date`, just `withdrawn_date` is insufficient, so we get +`withdrawn_year` also... and `withdrawn_month` in the future? Also +`withdrawn_state` for cases where we don't know even the year. This could +probably be a bool (`is_withdrawn` or `withdrawn`), but the flexibility of a +TEXT/ENUM has been nice. + +TODO: + +- boolean (`is_withdrawn`, default False) or text (`withdrawn_status`). Let's + keep text to allow evolution in the future; if the field is defined at all + it's "withdrawn" (true), if not it isn't + +### New release extids: `mag_id`, `ark_id` + +See also: `20190510_release_ext_ids`. + +- `mag_id`: Microsoft Academic Graph identifier. +- `ark_id`: ARK identifier. + +These will likely be the last identifiers added as fields on `release`; a +future two-stage refactor will be to move these out to a child table (something +like `extid_type`, `extid_value`, with a UNIQ index for lookups). + +Perhaps the `extid` table should be implemented now, starting with these +identifiers? + +### Web Capture CDX `size_bytes` + +Pretty straight-forward. + +Considered adding `extra_json` as well, to be consistent with other tables, but +feels too heavy for the CDX case. Can add later if there is an actual need; +adding fields easier than removing (for backwards compat). + +### Object/Class Name Changes + +TODO + +### Rust/Python Library Name Changes + +Do these as separate commits, after merging back in to master, for v0.3: + +- rust `fatcat-api-spec` => `fatcat-openapi` +- python `fatcat_client` => `fatcat_openapi_client` + +### More? + +`release_month`: apprently pretty common to know the year and month but not +date. I have avoided so far, seems like unnecessary complexity. Could start +as an `extra_json` field? diff --git a/proposals/20190510_release_ext_ids.md b/proposals/20190510_release_ext_ids.md new file mode 100644 index 00000000..1d2b912a --- /dev/null +++ b/proposals/20190510_release_ext_ids.md @@ -0,0 +1,87 @@ + +# Release External ID Refactor + +Goal is to make the external identifier "namespace" (number of external +identifiers supported) less scarce, while still allowing fast lookups. Adding +new identifiers would still require API schema and rust code changes. + +This change would also bring the number of columns on `release_rev` back under +32, which makes diesel happier. + +Unclear whether existing extids (like DOI) should get merged and the old +columns dropped. This could be done consistently by, eg, a rust worker process +that re-writes tables, probably in multiple stages (eg, first copy, then update +code to read from new location, then drop old columns and indices). Perhaps for +very hot/popular columns (like DOI, and maybe ISBN13?) it is better to have +separate columns/indices. + +It would be possible to roll this out as a SQL change first, with identifiers +still at the top level of the API schema, then later switch the API schema. Not +sure this is worth it though. + +## New API + +All identifers as text + + release_entity + ext_ids (required) + doi + pmid + pmcid + wikidata_qid + core + isbn13 + arxiv + jstor + [...] + +## New SQL Schema + +Something like: + + release_rev_extid ( + release_rev_id UUID foreign key to release_rev.id + extid_type TEXT + extid_value TEXT + ) + PRIMARY KEY (release_rev_id, extid_type) + INDEX (extid_type, extid_value) + +### Existing prod Column Use + +Queries like: + + fatcat_prod=# SELECT COUNT(*) FROM release_rev WHERE wikidata_qid IS NOT NULL; + 13460723 + +Results: + + wikidata_qid: 13460723 + isbn13: 1 + core_id: 7160477 + arxiv_id: 3 + jstor_id: 0 + pmcid: 3688945 + +Keep in SQL: + +- `doi` +- `pmid` +- `pmcid` +- `wikidata_qid` +- `core_id` + +Drop columns: + +- `isbn13` +- `arxiv_id` +- `jstor_id` + +In new table: + +- isbn13 +- arxiv +- jstor +- mag +- ark +- dblp -- cgit v1.2.3