diff options
author | Bryan Newbold <bnewbold@robocracy.org> | 2020-01-03 16:05:07 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2020-01-03 16:05:07 -0800 |
commit | 3a57c35ddcf794d7211d1649e74a9917bd1c9495 (patch) | |
tree | 48f0c420d2048eeeec4add0b4523cb6a0a14dcfe /proposals/20190509_schema_tweaks.md | |
parent | d5e2af24cb6563eca91407425cea9b808a7d691c (diff) | |
download | fatcat-3a57c35ddcf794d7211d1649e74a9917bd1c9495.tar.gz fatcat-3a57c35ddcf794d7211d1649e74a9917bd1c9495.zip |
proposals: standardize a bit
Diffstat (limited to 'proposals/20190509_schema_tweaks.md')
-rw-r--r-- | proposals/20190509_schema_tweaks.md | 142 |
1 files changed, 0 insertions, 142 deletions
diff --git a/proposals/20190509_schema_tweaks.md b/proposals/20190509_schema_tweaks.md deleted file mode 100644 index 7e372959..00000000 --- a/proposals/20190509_schema_tweaks.md +++ /dev/null @@ -1,142 +0,0 @@ - -# SQL (and API) schema changes - -Intend to make these changes at the same time as bumping OpenAPI schema from -0.2 to 0.3, along with `20190510_editgroup_endpoint_prefix` and -`20190510_release_ext_ids`. - -Also adding some indices to speed up entity edit history views, but those are -just a performance change, not visible in API schema. - -### Structured Contrib Names - -`creator` entities already have "structured" names: in addition to -`display_name`, there are `given_name` and `surname` fields. This change is to -add these two fields to release contribs as well (to join `raw_name`). - -The two main motivations are: - -1. make various representations (eg, citation formats) of release entities - easier. CSL and many display formats require given/surname distinctions -2. improve algorithmic matching between release entities, raw metadata (eg, - from GROBID), and citation strings. Eg, biblio-glutton wants "first author - surname"; we can't provide this from existing `raw_name` field - -The status quo is that many large metadata sources often include structured -names, and we munge them into a single name. - -Some arguments against this change are: - -1. should be "normalizing" this structure into creator entities. However, - display/representation of a contributor might change between publications -2. structure isn't always deterministic from what is visible in published - documents. AKA, raw name is unambiguous (it's what is "printed" on the - document), but given/sur decomposition can be ambiguous (for individauls, or - entire locales/cultures) -3. could just stash in contrib `extra_json`. However, seems common enough to - include as full fields - -Questions/Decisions: - -- should contrib `raw_name` be changed to `display_name` for consistency with - `creator`? `raw_name` should probably always be what is in/on the document - itself, thus no. -- should we still munge a `raw_name` at insert time (we we only have structured - names), or push this on to client code to always create something for - display? - -### Rename `release_status` to `release_stage` - -Describes the field better. I think this is uncontroversial and not too -disruptive at this point. - -### New release fields: subtitle, number, version - -`subtitle`: mostly for books. could have a flat-out style guide policy against -use for articles? Already frequently add subtitle metadata as an `extra_json` -field. - -`number`: intended to represent, eg, a report number ("RFC ..."). Not to be -confused with `container-number`, `chapter`, `edition` - -`version`: intended to be a short string ("v3", "2", "third", "3.9") to -disambiguate which among multiple versions. CSL has a separate `edition` field. - -These are somewhat hard to justify as dedicated fields vs. `extra_json`. - -`subtitle` is a pretty core field for book metadata, but raises ambiguity for -other release types. - -Excited to include many reports and memos (as grey lit), for which the number -is a pretty major field, and we probably want to include in elasticsearch but -not as part of the title field, and someday perhaps an index on `number`, so -that's easier to justify. - -TODO: - -- `version` maybe should be dropped. arXiv is one possible justification, as is - sorting by this field in display. - -### Withdrawn fields - -As part of a plan to represent retractions and other "unpublishing", decided to -track when and whether a release has been "withdrawn", distinct from the -`release_stage`. - -To motivate this, consider a work that has been retracted. There are multiple -releases of different stages; should not set the `release_stage` for all to -`withdrawn` or `retracted`, because then hard to disambiguate between the -release entities. Also maybe the pre-print hasn't been formally withdrawn and -is still in the pre-print server, or maybe only the pre-print was withdrawn -(for being partial/incorrect?) while the final version is still "active". - -As with `release_date`, just `withdrawn_date` is insufficient, so we get -`withdrawn_year` also... and `withdrawn_month` in the future? Also -`withdrawn_state` for cases where we don't know even the year. This could -probably be a bool (`is_withdrawn` or `withdrawn`), but the flexibility of a -TEXT/ENUM has been nice. - -TODO: - -- boolean (`is_withdrawn`, default False) or text (`withdrawn_status`). Let's - keep text to allow evolution in the future; if the field is defined at all - it's "withdrawn" (true), if not it isn't - -### New release extids: `mag_id`, `ark_id` - -See also: `20190510_release_ext_ids`. - -- `mag_id`: Microsoft Academic Graph identifier. -- `ark_id`: ARK identifier. - -These will likely be the last identifiers added as fields on `release`; a -future two-stage refactor will be to move these out to a child table (something -like `extid_type`, `extid_value`, with a UNIQ index for lookups). - -Perhaps the `extid` table should be implemented now, starting with these -identifiers? - -### Web Capture CDX `size_bytes` - -Pretty straight-forward. - -Considered adding `extra_json` as well, to be consistent with other tables, but -feels too heavy for the CDX case. Can add later if there is an actual need; -adding fields easier than removing (for backwards compat). - -### Object/Class Name Changes - -TODO - -### Rust/Python Library Name Changes - -Do these as separate commits, after merging back in to master, for v0.3: - -- rust `fatcat-api-spec` => `fatcat-openapi` -- python `fatcat_client` => `fatcat_openapi_client` - -### More? - -`release_month`: apprently pretty common to know the year and month but not -date. I have avoided so far, seems like unnecessary complexity. Could start -as an `extra_json` field? |