proposals: standardize a bit

author: Bryan Newbold <bnewbold@robocracy.org> 2020-01-03 16:05:07 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2020-01-03 16:05:07 -0800
commit: 3a57c35ddcf794d7211d1649e74a9917bd1c9495 (patch)
tree: 48f0c420d2048eeeec4add0b4523cb6a0a14dcfe /proposals/20190509_schema_tweaks.md
parent: d5e2af24cb6563eca91407425cea9b808a7d691c (diff)
download: fatcat-3a57c35ddcf794d7211d1649e74a9917bd1c9495.tar.gz
fatcat-3a57c35ddcf794d7211d1649e74a9917bd1c9495.zip
1 files changed, 0 insertions, 142 deletions
diff --git a/proposals/20190509_schema_tweaks.md b/proposals/20190509_schema_tweaks.md
deleted file mode 100644
index 7e372959..00000000
--- a/proposals/20190509_schema_tweaks.md
+++ /dev/null
@@ -1,142 +0,0 @@
-
-# SQL (and API) schema changes
-
-Intend to make these changes at the same time as bumping OpenAPI schema from
-0.2 to 0.3, along with `20190510_editgroup_endpoint_prefix` and
-`20190510_release_ext_ids`.
-
-Also adding some indices to speed up entity edit history views, but those are
-just a performance change, not visible in API schema.
-
-### Structured Contrib Names
-
-`creator` entities already have "structured" names: in addition to
-`display_name`, there are `given_name` and `surname` fields. This change is to
-add these two fields to release contribs as well (to join `raw_name`).
-
-The two main motivations are:
-
-1. make various representations (eg, citation formats) of release entities
-   easier. CSL and many display formats require given/surname distinctions
-2. improve algorithmic matching between release entities, raw metadata (eg,
-   from GROBID), and citation strings. Eg, biblio-glutton wants "first author
-   surname"; we can't provide this from existing `raw_name` field
-
-The status quo is that many large metadata sources often include structured
-names, and we munge them into a single name.
-
-Some arguments against this change are:
-
-1. should be "normalizing" this structure into creator entities. However,
-   display/representation of a contributor might change between publications
-2. structure isn't always deterministic from what is visible in published
-   documents. AKA, raw name is unambiguous (it's what is "printed" on the
-   document), but given/sur decomposition can be ambiguous (for individauls, or
-   entire locales/cultures)
-3. could just stash in contrib `extra_json`. However, seems common enough to
-   include as full fields
-
-Questions/Decisions:
-
-- should contrib `raw_name` be changed to `display_name` for consistency with
-  `creator`? `raw_name` should probably always be what is in/on the document
-  itself, thus no.
-- should we still munge a `raw_name` at insert time (we we only have structured
-  names), or push this on to client code to always create something for
-  display?
-
-### Rename `release_status` to `release_stage`
-
-Describes the field better. I think this is uncontroversial and not too
-disruptive at this point.
-
-### New release fields: subtitle, number, version
-
-`subtitle`: mostly for books. could have a flat-out style guide policy against
-use for articles? Already frequently add subtitle metadata as an `extra_json`
-field.
-
-`number`: intended to represent, eg, a report number ("RFC ..."). Not to be
-confused with `container-number`, `chapter`, `edition`
-
-`version`: intended to be a short string ("v3", "2", "third", "3.9") to
-disambiguate which among multiple versions. CSL has a separate `edition` field.
-
-These are somewhat hard to justify as dedicated fields vs. `extra_json`.
-
-`subtitle` is a pretty core field for book metadata, but raises ambiguity for
-other release types.
-
-Excited to include many reports and memos (as grey lit), for which the number
-is a pretty major field, and we probably want to include in elasticsearch but
-not as part of the title field, and someday perhaps an index on `number`, so
-that's easier to justify.
-
-TODO:
-
-- `version` maybe should be dropped. arXiv is one possible justification, as is
-  sorting by this field in display.
-
-### Withdrawn fields
-
-As part of a plan to represent retractions and other "unpublishing", decided to
-track when and whether a release has been "withdrawn", distinct from the
-`release_stage`.
-
-To motivate this, consider a work that has been retracted. There are multiple
-releases of different stages; should not set the `release_stage` for all to
-`withdrawn` or `retracted`, because then hard to disambiguate between the
-release entities. Also maybe the pre-print hasn't been formally withdrawn and
-is still in the pre-print server, or maybe only the pre-print was withdrawn
-(for being partial/incorrect?) while the final version is still "active".
-
-As with `release_date`, just `withdrawn_date` is insufficient, so we get
-`withdrawn_year` also...  and `withdrawn_month` in the future? Also
-`withdrawn_state` for cases where we don't know even the year. This could
-probably be a bool (`is_withdrawn` or `withdrawn`), but the flexibility of a
-TEXT/ENUM has been nice.
-
-TODO:
-
-- boolean (`is_withdrawn`, default False) or text (`withdrawn_status`). Let's
-  keep text to allow evolution in the future; if the field is defined at all
-  it's "withdrawn" (true), if not it isn't
-
-### New release extids: `mag_id`, `ark_id`
-
-See also: `20190510_release_ext_ids`.
-
-- `mag_id`: Microsoft Academic Graph identifier.
-- `ark_id`: ARK identifier.
-
-These will likely be the last identifiers added as fields on `release`; a
-future two-stage refactor will be to move these out to a child table (something
-like `extid_type`, `extid_value`, with a UNIQ index for lookups).
-
-Perhaps the `extid` table should be implemented now, starting with these
-identifiers?
-
-### Web Capture CDX `size_bytes`
-
-Pretty straight-forward. 
-
-Considered adding `extra_json` as well, to be consistent with other tables, but
-feels too heavy for the CDX case. Can add later if there is an actual need;
-adding fields easier than removing (for backwards compat).
-
-### Object/Class Name Changes
-
-TODO
-
-### Rust/Python Library Name Changes
-
-Do these as separate commits, after merging back in to master, for v0.3:
-
-- rust `fatcat-api-spec` => `fatcat-openapi`
-- python `fatcat_client` => `fatcat_openapi_client`
-
-### More?
-
-`release_month`: apprently pretty common to know the year and month but not
-date. I have avoided so far, seems like unnecessary complexity. Could start
-as an `extra_json` field?
author	Bryan Newbold <bnewbold@robocracy.org>	2020-01-03 16:05:07 -0800
committer	Bryan Newbold <bnewbold@robocracy.org>	2020-01-03 16:05:07 -0800
commit	3a57c35ddcf794d7211d1649e74a9917bd1c9495 (patch)
tree	48f0c420d2048eeeec4add0b4523cb6a0a14dcfe /proposals/20190509_schema_tweaks.md
parent	d5e2af24cb6563eca91407425cea9b808a7d691c (diff)
download	fatcat-3a57c35ddcf794d7211d1649e74a9917bd1c9495.tar.gz fatcat-3a57c35ddcf794d7211d1649e74a9917bd1c9495.zip