commit some proposals (impl in progress)

author: Bryan Newbold <bnewbold@robocracy.org> 2019-05-14 12:03:24 -0700
committer: Bryan Newbold <bnewbold@robocracy.org> 2019-05-14 12:03:24 -0700
commit: 8da0e3aeb6cb90a489c6a27b50e7f79e539e5a38 (patch)
tree: 6bc8cb61faf0b8b660ba8dc00919f6bd8cb5527c
parent: e63e21495c5e63f9fce3a0204f178d104c46124e (diff)
download: fatcat-8da0e3aeb6cb90a489c6a27b50e7f79e539e5a38.tar.gz
fatcat-8da0e3aeb6cb90a489c6a27b50e7f79e539e5a38.zip
2 files changed, 229 insertions, 0 deletions
diff --git a/proposals/20190509_schema_tweaks.md b/proposals/20190509_schema_tweaks.md
new file mode 100644
index 00000000..7e372959
--- /dev/null
+++ b/proposals/20190509_schema_tweaks.md
@@ -0,0 +1,142 @@
+
+# SQL (and API) schema changes
+
+Intend to make these changes at the same time as bumping OpenAPI schema from
+0.2 to 0.3, along with `20190510_editgroup_endpoint_prefix` and
+`20190510_release_ext_ids`.
+
+Also adding some indices to speed up entity edit history views, but those are
+just a performance change, not visible in API schema.
+
+### Structured Contrib Names
+
+`creator` entities already have "structured" names: in addition to
+`display_name`, there are `given_name` and `surname` fields. This change is to
+add these two fields to release contribs as well (to join `raw_name`).
+
+The two main motivations are:
+
+1. make various representations (eg, citation formats) of release entities
+   easier. CSL and many display formats require given/surname distinctions
+2. improve algorithmic matching between release entities, raw metadata (eg,
+   from GROBID), and citation strings. Eg, biblio-glutton wants "first author
+   surname"; we can't provide this from existing `raw_name` field
+
+The status quo is that many large metadata sources often include structured
+names, and we munge them into a single name.
+
+Some arguments against this change are:
+
+1. should be "normalizing" this structure into creator entities. However,
+   display/representation of a contributor might change between publications
+2. structure isn't always deterministic from what is visible in published
+   documents. AKA, raw name is unambiguous (it's what is "printed" on the
+   document), but given/sur decomposition can be ambiguous (for individauls, or
+   entire locales/cultures)
+3. could just stash in contrib `extra_json`. However, seems common enough to
+   include as full fields
+
+Questions/Decisions:
+
+- should contrib `raw_name` be changed to `display_name` for consistency with
+  `creator`? `raw_name` should probably always be what is in/on the document
+  itself, thus no.
+- should we still munge a `raw_name` at insert time (we we only have structured
+  names), or push this on to client code to always create something for
+  display?
+
+### Rename `release_status` to `release_stage`
+
+Describes the field better. I think this is uncontroversial and not too
+disruptive at this point.
+
+### New release fields: subtitle, number, version
+
+`subtitle`: mostly for books. could have a flat-out style guide policy against
+use for articles? Already frequently add subtitle metadata as an `extra_json`
+field.
+
+`number`: intended to represent, eg, a report number ("RFC ..."). Not to be
+confused with `container-number`, `chapter`, `edition`
+
+`version`: intended to be a short string ("v3", "2", "third", "3.9") to
+disambiguate which among multiple versions. CSL has a separate `edition` field.
+
+These are somewhat hard to justify as dedicated fields vs. `extra_json`.
+
+`subtitle` is a pretty core field for book metadata, but raises ambiguity for
+other release types.
+
+Excited to include many reports and memos (as grey lit), for which the number
+is a pretty major field, and we probably want to include in elasticsearch but
+not as part of the title field, and someday perhaps an index on `number`, so
+that's easier to justify.
+
+TODO:
+
+- `version` maybe should be dropped. arXiv is one possible justification, as is
+  sorting by this field in display.
+
+### Withdrawn fields
+
+As part of a plan to represent retractions and other "unpublishing", decided to
+track when and whether a release has been "withdrawn", distinct from the
+`release_stage`.
+
+To motivate this, consider a work that has been retracted. There are multiple
+releases of different stages; should not set the `release_stage` for all to
+`withdrawn` or `retracted`, because then hard to disambiguate between the
+release entities. Also maybe the pre-print hasn't been formally withdrawn and
+is still in the pre-print server, or maybe only the pre-print was withdrawn
+(for being partial/incorrect?) while the final version is still "active".
+
+As with `release_date`, just `withdrawn_date` is insufficient, so we get
+`withdrawn_year` also...  and `withdrawn_month` in the future? Also
+`withdrawn_state` for cases where we don't know even the year. This could
+probably be a bool (`is_withdrawn` or `withdrawn`), but the flexibility of a
+TEXT/ENUM has been nice.
+
+TODO:
+
+- boolean (`is_withdrawn`, default False) or text (`withdrawn_status`). Let's
+  keep text to allow evolution in the future; if the field is defined at all
+  it's "withdrawn" (true), if not it isn't
+
+### New release extids: `mag_id`, `ark_id`
+
+See also: `20190510_release_ext_ids`.
+
+- `mag_id`: Microsoft Academic Graph identifier.
+- `ark_id`: ARK identifier.
+
+These will likely be the last identifiers added as fields on `release`; a
+future two-stage refactor will be to move these out to a child table (something
+like `extid_type`, `extid_value`, with a UNIQ index for lookups).
+
+Perhaps the `extid` table should be implemented now, starting with these
+identifiers?
+
+### Web Capture CDX `size_bytes`
+
+Pretty straight-forward. 
+
+Considered adding `extra_json` as well, to be consistent with other tables, but
+feels too heavy for the CDX case. Can add later if there is an actual need;
+adding fields easier than removing (for backwards compat).
+
+### Object/Class Name Changes
+
+TODO
+
+### Rust/Python Library Name Changes
+
+Do these as separate commits, after merging back in to master, for v0.3:
+
+- rust `fatcat-api-spec` => `fatcat-openapi`
+- python `fatcat_client` => `fatcat_openapi_client`
+
+### More?
+
+`release_month`: apprently pretty common to know the year and month but not
+date. I have avoided so far, seems like unnecessary complexity. Could start
+as an `extra_json` field?
diff --git a/proposals/20190510_release_ext_ids.md b/proposals/20190510_release_ext_ids.md
new file mode 100644
index 00000000..1d2b912a
--- /dev/null
+++ b/proposals/20190510_release_ext_ids.md
@@ -0,0 +1,87 @@
+
+# Release External ID Refactor
+
+Goal is to make the external identifier "namespace" (number of external
+identifiers supported) less scarce, while still allowing fast lookups. Adding
+new identifiers would still require API schema and rust code changes.
+
+This change would also bring the number of columns on `release_rev` back under
+32, which makes diesel happier.
+
+Unclear whether existing extids (like DOI) should get merged and the old
+columns dropped. This could be done consistently by, eg, a rust worker process
+that re-writes tables, probably in multiple stages (eg, first copy, then update
+code to read from new location, then drop old columns and indices). Perhaps for
+very hot/popular columns (like DOI, and maybe ISBN13?) it is better to have
+separate columns/indices.
+
+It would be possible to roll this out as a SQL change first, with identifiers
+still at the top level of the API schema, then later switch the API schema. Not
+sure this is worth it though.
+
+## New API
+
+All identifers as text
+
+    release_entity
+        ext_ids (required)
+            doi
+            pmid
+            pmcid
+            wikidata_qid
+            core
+            isbn13
+            arxiv
+            jstor
+            [...]
+
+## New SQL Schema
+
+Something like:
+
+    release_rev_extid (
+        release_rev_id  UUID foreign key to release_rev.id
+        extid_type      TEXT
+        extid_value     TEXT
+    )
+    PRIMARY KEY (release_rev_id, extid_type)
+    INDEX (extid_type, extid_value)
+
+### Existing prod Column Use
+
+Queries like:
+
+    fatcat_prod=# SELECT COUNT(*) FROM release_rev WHERE wikidata_qid IS NOT NULL;
+    13460723
+
+Results:
+
+    wikidata_qid: 13460723
+    isbn13: 1
+    core_id: 7160477
+    arxiv_id: 3
+    jstor_id: 0
+    pmcid: 3688945
+
+Keep in SQL:
+
+- `doi`
+- `pmid`
+- `pmcid`
+- `wikidata_qid`
+- `core_id`
+
+Drop columns:
+
+- `isbn13`
+- `arxiv_id`
+- `jstor_id`
+
+In new table:
+
+- isbn13
+- arxiv
+- jstor
+- mag
+- ark
+- dblp
author	Bryan Newbold <bnewbold@robocracy.org>	2019-05-14 12:03:24 -0700
committer	Bryan Newbold <bnewbold@robocracy.org>	2019-05-14 12:03:24 -0700
commit	8da0e3aeb6cb90a489c6a27b50e7f79e539e5a38 (patch)
tree	6bc8cb61faf0b8b660ba8dc00919f6bd8cb5527c
parent	e63e21495c5e63f9fce3a0204f178d104c46124e (diff)
download	fatcat-8da0e3aeb6cb90a489c6a27b50e7f79e539e5a38.tar.gz fatcat-8da0e3aeb6cb90a489c6a27b50e7f79e539e5a38.zip