From 3a57c35ddcf794d7211d1649e74a9917bd1c9495 Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@robocracy.org>
Date: Fri, 3 Jan 2020 16:05:07 -0800
Subject: proposals: standardize a bit

---
 proposals/20190509_schema_tweaks.md             | 142 -----------------------
 proposals/20190509_v03_schema_tweaks.md         | 144 ++++++++++++++++++++++++
 proposals/20190510_editgroup_endpoint_prefix.md |   2 +
 proposals/20190510_release_ext_ids.md           |   2 +
 proposals/20190514_fatcat_identifiers.md        |   2 +
 proposals/20190911_search_query_parsing.md      |   8 +-
 proposals/20190911_v04_schema_tweaks.md         |   4 +-
 proposals/20191018_bigger_db.md                 |   4 +
 proposals/20200103_py37_refactors.md            | 101 +++++++++++++++++
 proposals/2020_py37_refactors.md                | 101 -----------------
 proposals/README.md                             |  11 ++
 11 files changed, 276 insertions(+), 245 deletions(-)
 delete mode 100644 proposals/20190509_schema_tweaks.md
 create mode 100644 proposals/20190509_v03_schema_tweaks.md
 create mode 100644 proposals/20200103_py37_refactors.md
 delete mode 100644 proposals/2020_py37_refactors.md
 create mode 100644 proposals/README.md

(limited to 'proposals')

diff --git a/proposals/20190509_schema_tweaks.md b/proposals/20190509_schema_tweaks.md
deleted file mode 100644
index 7e372959..00000000
--- a/proposals/20190509_schema_tweaks.md
+++ /dev/null
@@ -1,142 +0,0 @@
-
-# SQL (and API) schema changes
-
-Intend to make these changes at the same time as bumping OpenAPI schema from
-0.2 to 0.3, along with `20190510_editgroup_endpoint_prefix` and
-`20190510_release_ext_ids`.
-
-Also adding some indices to speed up entity edit history views, but those are
-just a performance change, not visible in API schema.
-
-### Structured Contrib Names
-
-`creator` entities already have "structured" names: in addition to
-`display_name`, there are `given_name` and `surname` fields. This change is to
-add these two fields to release contribs as well (to join `raw_name`).
-
-The two main motivations are:
-
-1. make various representations (eg, citation formats) of release entities
-   easier. CSL and many display formats require given/surname distinctions
-2. improve algorithmic matching between release entities, raw metadata (eg,
-   from GROBID), and citation strings. Eg, biblio-glutton wants "first author
-   surname"; we can't provide this from existing `raw_name` field
-
-The status quo is that many large metadata sources often include structured
-names, and we munge them into a single name.
-
-Some arguments against this change are:
-
-1. should be "normalizing" this structure into creator entities. However,
-   display/representation of a contributor might change between publications
-2. structure isn't always deterministic from what is visible in published
-   documents. AKA, raw name is unambiguous (it's what is "printed" on the
-   document), but given/sur decomposition can be ambiguous (for individauls, or
-   entire locales/cultures)
-3. could just stash in contrib `extra_json`. However, seems common enough to
-   include as full fields
-
-Questions/Decisions:
-
-- should contrib `raw_name` be changed to `display_name` for consistency with
-  `creator`? `raw_name` should probably always be what is in/on the document
-  itself, thus no.
-- should we still munge a `raw_name` at insert time (we we only have structured
-  names), or push this on to client code to always create something for
-  display?
-
-### Rename `release_status` to `release_stage`
-
-Describes the field better. I think this is uncontroversial and not too
-disruptive at this point.
-
-### New release fields: subtitle, number, version
-
-`subtitle`: mostly for books. could have a flat-out style guide policy against
-use for articles? Already frequently add subtitle metadata as an `extra_json`
-field.
-
-`number`: intended to represent, eg, a report number ("RFC ..."). Not to be
-confused with `container-number`, `chapter`, `edition`
-
-`version`: intended to be a short string ("v3", "2", "third", "3.9") to
-disambiguate which among multiple versions. CSL has a separate `edition` field.
-
-These are somewhat hard to justify as dedicated fields vs. `extra_json`.
-
-`subtitle` is a pretty core field for book metadata, but raises ambiguity for
-other release types.
-
-Excited to include many reports and memos (as grey lit), for which the number
-is a pretty major field, and we probably want to include in elasticsearch but
-not as part of the title field, and someday perhaps an index on `number`, so
-that's easier to justify.
-
-TODO:
-
-- `version` maybe should be dropped. arXiv is one possible justification, as is
-  sorting by this field in display.
-
-### Withdrawn fields
-
-As part of a plan to represent retractions and other "unpublishing", decided to
-track when and whether a release has been "withdrawn", distinct from the
-`release_stage`.
-
-To motivate this, consider a work that has been retracted. There are multiple
-releases of different stages; should not set the `release_stage` for all to
-`withdrawn` or `retracted`, because then hard to disambiguate between the
-release entities. Also maybe the pre-print hasn't been formally withdrawn and
-is still in the pre-print server, or maybe only the pre-print was withdrawn
-(for being partial/incorrect?) while the final version is still "active".
-
-As with `release_date`, just `withdrawn_date` is insufficient, so we get
-`withdrawn_year` also...  and `withdrawn_month` in the future? Also
-`withdrawn_state` for cases where we don't know even the year. This could
-probably be a bool (`is_withdrawn` or `withdrawn`), but the flexibility of a
-TEXT/ENUM has been nice.
-
-TODO:
-
-- boolean (`is_withdrawn`, default False) or text (`withdrawn_status`). Let's
-  keep text to allow evolution in the future; if the field is defined at all
-  it's "withdrawn" (true), if not it isn't
-
-### New release extids: `mag_id`, `ark_id`
-
-See also: `20190510_release_ext_ids`.
-
-- `mag_id`: Microsoft Academic Graph identifier.
-- `ark_id`: ARK identifier.
-
-These will likely be the last identifiers added as fields on `release`; a
-future two-stage refactor will be to move these out to a child table (something
-like `extid_type`, `extid_value`, with a UNIQ index for lookups).
-
-Perhaps the `extid` table should be implemented now, starting with these
-identifiers?
-
-### Web Capture CDX `size_bytes`
-
-Pretty straight-forward. 
-
-Considered adding `extra_json` as well, to be consistent with other tables, but
-feels too heavy for the CDX case. Can add later if there is an actual need;
-adding fields easier than removing (for backwards compat).
-
-### Object/Class Name Changes
-
-TODO
-
-### Rust/Python Library Name Changes
-
-Do these as separate commits, after merging back in to master, for v0.3:
-
-- rust `fatcat-api-spec` => `fatcat-openapi`
-- python `fatcat_client` => `fatcat_openapi_client`
-
-### More?
-
-`release_month`: apprently pretty common to know the year and month but not
-date. I have avoided so far, seems like unnecessary complexity. Could start
-as an `extra_json` field?
diff --git a/proposals/20190509_v03_schema_tweaks.md b/proposals/20190509_v03_schema_tweaks.md
new file mode 100644
index 00000000..150ce525
--- /dev/null
+++ b/proposals/20190509_v03_schema_tweaks.md
@@ -0,0 +1,144 @@
+
+Status: implemented
+
+# SQL (and API) schema changes
+
+Intend to make these changes at the same time as bumping OpenAPI schema from
+0.2 to 0.3, along with `20190510_editgroup_endpoint_prefix` and
+`20190510_release_ext_ids`.
+
+Also adding some indices to speed up entity edit history views, but those are
+just a performance change, not visible in API schema.
+
+### Structured Contrib Names
+
+`creator` entities already have "structured" names: in addition to
+`display_name`, there are `given_name` and `surname` fields. This change is to
+add these two fields to release contribs as well (to join `raw_name`).
+
+The two main motivations are:
+
+1. make various representations (eg, citation formats) of release entities
+   easier. CSL and many display formats require given/surname distinctions
+2. improve algorithmic matching between release entities, raw metadata (eg,
+   from GROBID), and citation strings. Eg, biblio-glutton wants "first author
+   surname"; we can't provide this from existing `raw_name` field
+
+The status quo is that many large metadata sources often include structured
+names, and we munge them into a single name.
+
+Some arguments against this change are:
+
+1. should be "normalizing" this structure into creator entities. However,
+   display/representation of a contributor might change between publications
+2. structure isn't always deterministic from what is visible in published
+   documents. AKA, raw name is unambiguous (it's what is "printed" on the
+   document), but given/sur decomposition can be ambiguous (for individuals, or
+   entire locales/cultures)
+3. could just stash in contrib `extra_json`. However, seems common enough to
+   include as full fields
+
+Questions/Decisions:
+
+- should contrib `raw_name` be changed to `display_name` for consistency with
+  `creator`? `raw_name` should probably always be what is in/on the document
+  itself, thus no.
+- should we still munge a `raw_name` at insert time (if we only have structured
+  names), or push this on to client code to always create something for
+  display?
+
+### Rename `release_status` to `release_stage`
+
+Describes the field better. I think this is uncontroversial and not too
+disruptive at this point.
+
+### New release fields: subtitle, number, version
+
+`subtitle`: mostly for books. could have a flat-out style guide policy against
+use for articles? Already frequently add subtitle metadata as an `extra_json`
+field.
+
+`number`: intended to represent, eg, a report number ("RFC ..."). Not to be
+confused with `container-number`, `chapter`, `edition`
+
+`version`: intended to be a short string ("v3", "2", "third", "3.9") to
+disambiguate which among multiple versions. CSL has a separate `edition` field.
+
+These are somewhat hard to justify as dedicated fields vs. `extra_json`.
+
+`subtitle` is a pretty core field for book metadata, but raises ambiguity for
+other release types.
+
+Excited to include many reports and memos (as grey lit), for which the number
+is a pretty major field, and we probably want to include in elasticsearch but
+not as part of the title field, and someday perhaps an index on `number`, so
+that's easier to justify.
+
+TODO:
+
+- `version` maybe should be dropped. arXiv is one possible justification, as is
+  sorting by this field in display.
+
+### Withdrawn fields
+
+As part of a plan to represent retractions and other "unpublishing", decided to
+track when and whether a release has been "withdrawn", distinct from the
+`release_stage`.
+
+To motivate this, consider a work that has been retracted. There are multiple
+releases of different stages; should not set the `release_stage` for all to
+`withdrawn` or `retracted`, because then hard to disambiguate between the
+release entities. Also maybe the pre-print hasn't been formally withdrawn and
+is still in the pre-print server, or maybe only the pre-print was withdrawn
+(for being partial/incorrect?) while the final version is still "active".
+
+As with `release_date`, just `withdrawn_date` is insufficient, so we get
+`withdrawn_year` also...  and `withdrawn_month` in the future? Also
+`withdrawn_state` for cases where we don't know even the year. This could
+probably be a bool (`is_withdrawn` or `withdrawn`), but the flexibility of a
+TEXT/ENUM has been nice.
+
+TODO:
+
+- boolean (`is_withdrawn`, default False) or text (`withdrawn_status`). Let's
+  keep text to allow evolution in the future; if the field is defined at all
+  it's "withdrawn" (true), if not it isn't
+
+### New release extids: `mag_id`, `ark_id`
+
+See also: `20190510_release_ext_ids`.
+
+- `mag_id`: Microsoft Academic Graph identifier.
+- `ark_id`: ARK identifier.
+
+These will likely be the last identifiers added as fields on `release`; a
+future two-stage refactor will be to move these out to a child table (something
+like `extid_type`, `extid_value`, with a UNIQ index for lookups).
+
+Perhaps the `extid` table should be implemented now, starting with these
+identifiers?
+
+### Web Capture CDX `size_bytes`
+
+Pretty straight-forward. 
+
+Considered adding `extra_json` as well, to be consistent with other tables, but
+feels too heavy for the CDX case. Can add later if there is an actual need;
+adding fields easier than removing (for backwards compat).
+
+### Object/Class Name Changes
+
+TODO
+
+### Rust/Python Library Name Changes
+
+Do these as separate commits, after merging back in to master, for v0.3:
+
+- rust `fatcat-api-spec` => `fatcat-openapi`
+- python `fatcat_client` => `fatcat_openapi_client`
+
+### More?
+
+`release_month`: apparently pretty common to know the year and month but not
+date. I have avoided so far, seems like unnecessary complexity. Could start
+as an `extra_json` field? NOT IMPLEMENTED
diff --git a/proposals/20190510_editgroup_endpoint_prefix.md b/proposals/20190510_editgroup_endpoint_prefix.md
index f517383b..6794266e 100644
--- a/proposals/20190510_editgroup_endpoint_prefix.md
+++ b/proposals/20190510_editgroup_endpoint_prefix.md
@@ -1,4 +1,6 @@
 
+Status: implemented
+
 # Editgroup API Endpoint Prefixes
 
 In summary, change the API URL design such that entity mutations (create,
diff --git a/proposals/20190510_release_ext_ids.md b/proposals/20190510_release_ext_ids.md
index 1d2b912a..8953448c 100644
--- a/proposals/20190510_release_ext_ids.md
+++ b/proposals/20190510_release_ext_ids.md
@@ -1,4 +1,6 @@
 
+Status: implemented
+
 # Release External ID Refactor
 
 Goal is to make the external identifier "namespace" (number of external
diff --git a/proposals/20190514_fatcat_identifiers.md b/proposals/20190514_fatcat_identifiers.md
index 941775e3..325e48f5 100644
--- a/proposals/20190514_fatcat_identifiers.md
+++ b/proposals/20190514_fatcat_identifiers.md
@@ -1,4 +1,6 @@
 
+Status: brainstorm
+
 Fatcat Identifiers
 =======================
 
diff --git a/proposals/20190911_search_query_parsing.md b/proposals/20190911_search_query_parsing.md
index 1e656fef..f1fb0128 100644
--- a/proposals/20190911_search_query_parsing.md
+++ b/proposals/20190911_search_query_parsing.md
@@ -1,5 +1,7 @@
 
-status: work-in-progress
+Status: brainstorm
+
+## Search Query Parsing
 
 The default "release" search on fatcat.wiki currently uses the elasticsearch
 built-in `query_string` parser, which is explicitly not recommended for
@@ -20,3 +22,7 @@ A couple search issues this would help with:
 
 In the near future, we may also create a fulltext search index, which will have
 it's own issues.
+
+## Tech Changes
+
+If we haven't already, should also switch to using the elasticsearch client library.
diff --git a/proposals/20190911_v04_schema_tweaks.md b/proposals/20190911_v04_schema_tweaks.md
index ce885b95..eaf39474 100644
--- a/proposals/20190911_v04_schema_tweaks.md
+++ b/proposals/20190911_v04_schema_tweaks.md
@@ -1,5 +1,7 @@
 
-status: work-in-progress
+Status: planned
+
+## Schema Changes for v0.4 Release
 
 Proposed schema changes for next fatcat iteration (v0.4? v0.5?).
 
diff --git a/proposals/20191018_bigger_db.md b/proposals/20191018_bigger_db.md
index cd5f6e7b..7a5216d0 100644
--- a/proposals/20191018_bigger_db.md
+++ b/proposals/20191018_bigger_db.md
@@ -1,4 +1,8 @@
 
+Status: brainstorm
+
+## Catalog Database Scaling
+
 How can we scale the fatcat backend to support:
 
 - one billion release entities
diff --git a/proposals/20200103_py37_refactors.md b/proposals/20200103_py37_refactors.md
new file mode 100644
index 00000000..f0321b33
--- /dev/null
+++ b/proposals/20200103_py37_refactors.md
@@ -0,0 +1,101 @@
+
+status: planning
+
+If we update fatcat python code to python3.7, what code refactoring changes can
+we make? We currently use/require python3.5.
+
+Nice features in python3 I know of are:
+
+- dataclasses (python3.7)
+- async/await (mature in python3.7?)
+- type annotations (python3.5)
+- format strings (python3.6)
+- walrus assignment (python3.8)
+
+Not sure if the walrus operator is worth jumping all the way to python3.8.
+
+While we might be at it, what other superficial factorings might we want to do?
+
+- strict lint style (eg, maximum column width) with `black` (python3.6)
+- logging/debugging/verbose
+- type annotations and checking
+- use named dicts or structs in place of dicts
+
+## Linux Distro Support
+
+The default python versions shipped by current and planned linux releases are:
+
+- ubuntu xenial 16.04 LTS:  python3.5
+- ubuntu bionic 18.04 LTS:  python3.6
+- ubuntu focal  20.04 LTS:  python3.8 (planned)
+- debian buster 10 2019:    python3.7
+
+Python 3.7 is the default in debian buster (10).
+
+There are apt PPA package repositories that allow backporting newer pythons to
+older releases. As far as I know this is safe and doesn't override any system
+usage if we are careful not to set the defaults (aka, `python3` command should
+be the older version unless inside a virtualenv).
+
+It would also be possible to use `pyenv` to have `virtualenv`s with custom
+python versions. We should probably do that for OS X and/or windows support if
+we wanted those. But having a system package is probably a lot faster to
+install.
+
+## Dataclasses
+
+`dataclasses` are a user-friendly way to create struct-like objects. They are
+pretty similar to the existing `namedtuple`, but can be mutable and have
+methods attached to them (they are just classes), plus several other usability
+improvements.
+
+Most places we are throwing around dicts with structure we could be using
+dataclasses instead. There are some instances of this in fatcat, but many more
+in sandcrawler.
+
+## Async/Await
+
+Where might we actually use async/await? I think more in sandcrawler than in
+the python tools or web apps. The GROBID, ingest, and ML workers in particular
+should be async over batches, as should all fetches from CDX/wayback.
+
+Some of the kafka workers *could* be async, but I'm not sure how much speedup
+there would actually be. For example, the entity updates worker could fetch
+entities for an editgroup concurrently.
+
+Inserts (importers) should probably mostly happen serially, at least the kafka
+importers, one editgroup at a time, so progress is correctly recorded in kafka.
+Parallelization should probably happen at the partition level; would need to
+think through whether async would actually help with code simplicity vs. thread
+or process parallelization.
+
+## Type Annotations
+
+The meta-goals of (gradual) type annotations would be catching more bugs at
+development time, and having code be more self-documenting and easier to
+understand.
+
+The two big wins I see with type annotation would be having annotations
+auto-generated for the openapi classes and API calls, and to make string
+munging in importer code less buggy.
+
+## Format Strings
+
+Eg, replace code like:
+
+    "There are {} out of {} objects".format(found, total)
+
+With:
+
+    f"There are {found} out of {total} objects"
+
+## Walrus Operator
+
+New operator allows checking and assignment together:
+
+    if (n := len(a)) > 10:
+        print(f"List is too long ({n} elements, expected <= 10)")
+
+I feel like we would actually use this pattern *a ton* in importer code, where
+we do a lot of lookups or cleaning then check if we got a `None`.
+
diff --git a/proposals/2020_py37_refactors.md b/proposals/2020_py37_refactors.md
deleted file mode 100644
index f0321b33..00000000
--- a/proposals/2020_py37_refactors.md
+++ /dev/null
@@ -1,101 +0,0 @@
-
-status: planning
-
-If we update fatcat python code to python3.7, what code refactoring changes can
-we make? We currently use/require python3.5.
-
-Nice features in python3 I know of are:
-
-- dataclasses (python3.7)
-- async/await (mature in python3.7?)
-- type annotations (python3.5)
-- format strings (python3.6)
-- walrus assignment (python3.8)
-
-Not sure if the walrus operator is worth jumping all the way to python3.8.
-
-While we might be at it, what other superficial factorings might we want to do?
-
-- strict lint style (eg, maximum column width) with `black` (python3.6)
-- logging/debugging/verbose
-- type annotations and checking
-- use named dicts or structs in place of dicts
-
-## Linux Distro Support
-
-The default python version shipped by current and planned linux releases are:
-
-- ubuntu xenial 16.04 LTS:  python3.5
-- ubuntu bionic 18.04 LTS:  python3.6
-- ubuntu focal  20.04 LTS:  python3.8 (planned)
-- debian buster 10 2019:    python3.7
-
-Python 3.7 is the default in debian buster (10).
-
-There are apt PPA package repositories that allow backporting newer pythons to
-older releases. As far as I know this is safe and doesn't override any system
-usage if we are careful not to set the defaults (aka, `python3` command should
-be the older version unless inside a virtualenv).
-
-It would also be possible to use `pyenv` to have `virtualenv`s with custom
-python versions. We should probably do that for OS X and/or windows support if
-we wanted those. But having a system package is probably a lot faster to
-install.
-
-## Dataclasses
-
-`dataclasses` are a user-friendly way to create struct-like objects. They are
-pretty similar to the existing `namedtuple`, but can be mutable and have
-methods attached to them (they are just classes), plus several other usability
-improvements.
-
-Most places we are throwing around dicts with structure we could be using
-dataclasses instead. There are some instances of this in fatcat, but many more
-in sandcrawler.
-
-## Async/Await
-
-Where might we actually use async/await? I think more in sandcrawler than in
-the python tools or web apps. The GROBID, ingest, and ML workers in particular
-should be async over batches, as should all fetches from CDX/wayback.
-
-Some of the kafka workers *could* be aync, but i'm not sure how much speedup
-there would actually be. For example, the entity updates worker could fetch
-entities for an editgroup concurrently.
-
-Inserts (importers) should probably mostly happen serially, at least the kafka
-importers, one editgroup at a time, so progress is correctly recorded in kafka.
-Parallelization should probably happen at the partition level; would need to
-think through whether async would actually help with code simplicity vs. thread
-or process parallelization.
-
-## Type Annotations
-
-The meta-goals of (gradual) type annotations would be catching more bugs at
-development time, and having code be more self-documenting and easier to
-understand.
-
-The two big wins I see with type annotation would be having annotations
-auto-generated for the openapi classes and API calls, and to make string
-munging in importer code less buggy.
-
-## Format Strings
-
-Eg, replace code like:
-
-    "There are {} out of {} objects".format(found, total)
-
-With:
-
-    f"There are {found} out of {total} objects"
-
-## Walrus Operator
-
-New operator allows checking and assignment together:
-
-    if (n := len(a)) > 10:
-        print(f"List is too long ({n} elements, expected <= 10)")
-
-I feel like we would actually use this pattern *a ton* in importer code, where
-we do a lot of lookups or cleaning then check if we got a `None`.
-
diff --git a/proposals/README.md b/proposals/README.md
new file mode 100644
index 00000000..5e6747b1
--- /dev/null
+++ b/proposals/README.md
@@ -0,0 +1,11 @@
+
+This folder contains proposals for larger changes to the fatcat system. These
+might be schema changes, new projects, technical details, etc: any change which
+is large enough to require planning and documentation.
+
+Each should be tagged with a date first drafted, and labeled with a status:
+
+- brainstorm: just putting ideas down; might not even happen
+- planned: committed to happening, but not started yet
+- work-in-progress: currently being worked on
+- implemented: completed, merged to master/production/live
-- 
cgit v1.2.3