From 3a57c35ddcf794d7211d1649e74a9917bd1c9495 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Fri, 3 Jan 2020 16:05:07 -0800
Subject: proposals: standardize a bit

---
 proposals/20190509_schema_tweaks.md             | 142 -----------------------
 proposals/20190509_v03_schema_tweaks.md         | 144 ++++++++++++++++++++++++
 proposals/20190510_editgroup_endpoint_prefix.md |   2 +
 proposals/20190510_release_ext_ids.md           |   2 +
 proposals/20190514_fatcat_identifiers.md        |   2 +
 proposals/20190911_search_query_parsing.md      |   8 +-
 proposals/20190911_v04_schema_tweaks.md         |   4 +-
 proposals/20191018_bigger_db.md                 |   4 +
 proposals/20200103_py37_refactors.md            | 101 +++++++++++++++++
 proposals/2020_py37_refactors.md                | 101 -----------------
 proposals/README.md                             |  11 ++
 11 files changed, 276 insertions(+), 245 deletions(-)
 delete mode 100644 proposals/20190509_schema_tweaks.md
 create mode 100644 proposals/20190509_v03_schema_tweaks.md
 create mode 100644 proposals/20200103_py37_refactors.md
 delete mode 100644 proposals/2020_py37_refactors.md
 create mode 100644 proposals/README.md

diff --git a/proposals/20190509_schema_tweaks.md b/proposals/20190509_schema_tweaks.md
deleted file mode 100644
index 7e372959..00000000
--- a/proposals/20190509_schema_tweaks.md
+++ /dev/null
@@ -1,142 +0,0 @@
-
-# SQL (and API) schema changes
-
-Intend to make these changes at the same time as bumping OpenAPI schema from
-0.2 to 0.3, along with `20190510_editgroup_endpoint_prefix` and
-`20190510_release_ext_ids`.
-
-Also adding some indices to speed up entity edit history views, but those are
-just a performance change, not visible in API schema.
-
-### Structured Contrib Names
-
-`creator` entities already have "structured" names: in addition to
-`display_name`, there are `given_name` and `surname` fields. This change is to
-add these two fields to release contribs as well (to join `raw_name`).
-
-The two main motivations are:
-
-1. make various representations (eg, citation formats) of release entities
-   easier. CSL and many display formats require given/surname distinctions
-2. improve algorithmic matching between release entities, raw metadata (eg,
-   from GROBID), and citation strings. Eg, biblio-glutton wants "first author
-   surname"; we can't provide this from existing `raw_name` field
-
-The status quo is that many large metadata sources often include structured
-names, and we munge them into a single name.
-
-Some arguments against this change are:
-
-1. should be "normalizing" this structure into creator entities. However,
-   display/representation of a contributor might change between publications
-2. structure isn't always deterministic from what is visible in published
-   documents. AKA, raw name is unambiguous (it's what is "printed" on the
-   document), but given/sur decomposition can be ambiguous (for individauls, or
-   entire locales/cultures)
-3. could just stash in contrib `extra_json`. However, seems common enough to
-   include as full fields
-
-Questions/Decisions:
-
-- should contrib `raw_name` be changed to `display_name` for consistency with
-  `creator`? `raw_name` should probably always be what is in/on the document
-  itself, thus no.
-- should we still munge a `raw_name` at insert time (we we only have structured
-  names), or push this on to client code to always create something for
-  display?
-
-### Rename `release_status` to `release_stage`
-
-Describes the field better. I think this is uncontroversial and not too
-disruptive at this point.
-
-### New release fields: subtitle, number, version
-
-`subtitle`: mostly for books. could have a flat-out style guide policy against
-use for articles? Already frequently add subtitle metadata as an `extra_json`
-field.
-
-`number`: intended to represent, eg, a report number ("RFC ..."). Not to be
-confused with `container-number`, `chapter`, `edition`
-
-`version`: intended to be a short string ("v3", "2", "third", "3.9") to
-disambiguate which among multiple versions. CSL has a separate `edition` field.
-
-These are somewhat hard to justify as dedicated fields vs. `extra_json`.
-
-`subtitle` is a pretty core field for book metadata, but raises ambiguity for
-other release types.
-
-Excited to include many reports and memos (as grey lit), for which the number
-is a pretty major field, and we probably want to include in elasticsearch but
-not as part of the title field, and someday perhaps an index on `number`, so
-that's easier to justify.
-
-TODO:
-
-- `version` maybe should be dropped. arXiv is one possible justification, as is
-  sorting by this field in display.
-
-### Withdrawn fields
-
-As part of a plan to represent retractions and other "unpublishing", decided to
-track when and whether a release has been "withdrawn", distinct from the
-`release_stage`.
-
-To motivate this, consider a work that has been retracted. There are multiple
-releases of different stages; should not set the `release_stage` for all to
-`withdrawn` or `retracted`, because then hard to disambiguate between the
-release entities. Also maybe the pre-print hasn't been formally withdrawn and
-is still in the pre-print server, or maybe only the pre-print was withdrawn
-(for being partial/incorrect?) while the final version is still "active".
-
-As with `release_date`, just `withdrawn_date` is insufficient, so we get
-`withdrawn_year` also... and `withdrawn_month` in the future? Also
-`withdrawn_state` for cases where we don't know even the year. This could
-probably be a bool (`is_withdrawn` or `withdrawn`), but the flexibility of a
-TEXT/ENUM has been nice.
-
-TODO:
-
-- boolean (`is_withdrawn`, default False) or text (`withdrawn_status`). Let's
-  keep text to allow evolution in the future; if the field is defined at all
-  it's "withdrawn" (true), if not it isn't
-
-### New release extids: `mag_id`, `ark_id`
-
-See also: `20190510_release_ext_ids`.
-
-- `mag_id`: Microsoft Academic Graph identifier.
-- `ark_id`: ARK identifier.
-
-These will likely be the last identifiers added as fields on `release`; a
-future two-stage refactor will be to move these out to a child table (something
-like `extid_type`, `extid_value`, with a UNIQ index for lookups).
-
-Perhaps the `extid` table should be implemented now, starting with these
-identifiers?
-
-### Web Capture CDX `size_bytes`
-
-Pretty straight-forward.
-
-Considered adding `extra_json` as well, to be consistent with other tables, but
-feels too heavy for the CDX case. Can add later if there is an actual need;
-adding fields easier than removing (for backwards compat).
-
-### Object/Class Name Changes
-
-TODO
-
-### Rust/Python Library Name Changes
-
-Do these as separate commits, after merging back in to master, for v0.3:
-
-- rust `fatcat-api-spec` => `fatcat-openapi`
-- python `fatcat_client` => `fatcat_openapi_client`
-
-### More?
-
-`release_month`: apprently pretty common to know the year and month but not
-date. I have avoided so far, seems like unnecessary complexity. Could start
-as an `extra_json` field?
diff --git a/proposals/20190509_v03_schema_tweaks.md b/proposals/20190509_v03_schema_tweaks.md
new file mode 100644
index 00000000..150ce525
--- /dev/null
+++ b/proposals/20190509_v03_schema_tweaks.md
@@ -0,0 +1,144 @@
+
+Status: implemented
+
+# SQL (and API) schema changes
+
+Intend to make these changes at the same time as bumping OpenAPI schema from
+0.2 to 0.3, along with `20190510_editgroup_endpoint_prefix` and
+`20190510_release_ext_ids`.
+
+Also adding some indices to speed up entity edit history views, but those are
+just a performance change, not visible in API schema.
+
+### Structured Contrib Names
+
+`creator` entities already have "structured" names: in addition to
+`display_name`, there are `given_name` and `surname` fields. This change is to
+add these two fields to release contribs as well (to join `raw_name`).
+
+The two main motivations are:
+
+1. make various representations (eg, citation formats) of release entities
+   easier. CSL and many display formats require given/surname distinctions
+2. improve algorithmic matching between release entities, raw metadata (eg,
+   from GROBID), and citation strings. Eg, biblio-glutton wants "first author
+   surname"; we can't provide this from existing `raw_name` field
+
+The status quo is that many large metadata sources often include structured
+names, and we munge them into a single name.
+
+Some arguments against this change are:
+
+1. should be "normalizing" this structure into creator entities. However,
+   display/representation of a contributor might change between publications
+2. structure isn't always deterministic from what is visible in published
+   documents. AKA, raw name is unambiguous (it's what is "printed" on the
+   document), but given/sur decomposition can be ambiguous (for individauls, or
+   entire locales/cultures)
+3. could just stash in contrib `extra_json`. However, seems common enough to
+   include as full fields
+
+Questions/Decisions:
+
+- should contrib `raw_name` be changed to `display_name` for consistency with
+  `creator`? `raw_name` should probably always be what is in/on the document
+  itself, thus no.
+- should we still munge a `raw_name` at insert time (we we only have structured
+  names), or push this on to client code to always create something for
+  display?
+
+### Rename `release_status` to `release_stage`
+
+Describes the field better. I think this is uncontroversial and not too
+disruptive at this point.
+
+### New release fields: subtitle, number, version
+
+`subtitle`: mostly for books. could have a flat-out style guide policy against
+use for articles? Already frequently add subtitle metadata as an `extra_json`
+field.
+
+`number`: intended to represent, eg, a report number ("RFC ..."). Not to be
+confused with `container-number`, `chapter`, `edition`
+
+`version`: intended to be a short string ("v3", "2", "third", "3.9") to
+disambiguate which among multiple versions. CSL has a separate `edition` field.
+
+These are somewhat hard to justify as dedicated fields vs. `extra_json`.
+
+`subtitle` is a pretty core field for book metadata, but raises ambiguity for
+other release types.
+
+Excited to include many reports and memos (as grey lit), for which the number
+is a pretty major field, and we probably want to include in elasticsearch but
+not as part of the title field, and someday perhaps an index on `number`, so
+that's easier to justify.
+
+TODO:
+
+- `version` maybe should be dropped. arXiv is one possible justification, as is
+  sorting by this field in display.
+
+### Withdrawn fields
+
+As part of a plan to represent retractions and other "unpublishing", decided to
+track when and whether a release has been "withdrawn", distinct from the
+`release_stage`.
+
+To motivate this, consider a work that has been retracted. There are multiple
+releases of different stages; should not set the `release_stage` for all to
+`withdrawn` or `retracted`, because then hard to disambiguate between the
+release entities. Also maybe the pre-print hasn't been formally withdrawn and
+is still in the pre-print server, or maybe only the pre-print was withdrawn
+(for being partial/incorrect?) while the final version is still "active".
+
+As with `release_date`, just `withdrawn_date` is insufficient, so we get
+`withdrawn_year` also... and `withdrawn_month` in the future? Also
+`withdrawn_state` for cases where we don't know even the year. This could
+probably be a bool (`is_withdrawn` or `withdrawn`), but the flexibility of a
+TEXT/ENUM has been nice.
+
+TODO:
+
+- boolean (`is_withdrawn`, default False) or text (`withdrawn_status`). Let's
+  keep text to allow evolution in the future; if the field is defined at all
+  it's "withdrawn" (true), if not it isn't
+
+### New release extids: `mag_id`, `ark_id`
+
+See also: `20190510_release_ext_ids`.
+
+- `mag_id`: Microsoft Academic Graph identifier.
+- `ark_id`: ARK identifier.
+
+These will likely be the last identifiers added as fields on `release`; a
+future two-stage refactor will be to move these out to a child table (something
+like `extid_type`, `extid_value`, with a UNIQ index for lookups).
+
+Perhaps the `extid` table should be implemented now, starting with these
+identifiers?
+
+### Web Capture CDX `size_bytes`
+
+Pretty straight-forward.
+
+Considered adding `extra_json` as well, to be consistent with other tables, but
+feels too heavy for the CDX case. Can add later if there is an actual need;
+adding fields easier than removing (for backwards compat).
+
+### Object/Class Name Changes
+
+TODO
+
+### Rust/Python Library Name Changes
+
+Do these as separate commits, after merging back in to master, for v0.3:
+
+- rust `fatcat-api-spec` => `fatcat-openapi`
+- python `fatcat_client` => `fatcat_openapi_client`
+
+### More?
+
+`release_month`: apprently pretty common to know the year and month but not
+date. I have avoided so far, seems like unnecessary complexity. Could start
+as an `extra_json` field? NOT IMPLEMENTED
diff --git a/proposals/20190510_editgroup_endpoint_prefix.md b/proposals/20190510_editgroup_endpoint_prefix.md
index f517383b..6794266e 100644
--- a/proposals/20190510_editgroup_endpoint_prefix.md
+++ b/proposals/20190510_editgroup_endpoint_prefix.md
@@ -1,4 +1,6 @@
 
+Status: implemented
+
 # Editgroup API Endpoint Prefixes
 
 In summary, change the API URL design such that entity mutations (create,
diff --git a/proposals/20190510_release_ext_ids.md b/proposals/20190510_release_ext_ids.md
index 1d2b912a..8953448c 100644
--- a/proposals/20190510_release_ext_ids.md
+++ b/proposals/20190510_release_ext_ids.md
@@ -1,4 +1,6 @@
 
+Status: implemented
+
 # Release External ID Refactor
 
 Goal is to make the external identifier "namespace" (number of external
diff --git a/proposals/20190514_fatcat_identifiers.md b/proposals/20190514_fatcat_identifiers.md
index 941775e3..325e48f5 100644
--- a/proposals/20190514_fatcat_identifiers.md
+++ b/proposals/20190514_fatcat_identifiers.md
@@ -1,4 +1,6 @@
 
+Status: brainstorm
+
 Fatcat Identifiers
 =======================
 
diff --git a/proposals/20190911_search_query_parsing.md b/proposals/20190911_search_query_parsing.md
index 1e656fef..f1fb0128 100644
--- a/proposals/20190911_search_query_parsing.md
+++ b/proposals/20190911_search_query_parsing.md
@@ -1,5 +1,7 @@
 
-status: work-in-progress
+Status: brainstorm
+
+## Search Query Parsing
 
 The default "release" search on fatcat.wiki currently uses the elasticsearch
 built-in `query_string` parser, which is explicitly not recommended for
@@ -20,3 +22,7 @@ A couple search issues this would help with:
 
 In the near future, we may also create a fulltext search index, which will
 have it's own issues.
+
+## Tech Changes
+
+If we haven't already, should also switch to using elasticsearch client library.
diff --git a/proposals/20190911_v04_schema_tweaks.md b/proposals/20190911_v04_schema_tweaks.md
index ce885b95..eaf39474 100644
--- a/proposals/20190911_v04_schema_tweaks.md
+++ b/proposals/20190911_v04_schema_tweaks.md
@@ -1,5 +1,7 @@
 
-status: work-in-progress
+Status: planned
+
+## Schema Changes for v0.4 Release
 
 Proposed schema changes for next fatcat iteration (v0.4? v0.5?).
 
diff --git a/proposals/20191018_bigger_db.md b/proposals/20191018_bigger_db.md
index cd5f6e7b..7a5216d0 100644
--- a/proposals/20191018_bigger_db.md
+++ b/proposals/20191018_bigger_db.md
@@ -1,4 +1,8 @@
 
+Status: brainstorm
+
+## Catalog Database Scaling
+
 How can we scale the fatcat backend to support:
 
 - one billion release entities
diff --git a/proposals/20200103_py37_refactors.md b/proposals/20200103_py37_refactors.md
new file mode 100644
index 00000000..f0321b33
--- /dev/null
+++ b/proposals/20200103_py37_refactors.md
@@ -0,0 +1,101 @@
+
+status: planning
+
+If we update fatcat python code to python3.7, what code refactoring changes can
+we make? We currently use/require python3.5.
+
+Nice features in python3 I know of are:
+
+- dataclasses (python3.7)
+- async/await (mature in python3.7?)
+- type annotations (python3.5)
+- format strings (python3.6)
+- walrus assignment (python3.8)
+
+Not sure if the walrus operator is worth jumping all the way to python3.8.
+
+While we might be at it, what other superficial factorings might we want to do?
+
+- strict lint style (eg, maximum column width) with `black` (python3.6)
+- logging/debugging/verbose
+- type annotations and checking
+- use named dicts or structs in place of dicts
+
+## Linux Distro Support
+
+The default python version shipped by current and planned linux releases are:
+
+- ubuntu xenial 16.04 LTS: python3.5
+- ubuntu bionic 18.04 LTS: python3.6
+- ubuntu focal 20.04 LTS: python3.8 (planned)
+- debian buster 10 2019: python3.7
+
+Python 3.7 is the default in debian buster (10).
+
+There are apt PPA package repositories that allow backporting newer pythons to
+older releases. As far as I know this is safe and doesn't override any system
+usage if we are careful not to set the defaults (aka, `python3` command should
+be the older version unless inside a virtualenv).
+
+It would also be possible to use `pyenv` to have `virtualenv`s with custom
+python versions. We should probably do that for OS X and/or windows support if
+we wanted those. But having a system package is probably a lot faster to
+install.
+
+## Dataclasses
+
+`dataclasses` are a user-friendly way to create struct-like objects. They are
+pretty similar to the existing `namedtuple`, but can be mutable and have
+methods attached to them (they are just classes), plus several other usability
+improvements.
+
+Most places we are throwing around dicts with structure we could be using
+dataclasses instead. There are some instances of this in fatcat, but many more
+in sandcrawler.
+
+## Async/Await
+
+Where might we actually use async/await? I think more in sandcrawler than in
+the python tools or web apps. The GROBID, ingest, and ML workers in particular
+should be async over batches, as should all fetches from CDX/wayback.
+
+Some of the kafka workers *could* be aync, but i'm not sure how much speedup
+there would actually be. For example, the entity updates worker could fetch
+entities for an editgroup concurrently.
+
+Inserts (importers) should probably mostly happen serially, at least the kafka
+importers, one editgroup at a time, so progress is correctly recorded in kafka.
+Parallelization should probably happen at the partition level; would need to
+think through whether async would actually help with code simplicity vs. thread
+or process parallelization.
+
+## Type Annotations
+
+The meta-goals of (gradual) type annotations would be catching more bugs at
+development time, and having code be more self-documenting and easier to
+understand.
+
+The two big wins I see with type annotation would be having annotations
+auto-generated for the openapi classes and API calls, and to make string
+munging in importer code less buggy.
+
+## Format Strings
+
+Eg, replace code like:
+
+    "There are {} out of {} objects".format(found, total)
+
+With:
+
+    f"There are {found} out of {total} objects"
+
+## Walrus Operator
+
+New operator allows checking and assignment together:
+
+    if (n := len(a)) > 10:
+        print(f"List is too long ({n} elements, expected <= 10)")
+
+I feel like we would actually use this pattern *a ton* in importer code, where
+we do a lot of lookups or cleaning then check if we got a `None`.
+
diff --git a/proposals/2020_py37_refactors.md b/proposals/2020_py37_refactors.md
deleted file mode 100644
index f0321b33..00000000
--- a/proposals/2020_py37_refactors.md
+++ /dev/null
@@ -1,101 +0,0 @@
-
-status: planning
-
-If we update fatcat python code to python3.7, what code refactoring changes can
-we make? We currently use/require python3.5.
-
-Nice features in python3 I know of are:
-
-- dataclasses (python3.7)
-- async/await (mature in python3.7?)
-- type annotations (python3.5)
-- format strings (python3.6)
-- walrus assignment (python3.8)
-
-Not sure if the walrus operator is worth jumping all the way to python3.8.
-
-While we might be at it, what other superficial factorings might we want to do?
-
-- strict lint style (eg, maximum column width) with `black` (python3.6)
-- logging/debugging/verbose
-- type annotations and checking
-- use named dicts or structs in place of dicts
-
-## Linux Distro Support
-
-The default python version shipped by current and planned linux releases are:
-
-- ubuntu xenial 16.04 LTS: python3.5
-- ubuntu bionic 18.04 LTS: python3.6
-- ubuntu focal 20.04 LTS: python3.8 (planned)
-- debian buster 10 2019: python3.7
-
-Python 3.7 is the default in debian buster (10).
-
-There are apt PPA package repositories that allow backporting newer pythons to
-older releases. As far as I know this is safe and doesn't override any system
-usage if we are careful not to set the defaults (aka, `python3` command should
-be the older version unless inside a virtualenv).
-
-It would also be possible to use `pyenv` to have `virtualenv`s with custom
-python versions. We should probably do that for OS X and/or windows support if
-we wanted those. But having a system package is probably a lot faster to
-install.
-
-## Dataclasses
-
-`dataclasses` are a user-friendly way to create struct-like objects. They are
-pretty similar to the existing `namedtuple`, but can be mutable and have
-methods attached to them (they are just classes), plus several other usability
-improvements.
-
-Most places we are throwing around dicts with structure we could be using
-dataclasses instead. There are some instances of this in fatcat, but many more
-in sandcrawler.
-
-## Async/Await
-
-Where might we actually use async/await? I think more in sandcrawler than in
-the python tools or web apps. The GROBID, ingest, and ML workers in particular
-should be async over batches, as should all fetches from CDX/wayback.
-
-Some of the kafka workers *could* be aync, but i'm not sure how much speedup
-there would actually be. For example, the entity updates worker could fetch
-entities for an editgroup concurrently.
-
-Inserts (importers) should probably mostly happen serially, at least the kafka
-importers, one editgroup at a time, so progress is correctly recorded in kafka.
-Parallelization should probably happen at the partition level; would need to
-think through whether async would actually help with code simplicity vs. thread
-or process parallelization.
-
-## Type Annotations
-
-The meta-goals of (gradual) type annotations would be catching more bugs at
-development time, and having code be more self-documenting and easier to
-understand.
-
-The two big wins I see with type annotation would be having annotations
-auto-generated for the openapi classes and API calls, and to make string
-munging in importer code less buggy.
-
-## Format Strings
-
-Eg, replace code like:
-
-    "There are {} out of {} objects".format(found, total)
-
-With:
-
-    f"There are {found} out of {total} objects"
-
-## Walrus Operator
-
-New operator allows checking and assignment together:
-
-    if (n := len(a)) > 10:
-        print(f"List is too long ({n} elements, expected <= 10)")
-
-I feel like we would actually use this pattern *a ton* in importer code, where
-we do a lot of lookups or cleaning then check if we got a `None`.
-
diff --git a/proposals/README.md b/proposals/README.md
new file mode 100644
index 00000000..5e6747b1
--- /dev/null
+++ b/proposals/README.md
@@ -0,0 +1,11 @@
+
+This folder contains proposals for larger changes to the fatcat system. These
+might be schema changes, new projects, technical details, etc. Any change which
+is large enough to require planning and documentation.
+
+Each should be tagged with a date first drafted, and labeled with a status:
+
+- brainstorm: just putting ideas down; might not even happen
+- planned: commited to happening, but not started yet
+- work-in-progress: currently being worked on
+- implemented: completed, merged to master/production/live
-- 
cgit v1.2.3