Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/20190509_v03_schema_tweaks.md (renamed from proposals/20190509_schema_tweaks.md) | 4
-rw-r--r-- | proposals/20190510_editgroup_endpoint_prefix.md | 2
-rw-r--r-- | proposals/20190510_release_ext_ids.md | 2
-rw-r--r-- | proposals/20190514_fatcat_identifiers.md | 27
-rw-r--r-- | proposals/20190911_search_query_parsing.md | 28
-rw-r--r-- | proposals/20190911_v04_schema_tweaks.md | 7
-rw-r--r-- | proposals/20191018_bigger_db.md | 4
-rw-r--r-- | proposals/20200103_py37_refactors.md | 101
-rw-r--r-- | proposals/README.md | 11
9 files changed, 184 insertions, 2 deletions
diff --git a/proposals/20190509_schema_tweaks.md b/proposals/20190509_v03_schema_tweaks.md
index 7e372959..150ce525 100644
--- a/proposals/20190509_schema_tweaks.md
+++ b/proposals/20190509_v03_schema_tweaks.md
@@ -1,4 +1,6 @@
 
+Status: implemented
+
 # SQL (and API) schema changes
 
 Intend to make these changes at the same time as bumping OpenAPI schema from
@@ -139,4 +141,4 @@ Do these as separate commits, after merging back in to master, for v0.3:
 
 `release_month`: apparently pretty common to know the year and month but not
 date. I have avoided so far, seems like unnecessary complexity. Could start
-as an `extra_json` field?
+as an `extra_json` field? NOT IMPLEMENTED
diff --git a/proposals/20190510_editgroup_endpoint_prefix.md b/proposals/20190510_editgroup_endpoint_prefix.md
index f517383b..6794266e 100644
--- a/proposals/20190510_editgroup_endpoint_prefix.md
+++ b/proposals/20190510_editgroup_endpoint_prefix.md
@@ -1,4 +1,6 @@
 
+Status: implemented
+
 # Editgroup API Endpoint Prefixes
 
 In summary, change the API URL design such that entity mutations (create,
diff --git a/proposals/20190510_release_ext_ids.md b/proposals/20190510_release_ext_ids.md
index 1d2b912a..8953448c 100644
--- a/proposals/20190510_release_ext_ids.md
+++ b/proposals/20190510_release_ext_ids.md
@@ -1,4 +1,6 @@
 
+Status: implemented
+
 # Release External ID Refactor
 
 Goal is to make the external identifier "namespace" (number of external
diff --git a/proposals/20190514_fatcat_identifiers.md b/proposals/20190514_fatcat_identifiers.md
new file mode 100644
index 00000000..325e48f5
--- /dev/null
+++ b/proposals/20190514_fatcat_identifiers.md
@@ -0,0 +1,27 @@
+
+Status: brainstorm
+
+Fatcat Identifiers
+=======================
+
+AKA, `fcid`
+
+## Public Use / Reference
+
+When referencing identifiers in external databases, should prefix with the
+entity type. Eg:
+
+    release_hsmo6p4smrganpb3fndaj2lon4
+    editgroup_qinmjr2lbvgd3mbt7mifir23fy
+
+Or with a prefix:
+
+    fatcat:release_hsmo6p4smrganpb3fndaj2lon4
+
+As a usability affordance, the public web interface (though not API) should do
+permanent HTTP redirects (301 or 308) to the canonical page like:
+
+    https://fatcat.wiki/release_hsmo6p4smrganpb3fndaj2lon4
+    HTTP 301 => https://fatcat.wiki/release/hsmo6p4smrganpb3fndaj2lon4
+
+However, no intention to use identifiers in this schema in the API itself?
diff --git a/proposals/20190911_search_query_parsing.md b/proposals/20190911_search_query_parsing.md
new file mode 100644
index 00000000..f1fb0128
--- /dev/null
+++ b/proposals/20190911_search_query_parsing.md
@@ -0,0 +1,28 @@
+
+Status: brainstorm
+
+## Search Query Parsing
+
+The default "release" search on fatcat.wiki currently uses the elasticsearch
+built-in `query_string` parser, which is explicitly not recommended for
+public/production use.
+
+The best way forward is likely a custom query parser (eg, PEG-generated parser)
+that generates a complete elasticsearch query JSON structure.
+
+A couple of search issues this would help with:
+
+- better parsing of keywords (year, year-range, DOI, ISSN, etc) in complex
+  queries and turning these into keyword term sub-queries
+- queries including terms from multiple fields which aren't explicitly tagged
+  (eg, "lovelace computer" vs. "author:lovelace title:computer")
+- avoiding unsustainably expensive queries (eg, prefix wildcard, regex)
+- handling single-character misspellings and synonyms
+- collapsing multiple releases under the same work in search results
+
+In the near future, we may also create a fulltext search index, which will have
+its own issues.
+
+## Tech Changes
+
+If we haven't already, should also switch to using the elasticsearch client library.
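As a rough illustration of the query-parsing idea in `proposals/20190911_search_query_parsing.md` above, here is a minimal sketch that turns a search string into an elasticsearch bool query. It uses a hand-rolled whitespace tokenizer, not the PEG-generated parser the proposal suggests. The field names (`author`, `title`, `doi`, `issn`, `year`) and the function `parse_query` are assumptions for illustration, not the actual fatcat search schema.

```python
import re
from typing import Any, Dict, List

# Hypothetical field names; the real fatcat index schema may differ.
TEXT_FIELDS = {"author", "title"}
KEYWORD_FIELDS = {"doi", "issn"}

# Loose patterns for recognizing untagged keywords.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
YEAR_RE = re.compile(r"^(1[5-9]|20)\d{2}$")


def parse_query(raw: str) -> Dict[str, Any]:
    """Build an elasticsearch bool query from a simple search string."""
    must: List[Dict[str, Any]] = []
    filters: List[Dict[str, Any]] = []
    free_text: List[str] = []

    for token in raw.split():
        field, sep, value = token.partition(":")
        if sep and value and field in TEXT_FIELDS:
            # Explicitly tagged text field, eg "author:lovelace"
            must.append({"match": {field: value}})
        elif sep and value and field in KEYWORD_FIELDS:
            # Explicitly tagged identifier: exact term sub-query
            filters.append({"term": {field: value}})
        elif DOI_RE.match(token):
            # Untagged token that looks like a DOI
            filters.append({"term": {"doi": token}})
        elif YEAR_RE.match(token):
            # Untagged token that looks like a publication year
            filters.append({"term": {"year": int(token)}})
        else:
            free_text.append(token)

    if free_text:
        # Untagged words are searched across several fields at once
        must.append({
            "multi_match": {
                "query": " ".join(free_text),
                "fields": ["title", "author"],
            }
        })

    return {"query": {"bool": {"must": must, "filter": filters}}}
```

Because the parser only ever emits `match`, `multi_match`, and `term` clauses, expensive constructs such as prefix wildcards or regexes cannot reach elasticsearch through it, which is one of the issues listed in the proposal.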
diff --git a/proposals/20190911_v04_schema_tweaks.md b/proposals/20190911_v04_schema_tweaks.md
index 8ccbac79..eaf39474 100644
--- a/proposals/20190911_v04_schema_tweaks.md
+++ b/proposals/20190911_v04_schema_tweaks.md
@@ -1,5 +1,7 @@
 
-status: work-in-progress
+Status: planned
+
+## Schema Changes for v0.4 Release
 
 Proposed schema changes for next fatcat iteration (v0.4? v0.5?).
 
@@ -17,6 +19,9 @@ SQL (and API, and elasticsearch):
 - TODO: release: switch how pages work? first/last?
 - TODO: indication of peer-review process? at release or container level?
 - TODO: container: separate canonical and disambiguating titles (?)
+- TODO: release inter-references using SCHOLIX/Datacite schema
+    https://zenodo.org/record/1120265
+    https://support.datacite.org/docs/connecting-research-outputs#section-related-identifiers
 
 API tweaks:
 
diff --git a/proposals/20191018_bigger_db.md b/proposals/20191018_bigger_db.md
index cd5f6e7b..7a5216d0 100644
--- a/proposals/20191018_bigger_db.md
+++ b/proposals/20191018_bigger_db.md
@@ -1,4 +1,8 @@
 
+Status: brainstorm
+
+## Catalog Database Scaling
+
 How can we scale the fatcat backend to support:
 
 - one billion release entities
diff --git a/proposals/20200103_py37_refactors.md b/proposals/20200103_py37_refactors.md
new file mode 100644
index 00000000..f0321b33
--- /dev/null
+++ b/proposals/20200103_py37_refactors.md
@@ -0,0 +1,101 @@
+
+status: planning
+
+If we update fatcat python code to python3.7, what code refactoring changes can
+we make? We currently use/require python3.5.
+
+Nice features in python3 I know of are:
+
+- dataclasses (python3.7)
+- async/await (mature in python3.7?)
+- type annotations (python3.5)
+- format strings (python3.6)
+- walrus assignment (python3.8)
+
+Not sure if the walrus operator is worth jumping all the way to python3.8.
+
+While we might be at it, what other superficial refactorings might we want to do?
+
+- strict lint style (eg, maximum column width) with `black` (python3.6)
+- logging/debugging/verbose
+- type annotations and checking
+- use named dicts or structs in place of dicts
+
+## Linux Distro Support
+
+The default python versions shipped by current and planned linux releases are:
+
+- ubuntu xenial 16.04 LTS: python3.5
+- ubuntu bionic 18.04 LTS: python3.6
+- ubuntu focal 20.04 LTS: python3.8 (planned)
+- debian buster 10 2019: python3.7
+
+Python 3.7 is the default in debian buster (10).
+
+There are apt PPA package repositories that allow backporting newer pythons to
+older releases. As far as I know this is safe and doesn't override any system
+usage if we are careful not to set the defaults (aka, `python3` command should
+be the older version unless inside a virtualenv).
+
+It would also be possible to use `pyenv` to have `virtualenv`s with custom
+python versions. We should probably do that for OS X and/or windows support if
+we wanted those. But having a system package is probably a lot faster to
+install.
+
+## Dataclasses
+
+`dataclasses` are a user-friendly way to create struct-like objects. They are
+pretty similar to the existing `namedtuple`, but can be mutable and have
+methods attached to them (they are just classes), plus several other usability
+improvements.
+
+Most places we are throwing around dicts with structure we could be using
+dataclasses instead. There are some instances of this in fatcat, but many more
+in sandcrawler.
+
+## Async/Await
+
+Where might we actually use async/await? I think more in sandcrawler than in
+the python tools or web apps. The GROBID, ingest, and ML workers in particular
+should be async over batches, as should all fetches from CDX/wayback.
+
+Some of the kafka workers *could* be async, but I'm not sure how much speedup
+there would actually be. For example, the entity updates worker could fetch
+entities for an editgroup concurrently.
+
+Inserts (importers) should probably mostly happen serially, at least the kafka
+importers, one editgroup at a time, so progress is correctly recorded in kafka.
+Parallelization should probably happen at the partition level; would need to
+think through whether async would actually help with code simplicity vs. thread
+or process parallelization.
+
+## Type Annotations
+
+The meta-goals of (gradual) type annotations would be catching more bugs at
+development time, and having code be more self-documenting and easier to
+understand.
+
+The two big wins I see with type annotation would be having annotations
+auto-generated for the openapi classes and API calls, and making string
+munging in importer code less buggy.
+
+## Format Strings
+
+Eg, replace code like:
+
+    "There are {} out of {} objects".format(found, total)
+
+With:
+
+    f"There are {found} out of {total} objects"
+
+## Walrus Operator
+
+New operator allows checking and assignment together:
+
+    if (n := len(a)) > 10:
+        print(f"List is too long ({n} elements, expected <= 10)")
+
+I feel like we would actually use this pattern *a ton* in importer code, where
+we do a lot of lookups or cleaning then check if we got a `None`.
+
diff --git a/proposals/README.md b/proposals/README.md
new file mode 100644
index 00000000..5e6747b1
--- /dev/null
+++ b/proposals/README.md
@@ -0,0 +1,11 @@
+
+This folder contains proposals for larger changes to the fatcat system. These
+might be schema changes, new projects, technical details, etc. Any change which
+is large enough to require planning and documentation.
+
+Each should be tagged with a date first drafted, and labeled with a status:
+
+- brainstorm: just putting ideas down; might not even happen
+- planned: committed to happening, but not started yet
+- work-in-progress: currently being worked on
+- implemented: completed, merged to master/production/live
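To make the refactors discussed in `proposals/20200103_py37_refactors.md` above more concrete, here is a small sketch combining a dataclass, type annotations, an f-string, and the walrus operator in importer-style code. The names `ReleaseStub`, `clean_doi`, and `parse_record` are hypothetical and do not come from the fatcat codebase. The walrus operator means this needs python3.8 or later.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ReleaseStub:
    """Hypothetical struct-like record standing in for an ad-hoc dict."""
    title: str
    doi: Optional[str] = None
    contrib_names: List[str] = field(default_factory=list)


def clean_doi(raw: Optional[str]) -> Optional[str]:
    """Toy cleaner: strip and lowercase; None if it does not look like a DOI."""
    if not raw:
        return None
    doi = raw.strip().lower()
    return doi if doi.startswith("10.") and "/" in doi else None


def parse_record(record: Dict[str, str]) -> Optional[ReleaseStub]:
    """Convert a raw dict into a ReleaseStub, or None if it has no usable DOI."""
    # Walrus operator: clean the value and check for None in one expression.
    if (doi := clean_doi(record.get("doi"))) is None:
        print(f"Skipping {record.get('title')!r}: no usable DOI")
        return None
    return ReleaseStub(title=record["title"], doi=doi)
```

With the dataclass in place, a type checker can flag a misspelled attribute such as `stub.tilte`, which a plain dict lookup would only surface at runtime.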