Diffstat (limited to 'proposals')
-rw-r--r-- proposals/20190509_v03_schema_tweaks.md (renamed from proposals/20190509_schema_tweaks.md) 4
-rw-r--r-- proposals/20190510_editgroup_endpoint_prefix.md 2
-rw-r--r-- proposals/20190510_release_ext_ids.md 2
-rw-r--r-- proposals/20190514_fatcat_identifiers.md 27
-rw-r--r-- proposals/20190911_search_query_parsing.md 28
-rw-r--r-- proposals/20190911_v04_schema_tweaks.md 7
-rw-r--r-- proposals/20191018_bigger_db.md 4
-rw-r--r-- proposals/20200103_py37_refactors.md 101
-rw-r--r-- proposals/README.md 11
9 files changed, 184 insertions, 2 deletions
diff --git a/proposals/20190509_schema_tweaks.md b/proposals/20190509_v03_schema_tweaks.md
index 7e372959..150ce525 100644
--- a/proposals/20190509_schema_tweaks.md
+++ b/proposals/20190509_v03_schema_tweaks.md
@@ -1,4 +1,6 @@
+Status: implemented
+
# SQL (and API) schema changes
Intend to make these changes at the same time as bumping OpenAPI schema from
@@ -139,4 +141,4 @@ Do these as separate commits, after merging back in to master, for v0.3:
`release_month`: apparently pretty common to know the year and month but not
date. I have avoided so far, seems like unnecessary complexity. Could start
-as an `extra_json` field?
+as an `extra_json` field? NOT IMPLEMENTED
diff --git a/proposals/20190510_editgroup_endpoint_prefix.md b/proposals/20190510_editgroup_endpoint_prefix.md
index f517383b..6794266e 100644
--- a/proposals/20190510_editgroup_endpoint_prefix.md
+++ b/proposals/20190510_editgroup_endpoint_prefix.md
@@ -1,4 +1,6 @@
+Status: implemented
+
# Editgroup API Endpoint Prefixes
In summary, change the API URL design such that entity mutations (create,
diff --git a/proposals/20190510_release_ext_ids.md b/proposals/20190510_release_ext_ids.md
index 1d2b912a..8953448c 100644
--- a/proposals/20190510_release_ext_ids.md
+++ b/proposals/20190510_release_ext_ids.md
@@ -1,4 +1,6 @@
+Status: implemented
+
# Release External ID Refactor
Goal is to make the external identifier "namespace" (number of external
diff --git a/proposals/20190514_fatcat_identifiers.md b/proposals/20190514_fatcat_identifiers.md
new file mode 100644
index 00000000..325e48f5
--- /dev/null
+++ b/proposals/20190514_fatcat_identifiers.md
@@ -0,0 +1,27 @@
+
+Status: brainstorm
+
+Fatcat Identifiers
+=======================
+
+AKA, `fcid`
+
+## Public Use / Reference
+
+When referencing identifiers in external databases, should prefix with the
+entity type. Eg:
+
+    release_hsmo6p4smrganpb3fndaj2lon4
+    editgroup_qinmjr2lbvgd3mbt7mifir23fy
+
+Or with an additional `fatcat:` namespace prefix:
+
+    fatcat:release_hsmo6p4smrganpb3fndaj2lon4
+
+As a usability affordance, the public web interface (though not the API) should
+issue permanent HTTP redirects (301 or 308) to the canonical page, like:
+
+    https://fatcat.wiki/release_hsmo6p4smrganpb3fndaj2lon4
+    HTTP 301 => https://fatcat.wiki/release/hsmo6p4smrganpb3fndaj2lon4
+
+However, no intention to use identifiers in this schema in the API itself?
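
A minimal sketch of how the web interface might map a prefixed identifier to its canonical path before issuing the redirect. The function names and the list of entity types are assumptions for illustration (only `release` and `editgroup` appear in this proposal), and the 26-character base32 pattern is inferred from the two examples above rather than taken from a specification.

```python
import re
from typing import Optional, Tuple

# Assumption: only the entity types shown in the proposal; a real list would be longer.
ENTITY_TYPES = ("release", "editgroup")

# Assumption: identifiers are 26 lowercase base32 characters, as in the examples.
FCID_RE = re.compile(r"^(?:fatcat:)?(?P<type>[a-z]+)_(?P<ident>[a-z2-7]{26})$")


def parse_fcid(raw: str) -> Optional[Tuple[str, str]]:
    """Split 'release_<ident>' or 'fatcat:release_<ident>' into (type, ident)."""
    match = FCID_RE.match(raw.strip())
    if match is None or match.group("type") not in ENTITY_TYPES:
        return None
    return match.group("type"), match.group("ident")


def canonical_path(raw: str) -> Optional[str]:
    """Return the canonical web path to redirect to, or None if not an fcid."""
    parsed = parse_fcid(raw)
    if parsed is None:
        return None
    entity_type, ident = parsed
    return "/{}/{}".format(entity_type, ident)
```

A web handler would call `canonical_path()` on the first path segment and answer with a 301 or 308 when it returns a value, falling through to a normal 404 otherwise.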
diff --git a/proposals/20190911_search_query_parsing.md b/proposals/20190911_search_query_parsing.md
new file mode 100644
index 00000000..f1fb0128
--- /dev/null
+++ b/proposals/20190911_search_query_parsing.md
@@ -0,0 +1,28 @@
+
+Status: brainstorm
+
+## Search Query Parsing
+
+The default "release" search on fatcat.wiki currently uses the elasticsearch
+built-in `query_string` parser, which is explicitly not recommended for
+public/production use.
+
+The best way forward is likely a custom query parser (eg, PEG-generated parser)
+that generates a complete elasticsearch query JSON structure.
+
+A couple search issues this would help with:
+
+- better parsing of keywords (year, year-range, DOI, ISSN, etc) in complex
+ queries and turning these in to keyword term sub-queries
+- queries including terms from multiple fields which aren't explicitly tagged
+ (eg, "lovelace computer" vs. "author:lovelace title:computer")
+- avoiding unsustainably expensive queries (eg, prefix wildcard, regex)
+- handling single-character misspellings and synonyms
+- collapsing multiple releases under the same work in search results
+
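
To make the first two bullets concrete, here is a rough sketch that pulls DOI, ISSN, and year tokens out of a free-text query and turns them into elasticsearch `term`/`range` filters, leaving the remaining words as a `match` clause. It uses regular expressions rather than the PEG parser proposed above, so it only illustrates the output shape. The index field names (`doi`, `issn`, `year`, `everything`) are placeholders, not the actual fatcat index mapping.

```python
import re
from typing import Any, Dict, List, Optional, Tuple

DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
ISSN_RE = re.compile(r"^\d{4}-\d{3}[\dXx]$")
YEAR_RE = re.compile(r"^\d{4}$")
YEAR_RANGE_RE = re.compile(r"^(\d{4})-(\d{4})$")


def plausible_year(value: int) -> bool:
    # Arbitrary bounds chosen for this sketch.
    return 1500 <= value <= 2100


def parse_year_range(token: str) -> Optional[Tuple[int, int]]:
    match = YEAR_RANGE_RE.match(token)
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    if plausible_year(start) and plausible_year(end) and start <= end:
        return start, end
    return None


def build_query(raw: str) -> Dict[str, Any]:
    """Build an elasticsearch query body from a free-text search string."""
    filters: List[Dict[str, Any]] = []
    words: List[str] = []
    for token in raw.split():
        year_range = parse_year_range(token)
        if DOI_RE.match(token):
            filters.append({"term": {"doi": token.lower()}})
        elif year_range is not None:
            # Checked before ISSN: "1990-2000" has the same shape as an ISSN.
            filters.append({"range": {"year": {"gte": year_range[0], "lte": year_range[1]}}})
        elif ISSN_RE.match(token):
            filters.append({"term": {"issn": token.upper()}})
        elif YEAR_RE.match(token) and plausible_year(int(token)):
            filters.append({"term": {"year": int(token)}})
        else:
            words.append(token)
    if words:
        must: List[Dict[str, Any]] = [
            {"match": {"everything": {"query": " ".join(words), "operator": "and"}}}
        ]
    else:
        must = [{"match_all": {}}]
    return {"query": {"bool": {"must": must, "filter": filters}}}
```

Because the parser only ever emits `match`, `term`, and `range` clauses, expensive constructs like prefix wildcards and regexes cannot reach elasticsearch unless explicitly added. A token shaped like both an ISSN and a year range stays ambiguous in this sketch; a real parser would need an ISSN checksum test or explicit field tags.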
+In the near future, we may also create a fulltext search index, which will have
+its own issues.
+
+## Tech Changes
+
+If we haven't already, we should also switch to using an elasticsearch client library.
diff --git a/proposals/20190911_v04_schema_tweaks.md b/proposals/20190911_v04_schema_tweaks.md
index 8ccbac79..eaf39474 100644
--- a/proposals/20190911_v04_schema_tweaks.md
+++ b/proposals/20190911_v04_schema_tweaks.md
@@ -1,5 +1,7 @@
-status: work-in-progress
+Status: planned
+
+## Schema Changes for v0.4 Release
Proposed schema changes for next fatcat iteration (v0.4? v0.5?).
@@ -17,6 +19,9 @@ SQL (and API, and elasticsearch):
- TODO: release: switch how pages work? first/last?
- TODO: indication of peer-review process? at release or container level?
- TODO: container: separate canonical and disambiguating titles (?)
+- TODO: release inter-references using SCHOLIX/Datacite schema
+ https://zenodo.org/record/1120265
+ https://support.datacite.org/docs/connecting-research-outputs#section-related-identifiers
API tweaks:
diff --git a/proposals/20191018_bigger_db.md b/proposals/20191018_bigger_db.md
index cd5f6e7b..7a5216d0 100644
--- a/proposals/20191018_bigger_db.md
+++ b/proposals/20191018_bigger_db.md
@@ -1,4 +1,8 @@
+Status: brainstorm
+
+## Catalog Database Scaling
+
How can we scale the fatcat backend to support:
- one billion release entities
diff --git a/proposals/20200103_py37_refactors.md b/proposals/20200103_py37_refactors.md
new file mode 100644
index 00000000..f0321b33
--- /dev/null
+++ b/proposals/20200103_py37_refactors.md
@@ -0,0 +1,101 @@
+
+status: planning
+
+If we update fatcat python code to python3.7, what code refactoring changes can
+we make? We currently use/require python3.5.
+
+Nice features in python3 I know of are:
+
+- dataclasses (python3.7)
+- async/await (mature in python3.7?)
+- type annotations (python3.5)
+- format strings (python3.6)
+- walrus assignment (python3.8)
+
+Not sure if the walrus operator is worth jumping all the way to python3.8.
+
+While we are at it, what other superficial refactorings might we want to do?
+
+- strict lint style (eg, maximum column width) with `black` (python3.6)
+- logging/debugging/verbose
+- type annotations and checking
+- use named dicts or structs in place of dicts
+
+## Linux Distro Support
+
+The default python versions shipped by current and planned linux releases are:
+
+- ubuntu xenial 16.04 LTS: python3.5
+- ubuntu bionic 18.04 LTS: python3.6
+- ubuntu focal 20.04 LTS: python3.8 (planned)
+- debian buster 10 2019: python3.7
+
+Python 3.7 is the default in debian buster (10).
+
+There are apt PPA package repositories that allow backporting newer pythons to
+older releases. As far as I know this is safe and doesn't override any system
+usage if we are careful not to set the defaults (aka, `python3` command should
+be the older version unless inside a virtualenv).
+
+It would also be possible to use `pyenv` to have `virtualenv`s with custom
+python versions. We should probably do that for OS X and/or windows support if
+we wanted those. But having a system package is probably a lot faster to
+install.
+
+## Dataclasses
+
+`dataclasses` are a user-friendly way to create struct-like objects. They are
+pretty similar to the existing `namedtuple`, but can be mutable and have
+methods attached to them (they are just classes), plus several other usability
+improvements.
+
+Most places we are throwing around dicts with structure we could be using
+dataclasses instead. There are some instances of this in fatcat, but many more
+in sandcrawler.
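
As an illustration of the dict-to-dataclass change, a small sketch follows. The `FileMeta` class and its fields are hypothetical and not taken from the fatcat or sandcrawler code; the point is only the mutability, attached method, and default values that `namedtuple` handles less comfortably.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class FileMeta:
    """Hypothetical replacement for a loosely structured dict."""

    sha1hex: str
    size_bytes: int
    mimetype: Optional[str] = None
    # Mutable defaults need default_factory so instances don't share one dict.
    extra: Dict[str, str] = field(default_factory=dict)

    def is_pdf(self) -> bool:
        return self.mimetype == "application/pdf"


meta = FileMeta(sha1hex="0" * 40, size_bytes=1024, mimetype="application/pdf")
meta.size_bytes = 2048  # unlike a namedtuple, fields can be reassigned
as_dict = asdict(meta)  # plain dict again, eg for JSON serialization
```

A misspelled field name now fails at construction time with a `TypeError`, where a dict would silently carry the wrong key.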
+
+## Async/Await
+
+Where might we actually use async/await? I think more in sandcrawler than in
+the python tools or web apps. The GROBID, ingest, and ML workers in particular
+should be async over batches, as should all fetches from CDX/wayback.
+
+Some of the kafka workers *could* be async, but I'm not sure how much speedup
+there would actually be. For example, the entity updates worker could fetch
+entities for an editgroup concurrently.
+
+Inserts (importers) should probably mostly happen serially, at least the kafka
+importers, one editgroup at a time, so progress is correctly recorded in kafka.
+Parallelization should probably happen at the partition level; would need to
+think through whether async would actually help with code simplicity vs. thread
+or process parallelization.
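
For the "fetch entities for an editgroup concurrently" case, a sketch of the pattern with `asyncio` is below. `fetch_entity` is a stand-in that sleeps instead of calling the API, and the function names and concurrency limit are assumptions; a real worker would need an async HTTP client.

```python
import asyncio
from typing import Dict, List


async def fetch_entity(ident: str) -> Dict[str, str]:
    # Stand-in for a network round trip to the API (assumption for this sketch).
    await asyncio.sleep(0.01)
    return {"ident": ident}


async def fetch_editgroup_entities(
    idents: List[str], max_concurrent: int = 10
) -> List[Dict[str, str]]:
    """Fetch all entities concurrently, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(ident: str) -> Dict[str, str]:
        async with semaphore:
            return await fetch_entity(ident)

    # gather() returns results in the same order as the inputs.
    return list(await asyncio.gather(*(bounded_fetch(i) for i in idents)))


entities = asyncio.run(fetch_editgroup_entities(["aaa", "bbb", "ccc"]))
```

`asyncio.run()` is one of the python3.7 additions. The outer loop over editgroups can stay serial, so kafka offsets are still committed one editgroup at a time.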
+
+## Type Annotations
+
+The meta-goals of (gradual) type annotations would be catching more bugs at
+development time, and having code be more self-documenting and easier to
+understand.
+
+The two big wins I see with type annotation would be having annotations
+auto-generated for the openapi classes and API calls, and to make string
+munging in importer code less buggy.
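
A small example of the string-munging case: annotating a cleaning helper as `Optional[str]` lets a checker such as `mypy` flag callers that forget the `None` case. The `clean_doi` helper and its rules are illustrative, not the existing fatcat implementation.

```python
from typing import Optional


def clean_doi(raw: Optional[str]) -> Optional[str]:
    """Normalize a DOI string, or return None if it doesn't look like one."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    if not doi.startswith("10.") or "/" not in doi:
        return None
    return doi
```

With this signature, an expression like `clean_doi(raw).upper()` is reported as an error by the type checker instead of failing at import time on the first bad record.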
+
+## Format Strings
+
+Eg, replace code like:
+
+    "There are {} out of {} objects".format(found, total)
+
+With:
+
+    f"There are {found} out of {total} objects"
+
+## Walrus Operator
+
+New operator allows checking and assignment together:
+
+    if (n := len(a)) > 10:
+        print(f"List is too long ({n} elements, expected <= 10)")
+
+I feel like we would actually use this pattern *a ton* in importer code, where
+we do a lot of lookups or cleaning then check if we got a `None`.
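
A sketch of that importer pattern, assuming python3.8: clean a value and test it for `None` in a single expression. Both helper names and the record keys are made up for illustration.

```python
from typing import Any, Dict, Optional


def clean_str(raw: Any) -> Optional[str]:
    """Collapse whitespace; return None for missing or empty values."""
    if not isinstance(raw, str):
        return None
    cleaned = " ".join(raw.split())
    return cleaned or None


def first_clean_title(record: Dict[str, Any]) -> Optional[str]:
    """Return the first usable title among several candidate keys."""
    for key in ("title", "original_title"):
        # Assign the cleaned value and check it in one step.
        if (title := clean_str(record.get(key))) is not None:
            return title
    return None
```

Without the walrus operator, each candidate needs a separate assignment line before the `if`, which is the repetition this would remove.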
+
diff --git a/proposals/README.md b/proposals/README.md
new file mode 100644
index 00000000..5e6747b1
--- /dev/null
+++ b/proposals/README.md
@@ -0,0 +1,11 @@
+
+This folder contains proposals for larger changes to the fatcat system. These
+might be schema changes, new projects, technical details, etc: any change which
+is large enough to require planning and documentation.
+
+Each should be tagged with a date first drafted, and labeled with a status:
+
+- brainstorm: just putting ideas down; might not even happen
+- planned: committed to happening, but not started yet
+- work-in-progress: currently being worked on
+- implemented: completed, merged to master/production/live