From 0dc872921023030f6ffd320eb038e5379b47fa53 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Tue, 11 Sep 2018 13:56:53 -0700
Subject: update TODO lists (september plan)

---
 TODO | 120 +++++++++++++++++++++++++++----------------------------------------
 1 file changed, 48 insertions(+), 72 deletions(-)

diff --git a/TODO b/TODO
index 765f6a3a..900e8eda 100644
--- a/TODO
+++ b/TODO
@@ -1,29 +1,31 @@
 ## Next Up
 
-- some significant slow-down has happened? transactions, or regexes?
-
-summer roadmap:
-- PUT/UPDATE, DELETE, and merge code paths
-- faster UPDATE-free bulk import code path
-- container import (extra?): lang, region, subject
-- basic API+webface creation, editing, merging, editgroup approval
+- basic webface creation, editing, merging, editgroup approval
 - elastic schema/transform for releases; bulk and continuous scripts
 
-features:
-- fast database dump command: both changelog-based and entity-based (rust)
-  => lighter, more complete dumps for each entity type?
-- guide skeleton (mdbook; guide.fatcat.wiki)
+## QA Blockers
+
+- refactors and correctness in rust/TODO
+- importers have editor accounts and include editgroup metadata
+- crossref importer uses extids
+
+## Production blockers
+
+- enforce single-ident-edit-per-editgroup
+  => entity_edit: entity_ident/entity_editgroup should be UNIQ index
+  => UPDATE/REPLACE edits?
+- crossref importer sets release_type as "stub" when appropriate
+- re-implement old python tests
+- real auth
+- metrics, jwt, config, sentry
+
+## Metadata Import
 
-importers:
-- CORE
-- wikidata cross-ref (if they have a dump)
 - manifest: multiple URLs per SHA1
-- pubmed (medline), if not in CORE
-  => and/or, use pubmed ID lookups on crossref import
-- core
-- semantic scholar (up to 39 million; author de-dupe)
-- wikidata (if they have a dump)
 - crossref: relations ("is-preprint-of")
+- crossref: two phases: no citations, then matched citations (via DOI table)
+- container import (extra?): lang, region, subject
 - crossref: filter works
   => content-type whitelist
   => title length and title/slug blacklist
@@ -31,61 +33,43 @@ importers:
   => make this a method on Release object
   => or just set release_stub as "stub"?
 
-bugs:
+new importers:
+- pubmed (medline) (filtered)
+  => and/or, use pubmed ID lookups on crossref import
+- CORE (filtered)
+- semantic scholar (up to 39 million; author de-dupe)
+
+## Entity/Edit Lifecycle
+
+- redirects and merges (API, webface, etc)
 - test: release pointing to a collection that has been deleted/redirected
   => UI crash?
+- commenting and accepting editgroups
+- editgroup state machine?
+- enforce "single ident edit per editgroup"
+  => how to "edit an edit"? clobber existing?
 
-july roadmap:
-- complete and test this round of schema changes
-- container import (extra?): lang, region, subject
-- re-run imports
-- basic API+webface creation, editing, merging, editgroup approval
-- elastic schema/transform for releases; bulk and continuous scripts
-
-## Schema / Alignment / Scope
+## Guide / Book / Style
 
-- "container" -> "venue"?
-- release_type, release_status, url.rel write-time schema (and others?)
+- release_type, release_status, url.rel schemas (and enforce in API?)
 
 name ref: https://www.w3.org/International/questions/qa-personal-names
 
-## API
-
-- how to send edit "extra" metadata?
-- hydrate entities in API
-  ? "expand" query param
-
-## High-Level Priorities
-
-- full database dump (export)
-- manual editing of containers and releases (web interface)
-
-## Web UI
-
-- changelog more like a https://semantic-ui.com/views/feed.html ?
-- instead of grid, maybe https://semantic-ui.com/elements/rail.html
+## Fun Features
 
-## Performance
-
-- write pure-rust "benchmark" scripts that hit, eg, lookups and batch
-  endpoints. run these with auto_explain on, then look in logs on dev machine
-- batch inserts automerge: create editgroup and changelog, mark all edits as
-  accepted, all in a single transaction
-
-## API
-
-- hydrate entities in API
-  ? "expand" query param
-- don't include abstracts by default?
-- "stub" mode for lookups, returning only the ident (or maybe whole row)?
-
-## Database
-
-- test using hash indexes for some UUID column indexes, or at least sha1 and
-  other hashes (abstracts, file lookups)
+- "save paper now"
+  => is it in GWB? if not, SPN
+  => get hash + url from GWB, verify mimetype acceptable
+  => is file in fatcat?
+  => what about HBase? GROBID?
+  => create edit, redirect user to editgroup submit page
+- python client tool and library in pypi
+  => or maybe rust?
+- bibtex (etc) export
 
 ## Other
 
+- consider using "HTTP 202: Accepted" for entity-mutating calls
 - basic python hbase/elastic matcher
   => takes sha1 keys
   => checks fatcat API + hbase
   => decides whether or not to upload
   => with some kind of progress/completion tracking
   => proof-of-concept, no tests
 - add_header Strict-Transport-Security "max-age=3600";
   => 12 hours? 24?
-- criterion.rs benchmarking
-- schema.org metadata in webface
-- bulk endpoint auto-merge mode (huge postgres speedup on import)
 - elastic pipeline
 - kong or oauth2_proxy for auth, rate-limit, etc
+- feature flags: consul?
+- secrets: vault?
 - "authn" microservice: https://keratin.tech/
-- PUT for mid-edit revisions
-- 'parent rev' for revisions (vs. container parent)
-- "submit" status for editgroups?
-
-review
-- what does openlibrary API look like?
-x add a 'live' (or 'immutable') flag to revision tables
 
 better API docs
 - https://sourcey.com/spectacle/
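
A note on the "enforce single-ident-edit-per-editgroup" production blocker in the patch above: the UNIQ-index idea could be sketched as a Postgres constraint. This is a minimal illustration only — the table and column names (`entity_edit`, `editgroup_id`, `ident_id`) are assumptions, not the actual fatcat schema:

```sql
-- Hypothetical sketch: allow at most one edit per (editgroup, ident) pair.
-- Column names are illustrative; the real entity_edit schema may differ.
ALTER TABLE entity_edit
    ADD CONSTRAINT entity_edit_editgroup_ident_uniq
    UNIQUE (editgroup_id, ident_id);
```

With such a constraint in place, "edit an edit" (the open question in the Entity/Edit Lifecycle section) would have to either clobber the existing row or fail with a unique-violation error.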
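
On the Strict-Transport-Security question in the patch ("=> 12 hours? 24?"): `max-age` is specified in seconds, so the `max-age=3600` shown is only one hour. A sketch of the two values being debated, as nginx config:

```nginx
# HSTS max-age is in seconds: 3600 = 1 hour, 43200 = 12 hours, 86400 = 24 hours
add_header Strict-Transport-Security "max-age=43200";    # 12 hours
# add_header Strict-Transport-Security "max-age=86400";  # 24 hours
```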
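
The "basic python hbase/elastic matcher" item describes a decision flow (take sha1 keys, check fatcat API + HBase, decide whether or not to upload, track progress). A hedged proof-of-concept of just that decision logic — all names here are invented for illustration, and the real lookups would be a fatcat API call and an HBase GET rather than the callables assumed below:

```python
# Hypothetical sketch of the sha1-matcher flow from the TODO notes.
# in_fatcat / in_hbase are callables: sha1_hex -> bool (stand-ins for a
# fatcat API lookup and an HBase row check).

def decide_upload(sha1_hex, in_fatcat, in_hbase):
    """Classify one file hash: skip, upload, or missing."""
    if in_fatcat(sha1_hex):
        return "skip"      # already known to fatcat
    if in_hbase(sha1_hex):
        return "upload"    # crawled content not yet in fatcat
    return "missing"       # not crawled; nothing to upload yet

def run_matcher(sha1_keys, in_fatcat, in_hbase):
    """Process a batch of sha1 keys; the tally doubles as crude
    progress/completion tracking."""
    tally = {"skip": 0, "upload": 0, "missing": 0}
    for sha1 in sha1_keys:
        tally[decide_upload(sha1, in_fatcat, in_hbase)] += 1
    return tally
```

As the TODO says, this would be a proof-of-concept without tests; membership sets below stand in for the real services.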
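
The "save paper now" bullet chains several checks (in GWB? if not, SPN; verify mimetype; already in fatcat?; create edit). A sketch of that control flow under heavy assumptions — the `gwb` and `fatcat` arguments are plain in-memory stand-ins for the real services, and every field name is invented:

```python
# Hypothetical outline of the "save paper now" decision chain; nothing
# here is a real GWB/SPN/fatcat API.

def save_paper_now(url, gwb, fatcat_sha1s,
                   acceptable_mimetypes=("application/pdf",)):
    """gwb: dict url -> capture info; fatcat_sha1s: set of known hashes."""
    capture = gwb.get(url)
    if capture is None:
        # not in GWB: pretend an SPN crawl happened and produced a capture
        capture = {"sha1": "fake-sha1", "mimetype": "application/pdf"}
        gwb[url] = capture
    if capture["mimetype"] not in acceptable_mimetypes:
        return {"status": "rejected", "reason": "mimetype"}
    if capture["sha1"] in fatcat_sha1s:
        return {"status": "exists", "sha1": capture["sha1"]}
    # real flow: create a file edit, then redirect the user to the
    # editgroup submit page
    return {"status": "edit-created", "sha1": capture["sha1"]}
```

The open HBase/GROBID questions from the notes are left out of the sketch entirely.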