
## In Progress

## Next Up

- fileset/webcapture entities
- authentication
- remove the concept of "active editgroup", and simplify autoaccept batch path
- fix returned error messages; should return type (shortname), and then actual
  message/description
- handle wip entities in web UI
- elastic inserter should handle deletions and redirects; if state isn't
  active, delete the document
    => don't delete, just store state. but need to "blank" redirects and WIP so
       they don't show up in results
    => refactor inserter to be a class (eg, for command line use)
    => end-to-end test of this behavior?

## Ideas

- fast path to skip recursive redirect checks for bulk inserts
- when getting "wip" entities, require a parameter ("allow_wip"), else get a
  404
- consider dropping CORE identifier
- maybe better 'success' return message? eg, "success: true" flag
- idea: allow users to generate their own editgroup UUIDs, to reduce round
  trips and "hanging" editgroups (created but never edited)
- API: allow deletion of empty, un-accepted editgroups
- refactor API schema for some entity-generic methods (eg, history, edit
  operations) to take entity type as a URL path param.
  greatly reduce macro foolery and method count/complexity, and ease creation
  of new entities
    => /{entity}/edit/{edit_id}
    => /{entity}/{ident}/redirects
    => /{entity}/{ident}/history

## Production blockers

- refactors and correctness in rust/TODO
- importers have editor accounts and include editgroup metadata
- crossref importer sets release_type as "stub" when appropriate
- real authentication and authorization
- metrics, jwt, config, sentry
- importers: don't insert wayback links with short timestamps

## Metadata Import

- cleaning/matching: https://ftfy.readthedocs.io/en/latest/
- manifest: multiple URLs per SHA1
- crossref: relations ("is-preprint-of")
- crossref: two phase: no citations, then matched citations (via DOI table)
- container import (extra?): lang, region, subject
- crossref: filter works
    => content-type whitelist
    => title length and title/slug blacklist
    => at least one author (?)
    => make this a method on Release object
    => or just set release_type as "stub"?

new importers:
- pubmed (medline) (filtered)
    => and/or, use pubmed ID lookups on crossref import
- arxiv.org
- DOAJ
- CORE (filtered)
- semantic scholar (up to 39 million; includes author de-dupe)

## Entity/Edit Lifecycle

- commenting and accepting editgroups
- editgroup state machine?

## Guide / Book / Style

- release_type, release_status, url.rel schemas (enforced in API)
- more+better terms+policies: https://tosdr.org/index.html

## Fun Features

- "save paper now"
    => is it in GWB? if not, SPN
    => get hash + url from GWB, verify mimetype acceptable
    => is file in fatcat?
    => what about HBase? GROBID?
    => create edit, redirect user to editgroup submit page
- python client tool and library in pypi
    => or maybe rust?
- bibtex (etc) export

## Schema / Entity Fields

- arxiv_id field (keep flip-flopping)
- original_title field (?)
- FileSet and WebCapture entities
- `doi` field for containers (at least for "journal" type; maybe for "series"
  as well?)
- `retracted`, `translation`, and perhaps `corrected` as flags on releases,
  instead of release_status?
- 'part-of' relation for releases (release to release) and possibly containers
- `container-type` field for containers (journal, conference, book series, etc)

## Other / Backburner

- refactor openapi schema to use shared response types
- consider using "HTTP 202: Accepted" for entity-mutating calls
- basic python hbase/elastic matcher
    => takes sha1 keys
    => checks fatcat API + hbase
    => if not matched yet, tries elastic search
    => simple ~exact match heuristic
    => proof-of-concept, no tests
- add_header Strict-Transport-Security "max-age=3600";
    => 12 hours? 24?
- haproxy for rate-limiting
- feature flags: consul?
- secrets: vault?
- "authn" microservice: https://keratin.tech/

better API docs
- readme.io has a free open source plan (or at least used to)
- https://github.com/readmeio/api-explorer
- https://github.com/lord/slate
- https://sourcey.com/spectacle/
- https://github.com/DapperDox/dapperdox

CSL:
- https://citationstyles.org/
- https://github.com/citation-style-language/documentation/blob/master/primer.txt
- https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html
- https://github.com/citation-style-language/schema/blob/master/csl-types.rnc
- perhaps a "create from CSL" endpoint?
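
rough sketch for the "create from CSL" idea: one way the mapping from a CSL-JSON
item to a release-like dict might look. The input keys (`type`, `title`, `DOI`,
`author`, `issued`, `container-title`, `page`, `volume`, `issue`) are standard
CSL-JSON. The function names and all output field names (`release_date`,
`contribs`, `raw_name`, `extra.container_name`, etc) are assumptions made for
illustration, not the actual API schema. Passing the CSL `type` straight through
as `release_type` is a placeholder until the release_type schema is settled.

```python
from typing import Any, Dict, Optional


def csl_date_to_iso(issued: Optional[Dict[str, Any]]) -> Optional[str]:
    """Convert a CSL-JSON date ({"date-parts": [[Y, M, D]]}) to YYYY-MM-DD.

    Returns None unless year, month, and day are all present.
    """
    if not issued:
        return None
    parts = issued.get("date-parts") or []
    if not parts or len(parts[0]) < 3:
        return None
    # CSL-JSON allows date parts as either numbers or strings
    year, month, day = (int(p) for p in parts[0][:3])
    return f"{year:04d}-{month:02d}-{day:02d}"


def csl_to_release(csl: Dict[str, Any]) -> Dict[str, Any]:
    """Map a CSL-JSON item to a release-like dict (hypothetical field names)."""
    contribs = []
    for author in csl.get("author", []):
        if "literal" in author:
            # institutional or otherwise unparsed names
            name = author["literal"]
        else:
            name = " ".join(
                p for p in (author.get("given"), author.get("family")) if p
            )
        contribs.append({"raw_name": name, "role": "author"})

    container = csl.get("container-title")
    release = {
        "title": csl.get("title"),
        "release_type": csl.get("type"),  # placeholder: passed through as-is
        "doi": (csl.get("DOI") or "").lower() or None,
        "volume": csl.get("volume"),
        "issue": csl.get("issue"),
        "pages": csl.get("page"),
        "release_date": csl_date_to_iso(csl.get("issued")),
        "contribs": contribs,
        "extra": {"container_name": container} if container else {},
    }
    # drop unset fields so the result only carries what the CSL item provided
    return {k: v for k, v in release.items() if v not in (None, [], {})}
```

an endpoint could accept the CSL-JSON body, run something like
`csl_to_release()`, and create the release inside an editgroup; container
lookup/matching is left out here.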