-rw-r--r-- | README.md | 35
-rw-r--r-- | TODO | 24
-rw-r--r-- | guide/TODO | 10
3 files changed, 29 insertions, 40 deletions
@@ -8,23 +8,20 @@
 
 ... catalog all the things!
 
+This repository contains source code for 'fatcat', an editable catalog of
+published written works (mostly journal articles), with a focus on tracking
+the location and status of full-text copies to ensure "perpetual access".
+
 The [RFC](./fatcat-rfc.md) is the original design document, and the best place
 to start for background. There is a work-in-progress "guide" at
 <https://guide.fatcat.wiki>; the canonical public location of this repository
 is <https://github.com/internetarchive/fatcat>.
 
-There are four main components:
-
-- backend API server and database
-- elasticsearch index
-- API client libraries and bots (eg, ingesters)
-- front-end web interface (built on API and library)
+There are three main components:
 
-The API server was prototyped in python. "Real" implementation started in
-golang, but shifted to Rust, and is work-in-progress. The beginings of a client
-library, web interface, and data ingesters exist in python. Elasticsearch index
-is currently just a Crossref metadata dump and doesn't match entities in the
-database/API (but is useful for paper lookups).
+- backend API server and database (in Rust)
+- API client libraries and bots (in Python)
+- front-end web interface (in Python; built on API and library)
 
 See the LICENSE file for details permissions and licensing of both python and
 rust code. In short, the auto-generated client libraries are permissively
@@ -32,26 +29,28 @@ released, while the API server and web interface are strong copyleft (AGPLv3).
 
 ## Status
 
-- HTTP API
-    - [x] base32 encoding of UUID identifiers
-    - [x] inverse many-to-many helpers (files-by-release, release-by-creator)
-- SQL Schema
+- SQL and HTTP API schemas
     - [x] Basic entities
     - [x] one-to-many and many-to-many entities
     - [x] JSON(B) "extra" metadata fields
     - [x] full rev1 schema for all entities
     - [ ] editgroup review: comments? actions?
+    - [ ] file sets and web captures
+- HTTP API Server
+    - [x] base32 encoding of UUID identifiers
+    - [x] inverse many-to-many helpers (files-by-release, release-by-creator)
+    - [ ] Authentication (eg, accounts, OAuth2, JWT)
+    - [ ] Authorization (aka, roles)
 - Web Interface
     - [x] Migrate Python codebase
     - [ ] Creation and editing of all entities
 - Other
+    - [x] Elasticsearch schema
     - [x] Basic logging
     - [x] Swagger-UI
+    - [x] Bulk metadata exports
     - [ ] Sentry (error reporting)
     - [ ] Metrics
-    - [ ] Authentication (eg, accounts, OAuth2, JWT)
-    - [ ] Authorization (aka, roles)
-    - [ ] bot vs. editor
 
 ## Identifiers
 
@@ -2,28 +2,24 @@
 ## Next Up
 
 - basic webface creation, editing, merging, editgroup approval
-- elastic schema/transform for releases; bulk and continuous scripts
 
-## QA Blockers
+## Production blockers
 
 - refactors and correctness in rust/TODO
 - importers have editor accounts and include editgroup metadata
-
-## Production blockers
-
 - enforce single-ident-edit-per-editgroup
     => entity_edit: entity_ident/entity_editgroup should be UNIQ index
     => UPDATE/REPLACE edits?
 - crossref importer sets release_type as "stub" when appropriate
 - re-implement old python tests
-- real auth
+- real authentication and authorization
 - metrics, jwt, config, sentry
 
 ## Metadata Import
 
 - manifest: multiple URLs per SHA1
 - crossref: relations ("is-preprint-of")
-- crossref: two phse: no citations, then matched citations (via DOI table)
+- crossref: two phase: no citations, then matched citations (via DOI table)
 - container import (extra?): lang, region, subject
 - crossref: filter works
     => content-type whitelist
@@ -35,8 +31,10 @@
 new importers:
 - pubmed (medline) (filtered)
     => and/or, use pubmed ID lookups on crossref import
+- arxiv.org
+- DOAJ
 - CORE (filtered)
-- semantic scholar (up to 39 million; author de-dupe)
+- semantic scholar (up to 39 million; includes author de-dupe)
 
 ## Entity/Edit Lifecycle
 
@@ -50,7 +48,7 @@ new importers:
 
 ## Guide / Book / Style
 
-- release_type, release_status, url.rel schemas (and enforce in API?)
+- release_type, release_status, url.rel schemas (enforced in API)
 - more+better terms+policies: https://tosdr.org/index.html
 
 ## Fun Features
@@ -67,12 +65,15 @@ new importers:
 
 ## Schema / Entity Fields
 
+- FileSet and WebSnapshot entities
 - `doi` field for containers (at least for "journal" type; maybe for "series"
   as well?)
 - `retracted`, `translation`, and perhaps `corrected` as flags on releases,
   instead of release_status?
+- 'part-of' relation for releases (release to release) and possibly containers
+- `container-type` field for containers (journal, conference, book series, etc)
 
-## Other
+## Other / Backburner
 
 - refactor openapi schema to use shared response types
 - consider using "HTTP 202: Accepted" for entity-mutating calls
@@ -84,8 +85,7 @@ new importers:
     => proof-of-concept, no tests
 - add_header Strict-Transport-Security "max-age=3600";
     => 12 hours? 24?
-- elastic pipeline
-- kong or oauth2_proxy for auth, rate-limit, etc
+- haproxy for rate-limiting
 - feature flags: consul?
 - secrets: vault?
 - "authn" microservice: https://keratin.tech/
diff --git a/guide/TODO b/guide/TODO
deleted file mode 100644
index 1c9b7110..00000000
--- a/guide/TODO
+++ /dev/null
@@ -1,10 +0,0 @@
-- scope
-
-- quick passes: spellcheck, " I ", "would/will"
-
-TODO
-- roadmap
-- revise 'implementation' page with details (hosting costs, etc)
-
-DONE
-- policies