
## Next Up

- all errors should result in transaction rollback
- test: re-deleting a deleted entity should be 4xx, not 5xx
- test: can't delete an accepted edit
- test: redirect to a WIP row
- test: revert to current version (should be disallowed)
- test: hide/expand in lookups
- test: python get revision endpoint, whether accepted or not
- test: GET redirects endpoint
- test: additional edits, editgroup already accepted
- test: prev_rev in various cases
- test: release pointing to a collection that has been deleted/redirected (etc)
- many webface tests:
    => entity redirects, wip, deletions
    => sub-entity redirects, wip, deletions
- test: new release points to new container; then container deleted from
    editgroup and editgroup accepted
    => also web UI
- handle wip, deleted, redirects in web UI
- remove the concept of "active editgroup"
- elastic inserter should handle deletions and redirects; if state isn't
  active, delete the document
    => and an end-to-end test of this behavior. hoo-boy.
- release_year (in addition to date)
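
A rough sketch of the elastic inserter rule above (non-active state means delete the document). The `entity` dict shape, the `"ident"`/`"state"` keys, and the returned action format are all assumptions for illustration, not the actual inserter interface:

```python
def elastic_action(entity: dict) -> dict:
    """Decide what the search index should do with one entity.

    `entity` is a hypothetical dict with "ident" and "state" keys.
    """
    if entity["state"] == "active":
        # Active entities are (re)indexed as-is.
        return {"op": "index", "id": entity["ident"], "doc": entity}
    # wip, deleted, redirect, etc: the document should not stay searchable.
    return {"op": "delete", "id": entity["ident"]}


if __name__ == "__main__":
    print(elastic_action({"ident": "abc", "state": "active"})["op"])
    print(elastic_action({"ident": "abc", "state": "deleted"})["op"])
```

Returning an action instead of calling the search client directly would keep the decision testable without a running cluster, which might take some of the pain out of the end-to-end test.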

## Ideas

- fast path to skip recursive redirect checks for bulk inserts
- when getting "wip" entities, require a parameter ("allow_wip"), else get a
  404
- consider dropping CORE identifier
- fix returned error messages; should return type (shortname), and then actual
  message/description
- maybe better 'success' return message? eg, "success: true" flag
- idea: allow users to generate their own editgroup UUIDs, to reduce round
  trips and "hanging" editgroups (created but never edited)
- API: allow deletion of empty, un-accepted editgroups
- refactor API schema for some entity-generic methods (eg, history, edit
  operations) to take entity type as a URL path param. greatly reduce macro
  foolery and method count/complexity, and ease creation of new entities
    => /{entity}/edit/{edit_id}
    => /{entity}/{ident}/redirects
    => /{entity}/{ident}/history
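
To make the path-param idea above concrete, a minimal routing sketch. The handler names and the `ENTITY_TYPES` set are placeholders (only types mentioned in this file are listed); the real API is generated from the openapi schema, so this only illustrates the shape of the three routes:

```python
import re

# Placeholder set; the real list of entity types would come from the schema.
ENTITY_TYPES = {"container", "file", "release"}

# One pattern per entity-generic method, instead of one method per entity type.
ROUTES = [
    (re.compile(r"^/(?P<entity>[a-z]+)/edit/(?P<edit_id>[^/]+)$"), "get_edit"),
    (re.compile(r"^/(?P<entity>[a-z]+)/(?P<ident>[^/]+)/redirects$"), "get_redirects"),
    (re.compile(r"^/(?P<entity>[a-z]+)/(?P<ident>[^/]+)/history$"), "get_history"),
]


def route(path: str):
    """Return (handler_name, params) for a path, or None if nothing matches."""
    for pattern, handler in ROUTES:
        m = pattern.match(path)
        if m and m.group("entity") in ENTITY_TYPES:
            return handler, m.groupdict()
    return None


if __name__ == "__main__":
    print(route("/release/abc123/history"))
    print(route("/container/edit/42"))
```

Adding a new entity type would then mean adding one string to the set instead of a new batch of generated methods.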

## Production blockers

- refactors and correctness in rust/TODO
- importers have editor accounts and include editgroup metadata
- crossref importer sets release_type as "stub" when appropriate
- real authentication and authorization
- metrics, jwt, config, sentry
- importers: don't insert wayback links with short timestamps
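
A possible check for the wayback item above, assuming "short" means fewer than the full 14 digits (YYYYMMDDhhmmss) and that links use the `web.archive.org/web/` prefix. The function name `keep_url` is made up for this sketch:

```python
import re

WAYBACK_PREFIX = re.compile(r"^https?://web\.archive\.org/web/")
FULL_TIMESTAMP_LEN = 14  # YYYYMMDDhhmmss


def keep_url(url: str) -> bool:
    """Return False only for wayback links whose timestamp is not full length."""
    m = WAYBACK_PREFIX.match(url)
    if m is None:
        # Not a wayback link; this check has no opinion.
        return True
    # The first path segment after /web/ holds the timestamp, possibly
    # followed by a modifier suffix such as "id_".
    first_segment = url[m.end():].split("/", 1)[0]
    digits = re.match(r"\d*", first_segment).group(0)
    return len(digits) == FULL_TIMESTAMP_LEN


if __name__ == "__main__":
    print(keep_url("https://web.archive.org/web/20180101120000/http://example.com/a.pdf"))
    print(keep_url("https://web.archive.org/web/2018/http://example.com/a.pdf"))
```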

## Metadata Import

- manifest: multiple URLs per SHA1
- crossref: relations ("is-preprint-of")
- crossref: two phase: no citations, then matched citations (via DOI table)
- container import (extra?): lang, region, subject
- crossref: filter works
    => content-type whitelist
    => title length and title/slug blacklist
    => at least one author (?)
    => make this a method on Release object
    => or just set release_type as "stub"?
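
The crossref filter bullets above could look roughly like this. The whitelist, blacklist, and minimum length are placeholder values, and the author requirement is left switchable since it is still an open question. The sketch assumes crossref work records as dicts with a `type` string, a `title` list, and an `author` list:

```python
# Placeholder values; the real lists and threshold still need to be decided.
TYPE_WHITELIST = {"journal-article", "proceedings-article", "book-chapter"}
TITLE_BLACKLIST = {"front matter", "editorial board", "table of contents"}
MIN_TITLE_LEN = 5


def keep_work(work: dict, require_author: bool = True) -> bool:
    """Return True if a crossref work record passes all filters."""
    if work.get("type") not in TYPE_WHITELIST:
        return False
    titles = work.get("title") or []
    title = titles[0].strip() if titles else ""
    if len(title) < MIN_TITLE_LEN:
        return False
    if title.lower() in TITLE_BLACKLIST:
        return False
    if require_author and not work.get("author"):
        return False
    return True


if __name__ == "__main__":
    good = {
        "type": "journal-article",
        "title": ["A Study of Something"],
        "author": [{"family": "Doe"}],
    }
    print(keep_work(good))
    print(keep_work({**good, "title": ["Front Matter"]}))
```

The same checks could instead decide whether to mark the release as a stub rather than drop it, per the last bullet.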

new importers:
- pubmed (medline) (filtered)
    => and/or, use pubmed ID lookups on crossref import
- arxiv.org
- DOAJ
- CORE (filtered)
- semantic scholar (up to 39 million; includes author de-dupe)

## Entity/Edit Lifecycle

- redirects and merges (webface, etc)
- commenting and accepting editgroups
- editgroup state machine?

## Guide / Book / Style

- release_type, release_status, url.rel schemas (enforced in API)
- more+better terms+policies: https://tosdr.org/index.html

## Fun Features

- "save paper now"
    => is it in GWB? if not, SPN
    => get hash + url from GWB, verify mimetype acceptable
    => is file in fatcat?
    => what about HBase? GROBID?
    => create edit, redirect user to editgroup submit page
- python client tool and library in pypi
    => or maybe rust?
- bibtex (etc) export

## Schema / Entity Fields

- arxiv_id field (keep flip-flopping)
- original_title field (?)
- FileSet and WebSnapshot entities
- `doi` field for containers (at least for "journal" type; maybe for "series"
  as well?)
- `retracted`, `translation`, and perhaps `corrected` as flags on releases,
  instead of release_status?
- 'part-of' relation for releases (release to release) and possibly containers
- `container-type` field for containers (journal, conference, book series, etc)

## Other / Backburner

- look at: https://ftfy.readthedocs.io/en/latest/
- refactor openapi schema to use shared response types
- consider using "HTTP 202: Accepted" for entity-mutating calls
- basic python hbase/elastic matcher
  => takes sha1 keys
  => checks fatcat API + hbase
  => if not matched yet, tries elastic search
  => simple ~exact match heuristic
  => proof-of-concept, no tests
- add_header Strict-Transport-Security "max-age=3600";
    => 12 hours? 24?
- haproxy for rate-limiting
- feature flags: consul?
- secrets: vault?
- "authn" microservice: https://keratin.tech/

better API docs
- readme.io has a free open source plan (or at least used to)
- https://github.com/readmeio/api-explorer
- https://github.com/lord/slate
- https://sourcey.com/spectacle/
- https://github.com/DapperDox/dapperdox

CSL:
- https://citationstyles.org/
- https://github.com/citation-style-language/documentation/blob/master/primer.txt
- https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html
- https://github.com/citation-style-language/schema/blob/master/csl-types.rnc
- perhaps a "create from CSL" endpoint?