aboutsummaryrefslogtreecommitdiffstats
path: root/TODO
blob: 7692444bf09d8aab383c8b259c804fc32d4eb6f5 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130

## In Progress

## Next Up

- fileset/webcapture entities
- authentication
- remove the concept of "active editgroup", and simplify autoaccept batch path
- fix returned error messages; should return type (shortname), and then actual
  message/description
- handle wip entities in web UI
- elastic inserter should handle deletions and redirects; if state isn't
  active, delete the document
    => don't delete, just store state. but need to "blank" redirects and WIP so
       they don't show up in results
    => refactor inserter to be a class (eg, for command line use)
    => end-to-end test of this behavior?

## Ideas

- fast path to skip recursive redirect checks for bulk inserts
- when getting "wip" entities, require a parameter ("allow_wip"), else get a
  404
- consider dropping CORE identifier
- maybe better 'success' return message? eg, "success: true" flag
- idea: allow users to generate their own editgroup UUIDs, to reduce a round
  trips and "hanging" editgroups (created but never edited)
- API: allow deletion of empty, un-accepted editgroups
- refactor API schema for some entity-generic methos (eg, history, edit
  operations) to take entity type as a URL path param. greatly reduce macro
  foolery and method count/complexity, and ease creation of new entities
    => /{entity}/edit/{edit_id}
    => /{entity}/{ident}/redirects
    => /{entity}/{ident}/history

## Production blockers

- refactors and correctness in rust/TODO
- importers have editor accounts and include editgroup metadata
- crossref importer sets release_type as "stub" when appropriate
- real authentication and authorization
- metrics, jwt, config, sentry
- importers:  don't insert wayback links with short timestamps

## Metadata Import

- cleaning/matching: https://ftfy.readthedocs.io/en/latest/
- manifest: multiple URLs per SHA1
- crossref: relations ("is-preprint-of")
- crossref: two phase: no citations, then matched citations (via DOI table)
- container import (extra?): lang, region, subject
- crossref: filter works
    => content-type whitelist
    => title length and title/slug blacklist
    => at least one author (?)
    => make this a method on Release object
    => or just set release_type as "stub"?

new importers:
- pubmed (medline) (filtered)
    => and/or, use pubmed ID lookups on crossref import
- arxiv.org
- DOAJ
- CORE (filtered)
- semantic scholar (up to 39 million; includes author de-dupe)

## Entity/Edit Lifecycle

- commenting and accepting editgroups
- editgroup state machine?

## Guide / Book / Style

- release_type, release_status, url.rel schemas (enforced in API)
- more+better terms+policies: https://tosdr.org/index.html

## Fun Features

- "save paper now"
    => is it in GWB? if not, SPN
    => get hash + url from GWB, verify mimetype acceptable
    => is file in fatcat?
    => what about HBase? GROBID?
    => create edit, redirect user to editgroup submit page
- python client tool and library in pypi
    => or maybe rust?
- bibtext (etc) export

## Schema / Entity Fields

- arxiv_id field (keep flip-flopping)
- original_title field (?)
- FileSet and WebCapture entities
- `doi` field for containers (at least for "journal" type; maybe for "series"
  as well?)
- `retracted`, `translation`, and perhaps `corrected` as flags on releases,
  instead of release_status?
- 'part-of' relation for releases (release to release) and possibly containers
- `container-type` field for containers (journal, conference, book series, etc)

## Other / Backburner

- refactor openapi schema to use shared response types
- consider using "HTTP 202: Accepted" for entity-mutating calls
- basic python hbase/elastic matcher
  => takes sha1 keys
  => checks fatcat API + hbase
  => if not matched yet, tries elastic search
  => simple ~exact match heuristic
  => proof-of-concept, no tests
- add_header Strict-Transport-Security "max-age=3600";
    => 12 hours? 24?
- haproxy for rate-limiting
- feature flags: consul?
- secrets: vault?
- "authn" microservice: https://keratin.tech/

better API docs
- readme.io has a free open source plan (or at least used to)
- https://github.com/readmeio/api-explorer
- https://github.com/lord/slate
- https://sourcey.com/spectacle/
- https://github.com/DapperDox/dapperdox

CSL:
- https://citationstyles.org/
- https://github.com/citation-style-language/documentation/blob/master/primer.txt
- https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html
- https://github.com/citation-style-language/schema/blob/master/csl-types.rnc
- perhaps a "create from CSL" endpoint?