1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
|
## Next Up
- basic webface creation, editing, merging, editgroup approval
## Production blockers
- refactors and correctness in rust/TODO
- importers have editor accounts and include editgroup metadata
- enforce single-ident-edit-per-editgroup
=> entity_edit: entity_ident/entity_editgroup should be UNIQ index
=> UPDATE/REPLACE edits?
- crossref importer sets release_type as "stub" when appropriate
- re-implement old python tests
- real authentication and authorization
- metrics, jwt, config, sentry
## Metadata Import
- manifest: multiple URLs per SHA1
- crossref: relations ("is-preprint-of")
- crossref: two phase: no citations, then matched citations (via DOI table)
- container import (extra?): lang, region, subject
- crossref: filter works
=> content-type whitelist
=> title length and title/slug blacklist
=> at least one author (?)
=> make this a method on Release object
=> or just set release_stub as "stub"?
new importers:
- pubmed (medline) (filtered)
=> and/or, use pubmed ID lookups on crossref import
- arxiv.org
- DOAJ
- CORE (filtered)
- semantic scholar (up to 39 million; includes author de-dupe)
## Entity/Edit Lifecycle
- redirects and merges (API, webface, etc)
- test: release pointing to a collection that has been deleted/redirected
=> UI crash?
- commenting and accepting editgroups
- editgroup state machine?
- enforce "single ident edit per editgroup"
=> how to "edit an edit"? clobber existing?
## Guide / Book / Style
- release_type, release_status, url.rel schemas (enforced in API)
- more+better terms+policies: https://tosdr.org/index.html
## Fun Features
- "save paper now"
=> is it in GWB? if not, SPN
=> get hash + url from GWB, verify mimetype acceptable
=> is file in fatcat?
=> what about HBase? GROBID?
=> create edit, redirect user to editgroup submit page
- python client tool and library in pypi
=> or maybe rust?
- bibtext (etc) export
## Schema / Entity Fields
- arxiv_id field (keep flip-flopping)
- original_title field (?)
- FileSet and WebSnapshot entities
- `doi` field for containers (at least for "journal" type; maybe for "series"
as well?)
- `retracted`, `translation`, and perhaps `corrected` as flags on releases,
instead of release_status?
- 'part-of' relation for releases (release to release) and possibly containers
- `container-type` field for containers (journal, conference, book series, etc)
## Other / Backburner
- look at: https://ftfy.readthedocs.io/en/latest/
- refactor openapi schema to use shared response types
- consider using "HTTP 202: Accepted" for entity-mutating calls
- basic python hbase/elastic matcher
=> takes sha1 keys
=> checks fatcat API + hbase
=> if not matched yet, tries elastic search
=> simple ~exact match heuristic
=> proof-of-concept, no tests
- add_header Strict-Transport-Security "max-age=3600";
=> 12 hours? 24?
- haproxy for rate-limiting
- feature flags: consul?
- secrets: vault?
- "authn" microservice: https://keratin.tech/
better API docs
- readme.io has a free open source plan (or at least used to)
- https://github.com/readmeio/api-explorer
- https://github.com/lord/slate
- https://sourcey.com/spectacle/
- https://github.com/DapperDox/dapperdox
CSL:
- https://citationstyles.org/
- https://github.com/citation-style-language/documentation/blob/master/primer.txt
- https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html
- https://github.com/citation-style-language/schema/blob/master/csl-types.rnc
- perhaps a "create from CSL" endpoint?
|