## In Progress

- basic python tests for editgroup, annotation, submission changes
- python tests for new autoaccept behavior
- python tests for citation table storage efficiency changes
    => should there be a distinction between an empty list and no references?
       yes, eg depending on whether refs are expanded or hidden
    => manual postgres checks that this is working (rough sketch below)
    => also benchmark (both speed and storage efficiency)
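
A minimal sketch of the manual postgres check + benchmark above, assuming a
local database with the fatcat schema loaded; the table/column names
(`release_ref`, `release_rev`) and the DSN are placeholders, not necessarily
what the storage change actually uses:

```python
import time
import psycopg2

def main():
    conn = psycopg2.connect("dbname=fatcat_prod")  # assumed DSN
    cur = conn.cursor()

    # total on-disk size (table + indexes + TOAST) of the citation table
    cur.execute("SELECT pg_size_pretty(pg_total_relation_size('release_ref'))")
    print("release_ref total size:", cur.fetchone()[0])

    # crude speed benchmark: time per-revision reference lookups
    cur.execute("SELECT release_rev FROM release_ref LIMIT 1000")
    revs = [r[0] for r in cur.fetchall()]
    start = time.time()
    for rev in revs:
        cur.execute("SELECT count(*) FROM release_ref WHERE release_rev = %s", (rev,))
        cur.fetchone()
    print("1000 per-revision lookups took %.2fs" % (time.time() - start))

    conn.close()

if __name__ == "__main__":
    main()
```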

## Next Up

- "don't clobber" mode/flag for crossref import (and others?)
- update_file requires 'id'. should it be 'ident'?
    => something different about file vs. release
- guide updates for auth
- refactor webface views to use shared entity_view.html template
- handle 'wip' status entities in web UI
- elastic inserter should handle deletions and redirects; if state isn't
  active, delete the document
    => don't delete, just store state. but need to "blank" redirects and WIP so
       they don't show up in results
    => refactor inserter to be a class (eg, for command line use)
    => end-to-end test of this behavior?
- date handling for releases is pretty bad; those Jan 1 / Dec 31 placeholder
  dates get mangled
    => elastic schema should have a year field (integer)
- document the elastic query date syntax, which is like:
  date:[2018-10-01 TO 2018-12-31] (see the query sketch after this list)
- elastic transform should only include authors, not editors (?)
- webcapture timestamp schema cleanup (both CDX and base)
    => dt.to_rfc3339_opts(SecondsFormat::Secs, true)
    => but this is mostly buried in serialization code?
- fake DOI (use in examples): 10.5555/12345678
- URL location duplication (especially IA/wayback)
    => eg, https://fatcat.wiki/file/2g4sz57j3bgcfpwkgz5bome3re
    => UNIQ index on {release_rev, url}?
- shadow library manifest importer
- import from arabesque output (eg, specific crawls)
- elastic iteration
    => any_abstract broken?
    => blank author names? maybe from crossref import; both fatcat-api and the
       schema should prevent these
- handle very large author/reference lists (instead of dropping)
    => https://api.crossref.org/v1/works/http://dx.doi.org/10.1007/978-3-319-46095-6_7
    => 7000+ authors (!)
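
A rough sketch of the elastic date-syntax / year-field items above; the index
name (`fatcat_release`) and field names (`release_date`, `release_year`) are
assumptions for illustration, not the actual schema:

```python
import requests

ES_URL = "http://localhost:9200/fatcat_release/_search"  # assumed endpoint

query = {
    "query": {
        "query_string": {
            # same syntax as noted above: date:[2018-10-01 TO 2018-12-31]
            "query": "release_date:[2018-10-01 TO 2018-12-31]",
        }
    },
    "size": 5,
}

resp = requests.get(ES_URL, json=query)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    # with a separate integer year field, year-only filtering
    # (eg release_year:2018) would sidestep the Jan 1 / Dec 31 mangling
    print(hit["_source"].get("title"), hit["_source"].get("release_year"))
```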

## Bugs (or at least need tests)

- autoaccept seems to have silently failed to actually merge the editgroup

## Ideas

- more logins: orcid, wikimedia
- `fatcat-auth` tool should support more caveats, both when generating new
  tokens and when mutating existing ones
- fast path to skip recursive redirect checks for bulk inserts
- when getting "wip" entities, require a parameter ("allow_wip"), else get a
  404
- consider dropping CORE identifier
- maybe better 'success' return message? eg, "success: true" flag
- idea: allow users to generate their own editgroup UUIDs, to reduce round
  trips and "hanging" editgroups (created but never edited)
- API: allow deletion of empty, un-accepted editgroups
- refactor API schema for some entity-generic methods (eg, history, edit
  operations) to take the entity type as a URL path param. this would greatly
  reduce macro foolery and method count/complexity, and ease creation of new
  entity types
    => /{entity}/edit/{edit_id}
    => /{entity}/{ident}/redirects
    => /{entity}/{ident}/history
- investigate data quality by looking at, eg, most popular author strings, most
  popular titles, duplicated containers, etc (see the aggregation sketch after
  this list)
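
A quick data-quality poke via elasticsearch terms aggregations (most common
author strings and most duplicated titles), per the last item above; index and
field names are assumptions for illustration:

```python
import requests

ES_URL = "http://localhost:9200/fatcat_release/_search"  # assumed endpoint

aggs_query = {
    "size": 0,
    "aggs": {
        # assumed field names; real mapping may differ
        "top_authors": {"terms": {"field": "contrib_names", "size": 25}},
        "top_titles": {"terms": {"field": "title.keyword", "size": 25}},
    },
}

resp = requests.get(ES_URL, json=aggs_query)
resp.raise_for_status()
aggs = resp.json()["aggregations"]
for bucket in aggs["top_authors"]["buckets"]:
    # very common identical author strings are a hint of import problems
    print(bucket["doc_count"], bucket["key"])
```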

## Production blockers

- privacy policy, and link from: create account, create edit
- update /about page
- refactors and correctness in rust/TODO
- importers: don't insert wayback links with short timestamps

## Production Sanity

- fatcat-web is not Type=simple (systemd)
- postgresql replication
- pg_dump/load test
- haproxy somewhere/somehow
- logging iteration: larger journald buffers? point somewhere?

## Metadata Import

- web.archive.org response doesn't SHA1 match? => need the /<dt>id_/ (raw,
  un-rewritten content) URL variant
- XML etc in metadata
    => (python) tests for these!
    https://qa.fatcat.wiki/release/b3a2jvhvbvc6rlbdkpw4ukuzyi
    https://qa.fatcat.wiki/release/search?q=xmlns
    https://qa.fatcat.wiki/release/search?q=%26amp%3B
    https://qa.fatcat.wiki/release/search?q=%26gt%3B
- better/complete reltypes probably good (eg, list of IRs, academic domain)
- support 'expand' in lookups (derp! needed for single-hit lookups)
- include crossref-capitalized DOI in extra
- some "Elsevier " stuff as publisher
    => also title https://fatcat.wiki/release/uyjzaq3xjnd6tcrqy3vcucczsi
- crossref import: don't store citation "unstructured" if it is empty, ie:
    {"crossref": {"unstructured": ""}}
  (see the cleanup sketch after this list)
- cleaning/matching: https://ftfy.readthedocs.io/en/latest/
    => and try out beautifulsoup (https://stackoverflow.com/a/34532382/4682349)
- manifest: multiple URLs per SHA1
- crossref: relations ("is-preprint-of")
- crossref: two phase: no citations, then matched citations (via DOI table)
- container import (extra?): lang, region, subject
- crossref: filter works
    => content-type whitelist
    => title length and title/slug blacklist
    => at least one author (?)
    => make this a method on Release object
    => or just set release_type as "stub"?
- special "alias" DOIs... in crossref metadata?

new importers:
- pubmed (medline) (filtered)
    => and/or, use pubmed ID lookups on crossref import
- arxiv.org
- DOAJ
- CORE (filtered)
- semantic scholar (up to 39 million; includes author de-dupe)

## Entity/Edit Lifecycle

- commenting and accepting editgroups
- editgroup state machine?

## Guide / Book / Style

- release_type, release_status, url.rel schemas (enforced in API)
- more+better terms+policies: https://tosdr.org/index.html

## Fun Features

- "save paper now"
    => is it in GWB? if not, SPN
    => get hash + url from GWB, verify mimetype acceptable
    => is file in fatcat?
    => what about HBase? GROBID?
    => create edit, redirect user to editgroup submit page
- python client tool and library in pypi
    => or maybe rust?
- bibtex (etc) export
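
A rough outline of the "save paper now" flow above, using the public wayback
availability API and the Save Page Now endpoint; the fatcat-side lookup and
edit creation are left as placeholder comments, not real calls:

```python
import requests

def wayback_lookup(url):
    # public API: returns the closest archived snapshot, if any
    resp = requests.get("https://archive.org/wayback/available", params={"url": url})
    resp.raise_for_status()
    return resp.json().get("archived_snapshots", {}).get("closest")

def save_page_now(url):
    # public SPN endpoint; a production tool would use authenticated SPN
    resp = requests.get("https://web.archive.org/save/" + url)
    resp.raise_for_status()

def save_paper_now(url):
    snap = wayback_lookup(url)
    if not snap:
        save_page_now(url)
        snap = wayback_lookup(url)
    if not snap:
        return None
    # next steps (not implemented here): fetch the capture, hash it, verify
    # the mimetype is acceptable, look the SHA-1 up in the fatcat API, and if
    # missing create an edit and redirect the user to the editgroup submit page
    return snap["url"]

print(save_paper_now("https://example.com/paper.pdf"))
```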

## Metadata Harvesting

- datacite ingest seems to have failed... got a non-HTTP-200 status code, but also "got 50 (161950 of 21084)"

## Schema / Entity Fields

- arxiv_id field (keep flip-flopping)
- original_title field (internationalization, "original language")
- `doi` field for containers (at least for "journal" type; maybe for "series"
  as well?)
- `retracted`, `translation`, and perhaps `corrected` as flags on releases,
  instead of release_status?
- 'part-of' relation for releases (release to release) and possibly containers
- `container_type` field for containers (journal, conference, book series, etc)

## Other / Backburner

- fileset/webcapture webface views (anything at all)
- display abstracts better. no hashes or metadata; prefer plain or HTML,
  convert JATS if necessary
- switch from slog to the simpler pretty_env_logger
- format returned datetimes with only second precision, not millisecond (RFC mode)
    => buried in model serialization internals
- refactor openapi schema to use shared response types
- consider using "HTTP 202: Accepted" for entity-mutating calls
- basic python hbase/elastic matcher (see the proof-of-concept sketch after
  this list)
    => takes sha1 keys
    => checks fatcat API + hbase
    => if not matched yet, tries elastic search
    => simple ~exact match heuristic
    => proof-of-concept, no tests
- add_header Strict-Transport-Security "max-age=3600";
    => currently 1 hour; bump to 12 hours (43200) or 24 (86400)?
- haproxy for rate-limiting
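
A proof-of-concept for the sha1 matcher item above: check the fatcat file
lookup first, then fall back to a rough title match in elastic. The hbase step
is skipped, and the API host, lookup path, and index name are assumptions:

```python
import requests

FATCAT_API = "https://api.qa.fatcat.wiki/v0"              # assumed host
ES_URL = "http://localhost:9200/fatcat_release/_search"   # assumed endpoint

def match_sha1(sha1_hex, title=None):
    # 1) does fatcat already have a file with this sha1?
    resp = requests.get(FATCAT_API + "/file/lookup", params={"sha1": sha1_hex})
    if resp.status_code == 200:
        return ("fatcat-file", resp.json())

    # 2) (placeholder) check hbase for GROBID-extracted metadata here

    # 3) fall back to a simple ~exact title match in elastic
    if title:
        query = {"query": {"match_phrase": {"title": title}}, "size": 1}
        resp = requests.get(ES_URL, json=query)
        resp.raise_for_status()
        hits = resp.json()["hits"]["hits"]
        if hits:
            return ("elastic-title", hits[0]["_source"])

    return (None, None)
```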

better API docs
- readme.io has a free open source plan (or at least used to)
- https://github.com/readmeio/api-explorer
- https://github.com/lord/slate
- https://sourcey.com/spectacle/
- https://github.com/DapperDox/dapperdox

CSL:
- https://citationstyles.org/
- https://github.com/citation-style-language/documentation/blob/master/primer.txt
- https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html
- https://github.com/citation-style-language/schema/blob/master/csl-types.rnc
- perhaps a "create from CSL" endpoint? (minimal CSL-JSON example below)
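
For reference, a minimal CSL-JSON item (per the citeproc-js CSL-JSON docs
linked above) of the kind a hypothetical "create from CSL" endpoint would have
to map onto a release; the field correspondences in comments are only a guess:

```python
csl_item = {
    "id": "example-1",
    "type": "article-journal",                  # ~ release_type
    "title": "An Example Article",              # ~ title
    "author": [
        {"family": "Doe", "given": "Jane"},     # ~ contribs
    ],
    "container-title": "Journal of Examples",   # ~ container lookup
    "issued": {"date-parts": [[2018, 10, 1]]},  # ~ release_date
    "DOI": "10.5555/12345678",                  # fake example DOI from above
}
```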