TODO.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227


## In Progress

## Prod Metadata Checks

x edit and editgroup metadata
x crossref citation not saving 'article-title' or 'unstructured', and 'author'
  should be 'authors' (list)
x crossref not saving 'language' (looks like iso code already)
- longtail_oa flag getting set on GROBID imports
- grobid reference should be under extra (not nested): issue, volume, authors
- uniqueness of:
    sha1 - via SQL dump
    doi - via SQL dump
    issnl - via JSON dump
    orcid - via JSON dump

notes:
- crossref references look great!
- extra/crossref/alternative-id often includes exact full DOI
        10.1158/1538-7445.AM10-3529
        10.1158/1538-7445.am10-3529
    => but not always? publisher-specific
- contribs[]/extra/seq often has "first" from crossref
    => is this helpful?
- abstracts content is fine, but should probably check for "jats:" when setting
  mimetype
x BUG: `license_slug` when https://creativecommons.org/licenses/by-nc-sa/4.0
    => https://api.qa.fatcat.wiki/v0/release/55y37c3dtfcw3nw5owugwwhave
       10.26891/jik.v10i2.2016.92-97
- original title works, yay!
    https://api.qa.fatcat.wiki/v0/release/nlmnplhrgbdalcy472hfb2z3im
    10.2504/kds.26.358
- new license: https://www.karger.com/Services/SiteLicenses
- not copying ISBNs: 10.1016/b978-0-08-037302-7.50022-7
    "9780080373027"
    could at least put in alternative-id?
- BUG: subtitle coming through as an array, not string
- `license_slug` does get set
    eg for PLOS ONE http://creativecommons.org/licenses/by/4.0/

## Next Up

- bootstrap_bots script should set -ex and output admin and webface tokens
- regression test imports for missing orcid display and journal metadata name
- serveral tweaks/fixes to webface (eg, container metadata schema changed)
- container count "enrich"
- changelog elastic stuff (is there even a fatcat-export for this?)
- QA sentry has very little host info; also not URL of request
- start prod crossref harvesting (from ~start of 2019)
- 158 "NULL" publishers in journal metadata
- should elastic release_year be of date type, instead of int?
- QA/prod needs updated credentials
- ansible: ISSN-L download/symlink
- searching 'N/A' is a bug
- formalize release_status:
    => https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings
- entity edit JSON objects could include `entity_type`

## Production public launch blockers

- handle 'wip' status entities in web UI
- guide updates for auth
- privacy policy, and link from: create account, create edit
- refactors and correctness in rust/TODO
- update /about page

## Production Tech Sanity

- postgresql replication
- pg_dump/load test
- haproxy somewhere/how
- logging iteration: larger journald buffers? point somewhere?

## Ideas

- ORCID apparently has 37 mil "work activities" (patents, etc), and only 14 mil
  unique DOIs; could import those other "work activities"? do they have
  identifiers?
- write up notes on biblio metadata in general
    => "extensibility" and extra keys
    => proliferation of arrays vs. concrete values
    => various ways to record history/progeny
    => "subtitle", "short-title", "full-title" complexity
    => human names
    => translated metadata: titles/names/abstracts
    => "typing" for metadata (eg, math in titles)
- 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps)
- https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/
- use https://github.com/codelucas/newspaper to extract fulltext+metadata from
  HTML crawls
- changelog elastic index (for stats)
- import from arabesque output (eg, specific crawls)
- more logins: orcid, wikimedia
- `fatcat-auth` tool should support more caveats, both when generating new or
  mutating existing tokens
- fast path to skip recursive redirect checks for bulk inserts
- when getting "wip" entities, require a parameter ("allow_wip"), else get a
  404
- consider dropping CORE identifier
- maybe better 'success' return message? eg, "success: true" flag
- idea: allow users to generate their own editgroup UUIDs, to reduce a round
  trips and "hanging" editgroups (created but never edited)
- API: allow deletion of empty, un-accepted editgroups
- refactor API schema for some entity-generic methos (eg, history, edit
  operations) to take entity type as a URL path param. greatly reduce macro
  foolery and method count/complexity, and ease creation of new entities
    => /{entity}/edit/{edit_id}
    => /{entity}/{ident}/redirects
    => /{entity}/{ident}/history
- investigate data quality by looking at, eg, most popular author strings, most
  popular titles, duplicated containers, etc

## Metadata Import

- web.archive.org response not SHA1 match? => need /<dt>id_/ thing
- XML etc in metadata
    => (python) tests for these!
    https://qa.fatcat.wiki/release/search?q=xmlns
    https://qa.fatcat.wiki/release/search?q=%24gt
- bad/weird titles
    "[Blank page]", "blank page"
    "Temporary Empty DOI 0"
    "ADVERTISEMENT"
    "Full title page with Editorial board (with Elsevier tree)"
    "Advisory Board Editorial Board"
- better/complete reltypes probably good (eg, list of IRs, academic domain)
- 'expand' in lookups (derp! for single hit lookups)
- include crossref-capitalized DOI in extra
- some "Elsevier " stuff as publisher
    => also title https://fatcat.wiki/release/uyjzaq3xjnd6tcrqy3vcucczsi
- crossref import: don't store citation unstructured if len() == 0:
    {"crossref": {"unstructured": ""}}
- try out beautifulsoup? (https://stackoverflow.com/a/34532382/4682349)
- manifest: multiple URLs per SHA1
- crossref: relations ("is-preprint-of")
- crossref: two phase: no citations, then matched citations (via DOI table)
- special "alias" DOIs... in crossref metadata?

new importers:
- pubmed (medline) (filtered)
    => and/or, use pubmed ID lookups on crossref import
- arxiv.org
- DOAJ
- CORE (filtered)
- semantic scholar (up to 39 million; includes author de-dupe)

## Guide / Book / Style

- release_type, release_status, url.rel schemas (enforced in API)
- more+better terms+policies: https://tosdr.org/index.html

## Fun Features

- "save paper now"
    => is it in GWB? if not, SPN
    => get hash + url from GWB, verify mimetype acceptable
    => is file in fatcat?
    => what about HBase? GROBID?
    => create edit, redirect user to editgroup submit page
- python client tool and library in pypi
    => or maybe rust?
- bibtext (etc) export

## Metadata Harvesting

- datacite ingest seems to have failed... got a non-HTTP-200 status code, but also "got 50 (161950 of 21084)"

## Schema / Entity Fields

- elastic transform should only include authors, not editors (?)
- `doi` field for containers (at least for "journal" type; maybe for "series"
  as well?)
- `retracted`, `translation`, and perhaps `corrected` as flags on releases,
  instead of release_status?
    => see notes file on retractions, etc
- 'part-of' relation for releases (release to release, eg for book chapters)
  and possibly containers
- `container_type` for containers (journal, conference, book series, etc)
    => in schema, needs vocabulary and implementation

## Web Interface

- include that ISO library to do lang/country name decodes
- container-name when no `container_id`. eg: 10.1016/b978-0-08-037302-7.50022-7

## Other / Backburner

- refactor webface views to use shared entity_view.html template
- shadow library manifest importer
- book identifiers: OCLC, openlibrary
- ref from guide: https://creativecommons.org/2012/08/14/library-catalog-metadata-open-licensing-or-public-domain/
- test redirect/delete elasticsearch change
- fake DOI (use in examples): 10.5555/12345678
- refactor elasticsearch inserter to be a class (eg, for command line use)
- document: elastic query date syntax is like: date:[2018-10-01 TO 2018-12-31]
- fileset/webcapture webface anything
- display abstracts better. no hashes or metadata; prefer plain or HTML,
  convert JATS if necessary
- switch from slog to simple pretty_env_log
- format returned datetimes with only second precision, not millisecond (RFC mode)
    => burried in model serialization internals
- refactor openapi schema to use shared response types
- consider using "HTTP 202: Accepted" for entity-mutating calls
- basic python hbase/elastic matcher
  => takes sha1 keys
  => checks fatcat API + hbase
  => if not matched yet, tries elastic search
  => simple ~exact match heuristic
  => proof-of-concept, no tests
- add_header Strict-Transport-Security "max-age=3600";
    => 12 hours? 24?
- haproxy for rate-limiting

better API docs
- readme.io has a free open source plan (or at least used to)
- https://github.com/readmeio/api-explorer
- https://github.com/lord/slate
- https://sourcey.com/spectacle/
- https://github.com/DapperDox/dapperdox

CSL:
- https://citationstyles.org/
- https://github.com/citation-style-language/documentation/blob/master/primer.txt
- https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html
- https://github.com/citation-style-language/schema/blob/master/csl-types.rnc
- perhaps a "create from CSL" endpoint?