## In Progress
- update existing 1.5 mil longtail OA PDFs with container/ISSN-L
## Next Up
## Bugs
- did, somehow, end up with web.archive.org/web/None/ URLs (should remove)
- searching 'N/A' is a bug, because not quoted; auto-quote it?
- author (contrib) names not getting included in search (unless explicit)
- fatcat flask lookup ValueError should return 4xx (and message?)
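
For the 'N/A' search bug above, a minimal sketch of auto-quoting, assuming the search backend uses a Lucene/Elasticsearch `query_string`-style syntax in which a bare `/` starts a regular expression. The `auto_quote` helper is hypothetical and not part of the existing code:

```python
def auto_quote(query: str) -> str:
    """Wrap a bare query in double quotes if it contains a slash.

    Hypothetical helper: queries that already contain quotes or explicit
    field syntax (a colon) are left alone, on the assumption that the
    user wrote them deliberately.
    """
    q = query.strip()
    if not q or '"' in q or ":" in q:
        return q
    if "/" in q:
        return f'"{q}"'
    return q


if __name__ == "__main__":
    print(auto_quote("N/A"))
    print(auto_quote("hello world"))
```

This only covers the slash case; other reserved characters would need the same treatment or proper escaping.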
## Next Schema Iteration (0.3.0)
Changes to SQL (and swagger):
- structured names in contribs (given/sur)
- `release_status` => `release_stage`
- `withdrawn_date`, `withdrawn_state`, and retraction as a release stage
- subtitle as a string field
=> but what about translation? `original_subtitle`? just combine them?
=> combine in elasticsearch 'title' field
- size on webcapture CDX lines (we fetch for sha256 anyways, so easy to calculate)
- `ark_id` release identifier
- `mag_id` (microsoft academic graph) release identifier
- releases: 'number' (eg, report numbers) and 'version' (for numbered variants) fields
- missing SQL indices: `ENTITY_edit.editgroup_id, ENTITY_edit.ident_id`
Changes to swagger only:
- edit URLs: `editgroup_id` in URL, not a query param
- changelog API endpoint needs `expand=editors` option
- include 'created' in editgroup object (already in SQL)
## Next Full Release "Touch"
Will update all release entities (or at least all Crossref-derived entities).
Want to minimize edit counts, so will bundle a bunch of changes
- structured contrib names (given, sur)
- reference linking (release-to-release), via crossref DOI refs
- subtitle as string, not array
- remove crossref alt ids that are just the DOI (?)
## Production Public Launch Blockers
- view edit revisions in webface
- audit fatcat metadata for CC-0
- guide updates for auth
- privacy policy, and link from: create account, create edit
## Production Tech Sanity
- postgresql replication
- haproxy somewhere/how
- logging iteration: larger journald buffers? point somewhere?
## Unsorted
- ability to "edit edits" (update in-progress edits)
- review bots:
- tests
- not actually processing work entities
- filter out already reviewed
- handle deletions, merges
- examples of warnings, etc
- missing test coverage (python):
batch create work, fileset, webcapture
delete entity (for each entity type)
delete entity edits (for each entity type)
get entity edit (for each entity type)
get entity redirects (for each entity type)
get entity revision (for each entity type)
get release webcaptures
update editor (?)
update fileset, webcapture
release elastic transform (rich extra)
successful web entity edits (create fresh entities first)
editgroup web submit, accept, annotate
- API: ability to expand containers (and files, etc?) in releases-for-work
- API: /releases endpoint (and/or expansion) for releases-for-file (etc)
- cleanup ./notes/ directory
- links say "Download ..." but open in same page, not download
- workers (like entity updater) should use env vars more
- ansible: ISSN-L download/symlink
- page-one.live.cf.public.springer.com seems to serve up bogus one-pagers; should exclude
- QA sentry has very little host info; also not URL of request
- elastic schemas:
release: drop revision?; container_id; creator_id
should `release_year` be of date type, instead of int?
files: domain list; mimetype; release count; url count; web/publisher/etc;
size; has_md5/sha256/sha1; in_ia, in_shadow
- webface: still need to collapse links by domain better, and also vs. www.x/x
- entity edit JSON objects could include `entity_type`
- refactor 'fatcatd' to 'fatcat-api'
- changelog elastic stuff (is there even a fatcat-export for this?)
- container count "enrich"
- 'hide' flag for exporter (eg, to skip abstracts and refs in some release dumps)
- https://tech.labs.oliverwyman.com/blog/2019/01/14/serialising-rust-tests/
- changelog elastic index (for stats)
- API: allow deletion of empty, un-accepted editgroups
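
For the webface item about collapsing links by domain (including `www.x` vs. `x`), a rough sketch of one possible grouping key. The function names are hypothetical, and stripping only a leading `www.` is an assumption about what "collapse" should mean here:

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_key(url: str) -> str:
    """Return the lowercased hostname with a leading 'www.' removed."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def group_by_domain(urls):
    """Group URLs under their collapsed domain key, preserving order."""
    groups = defaultdict(list)
    for url in urls:
        groups[domain_key(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    links = [
        "https://www.example.com/a.pdf",
        "http://example.com/b.pdf",
        "https://archive.org/c.pdf",
    ]
    for domain, members in group_by_domain(links).items():
        print(domain, len(members))
```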
## Ideas
- `poster` as a `release_type`
- "revert editgroup" mechanism (creates new editgroup)
- can guess some `release_status` of files by looking at wayback date vs.
published date
- ORCID apparently has 37 mil "work activities" (patents, etc), and only 14 mil
unique DOIs; could import those other "work activities"? do they have
identifiers?
- use https://github.com/codelucas/newspaper to extract fulltext+metadata from HTML crawls
- `fatcat-auth` tool should support more caveats, both when generating new or mutating existing tokens
- fast path to skip recursive redirect checks for bulk inserts
- when getting "wip" entities, require a parameter ("allow_wip"), else get a 404
- maybe better 'success' return message? eg, "success: true" flag
- idea: allow users to generate their own editgroup UUIDs, to reduce round
trips and "hanging" editgroups (created but never edited)
- refactor API schema for some entity-generic methods (eg, history, edit
operations) to take entity type as a URL path param. greatly reduce macro
foolery and method count/complexity, and ease creation of new entities
=> /{entity}/edit/{edit_id}
=> /{entity}/{ident}/redirects
=> /{entity}/{ident}/history
- investigate data quality by looking at, eg, most popular author strings, most
popular titles, duplicated containers, etc
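
To illustrate the entity-generic path idea above, a small sketch that builds the three proposed URL shapes from an entity type passed as a parameter. The helper names and the `ENTITY_TYPES` tuple are assumptions for illustration, not the current API:

```python
# Assumed set of entity types, taken from names used elsewhere in these notes.
ENTITY_TYPES = (
    "container", "creator", "file", "fileset",
    "webcapture", "release", "work",
)


def _check_entity(entity: str) -> None:
    if entity not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity}")


def edit_path(entity: str, edit_id: str) -> str:
    """Path of the form /{entity}/edit/{edit_id}."""
    _check_entity(entity)
    return f"/{entity}/edit/{edit_id}"


def ident_path(entity: str, ident: str, action: str) -> str:
    """Path of the form /{entity}/{ident}/redirects or .../history."""
    _check_entity(entity)
    if action not in ("redirects", "history"):
        raise ValueError(f"unknown action: {action}")
    return f"/{entity}/{ident}/{action}"


if __name__ == "__main__":
    print(edit_path("release", "EDIT_ID"))
    print(ident_path("file", "IDENT", "history"))
```

One function per URL shape, rather than one per entity type, is the reduction in method count the note is after.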
## Metadata Import
- 158 "NULL" publishers in journal metadata
- crossref: many ISBNs not getting copied; use python library to convert?
- remove 'first' from contrib crossref 'seq' (not helpful?)
- should probably check for 'jats:' in abstract before setting mimetype, even from crossref
- web.archive.org response not SHA1 match? => need /<dt>id_/ thing
- XML etc in metadata
=> (python) tests for these!
https://qa.fatcat.wiki/release/search?q=xmlns
https://qa.fatcat.wiki/release/search?q=%24gt
- bad/weird titles
"[Blank page]", "blank page"
"Temporary Empty DOI 0"
"ADVERTISEMENT"
"Full title page with Editorial board (with Elsevier tree)"
"Advisory Board Editorial Board"
- better/complete reltypes probably good (eg, list of IRs, academic domain)
- 'expand' in lookups (derp! for single hit lookups)
- include crossref-capitalized DOI in extra
- manifest: multiple URLs per SHA1
- crossref: relations ("is-preprint-of")
- crossref: two phase: no citations, then matched citations (via DOI table)
- special "alias" DOIs... in crossref metadata?
new importers:
- pubmed (medline) (filtered)
=> and/or, use pubmed ID lookups on crossref import
- arxiv.org
- DOAJ
- CORE (filtered)
- semantic scholar (up to 39 million; includes author de-dupe)
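
For the item about checking for 'jats:' before setting an abstract mimetype, a minimal sketch. The function name is hypothetical, and the two mimetype strings are assumptions rather than confirmed values from the importer:

```python
def guess_abstract_mimetype(abstract: str) -> str:
    """Pick a mimetype for an abstract string.

    Only claims JATS XML when a 'jats:'-prefixed tag is actually present;
    everything else falls back to plain text. Both mimetype strings are
    assumed values.
    """
    if "<jats:" in abstract:
        return "application/xml+jats"
    return "text/plain"


if __name__ == "__main__":
    print(guess_abstract_mimetype("<jats:p>Some abstract.</jats:p>"))
    print(guess_abstract_mimetype("Some abstract."))
```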
## Guide / Book / Style
- release_type, release_status, url.rel schemas (enforced in API)
- more+better terms+policies: https://tosdr.org/index.html
## Fun Features
- "save paper now"
=> is it in GWB? if not, SPN
=> get hash + url from GWB, verify mimetype acceptable
=> is file in fatcat?
=> what about HBase? GROBID?
=> create edit, redirect user to editgroup submit page
- python client tool and library in pypi
=> or maybe rust?
- BibTeX (etc) export
## Metadata Harvesting
- datacite ingest seems to have failed... got a non-HTTP-200 status code, but also "got 50 (161950 of 21084)"
## Schema / Entity Fields
- elastic transform should only include authors, not editors (?)
- `retracted`, `translation`, and perhaps `corrected` as flags on releases, instead of release_status?
=> see notes file on retractions, etc
- 'part-of' relation for releases (release to release, eg for book chapters) and possibly containers
- `container_type` for containers (journal, conference, book series, etc)
=> in schema, needs vocabulary and implementation
## API Schema / Design
- refactor entity mutation (CUD) endpoints to be like `/editgroup/{editgroup_id}/release/{ident}`
=> changes editgroup_id from query param to URL param
- refactor bulk POST to include editgroup plus array of entity objects (instead of just a couple fields as query params)
## Web Interface
- include that ISO library to do lang/country name decodes
- container-name when no `container_id`. eg: 10.1016/b978-0-08-037302-7.50022-7
- fileset/webcapture webface anything
## Other / Backburner
- file entity full update with all hashes, file size, corrected/expanded wayback links
=> some number of files *did* get inserted to fatcat with short (year) datetimes, from old manifest. also no file size.
- regression test imports for missing orcid display and journal metadata name
- try out beautifulsoup? (https://stackoverflow.com/a/34532382/4682349)
- `doi` field for containers (at least for "journal" type; maybe for "series" as well?)
- refactor webface views to use shared entity_view.html template
- shadow library manifest importer
- book identifiers: OCLC, openlibrary
- ref from guide: https://creativecommons.org/2012/08/14/library-catalog-metadata-open-licensing-or-public-domain/
- test redirect/delete elasticsearch change
- fake DOI (use in examples): 10.5555/12345678
- refactor elasticsearch inserter to be a class (eg, for command line use)
- document: elastic query date syntax is like: date:[2018-10-01 TO 2018-12-31]
- display abstracts better. no hashes or metadata; prefer plain or HTML,
convert JATS if necessary
- switch from slog to simple pretty_env_log
- format returned datetimes with only second precision, not millisecond (RFC mode)
=> buried in model serialization internals
- refactor openapi schema to use shared response types
- consider using "HTTP 202: Accepted" for entity-mutating calls
- basic python hbase/elastic matcher
=> takes sha1 keys
=> checks fatcat API + hbase
=> if not matched yet, tries elastic search
=> simple ~exact match heuristic
=> proof-of-concept, no tests
- add_header Strict-Transport-Security "max-age=3600";
=> 12 hours? 24?
- haproxy for rate-limiting
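
As a possible starting point for documenting the elastic date query syntax noted above (`date:[2018-10-01 TO 2018-12-31]`), a small sketch that builds such a range clause. The helper name is hypothetical, and the default field name `date` is taken from the example in the note:

```python
from datetime import date


def date_range_query(start: date, end: date, field: str = "date") -> str:
    """Build an inclusive range clause like 'date:[2018-10-01 TO 2018-12-31]'."""
    if end < start:
        raise ValueError("end date is before start date")
    return f"{field}:[{start.isoformat()} TO {end.isoformat()}]"


if __name__ == "__main__":
    print(date_range_query(date(2018, 10, 1), date(2018, 12, 31)))
```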
better API docs
- readme.io has a free open source plan (or at least used to)
- https://github.com/readmeio/api-explorer
- https://github.com/lord/slate
- https://sourcey.com/spectacle/
- https://github.com/DapperDox/dapperdox
CSL:
- https://citationstyles.org/
- https://github.com/citation-style-language/documentation/blob/master/primer.txt
- https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html
- https://github.com/citation-style-language/schema/blob/master/csl-types.rnc
- perhaps a "create from CSL" endpoint?