
## Next Up

bugs:
- test: release pointing to a collection that has been deleted/redirected
  => UI crash?

schema:
- primary key types
    => idents as base32
    => editor_id and editgroup as idents
    => revisions as UUID
- multiple URLs per file
    => {type, url} table; display code to choose "best"
    => web, repo, webarchive, shadow (?)
- external idents (as columns)
    => pm_id
    => pmc_id
    => wikidata_id (creator, release, container)
    => oclc_id
    => viaf_id (creator)
- release_ref
    => 'raw'/'extra' json column
        => title
        => url
        => doi
        => etc...
    => citation ID (`oci_id`)
    => release_id
- release_contrib
    => add 'raw' json column? or just extra?
- abstracts
    => new table; primary key SHA-1
    => release has multiple: {markup, lang, abstract_sha1}
- other changes (see notebook)
    => parent rev in edit table
    => timestamp columns
- "container" -> "venue"?
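
Sketch for the "idents as base32" item above. This assumes idents are 128-bit UUIDs under the hood and that the public form is the unpadded, lowercase RFC 4648 base32 encoding (26 characters); both are assumptions, and the function names `uuid_to_ident` / `ident_to_uuid` are made up for illustration.

```python
import base64
import uuid


def uuid_to_ident(u: uuid.UUID) -> str:
    """Encode the 16 UUID bytes as 26 lowercase base32 characters.

    b32encode pads 16 bytes out to 32 characters with six '='; the padding
    is stripped because the length is fixed and known.
    """
    return base64.b32encode(u.bytes).decode("ascii").rstrip("=").lower()


def ident_to_uuid(ident: str) -> uuid.UUID:
    """Inverse of uuid_to_ident: restore the padding, then decode."""
    if len(ident) != 26:
        raise ValueError("expected a 26-character base32 ident")
    raw = base64.b32decode(ident.upper() + "======")
    return uuid.UUID(bytes=raw)


if __name__ == "__main__":
    u = uuid.uuid4()
    ident = uuid_to_ident(u)
    print(u, "->", ident)
    assert ident_to_uuid(ident) == u
```

Storing the raw UUID in the database and converting only at the API boundary would keep the primary key column a native type.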

features:
- fast database dump command: both changelog-based and entity-based (rust)

importers:
- pubmed (medline)
- core
- semantic scholar (up to 39 million; author de-dupe)
- wikidata (if they have a dump)

other:
- update RFC
- basic python hbase/elastic matcher
  => takes sha1 keys
  => checks fatcat API + hbase
  => if not matched yet, tries elastic search
  => simple ~exact match heuristic
  => proof-of-concept, no tests
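
Rough shape of the matcher steps above, as a sketch only. The three lookups are passed in as plain callables (`fatcat_lookup`, `hbase_get`, `elastic_search`); these names, the dict fields (`release_id`, `title`), and the "unique normalized title" rule are all assumptions for illustration, not real client APIs.

```python
from typing import Callable, List, Optional


def normalize(title: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def match_sha1(
    sha1: str,
    fatcat_lookup: Callable[[str], Optional[dict]],
    hbase_get: Callable[[str], Optional[dict]],
    elastic_search: Callable[[str], List[dict]],
) -> Optional[str]:
    """Return a release_id for the file with this SHA-1, or None."""
    # 1. Already matched in fatcat? Then there is nothing to do.
    known = fatcat_lookup(sha1)
    if known is not None:
        return known["release_id"]

    # 2. Pull whatever metadata hbase has for this file (hypothetical row shape).
    row = hbase_get(sha1)
    if row is None or not row.get("title"):
        return None

    # 3. Fall back to search and keep only ~exact title matches.
    want = normalize(row["title"])
    hits = [
        h for h in elastic_search(row["title"])
        if normalize(h.get("title", "")) == want
    ]

    # 4. Accept only an unambiguous result.
    return hits[0]["release_id"] if len(hits) == 1 else None
```

Injecting the lookups keeps the heuristic itself free of network code, so the real API, hbase, and search clients can be wired in later.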


## Schema / Alignment / Scope

- abstracts! as files? separate table? format (latex, html, etc)?
    => crossref has ~13% as JATS; plus pubmed, plus arxiv
- work_type, release_type, release_status

name ref: https://www.w3.org/International/questions/qa-personal-names

## High-Level Priorities

- full database dump and reload (import/export)
- manual editing of containers and releases (web interface)

## Web UI

- changelog styled more like a feed (https://semantic-ui.com/views/feed.html)?
- instead of grid, maybe https://semantic-ui.com/elements/rail.html

## Performance

- write pure-rust "benchmark" scripts that hit, e.g., the lookup and batch
  endpoints. run these with auto_explain on, then check the logs on the dev
  machine
- batch inserts automerge: create editgroup and changelog, mark all edits as
  accepted, all in a single transaction
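
Sketch of the single-transaction automerge idea above. It uses Python's `sqlite3` as a stand-in for postgres, and the table and column names (`editgroup`, `changelog`, `release_edit`, `accepted`) are invented for the example rather than taken from the real schema.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
SCHEMA = """
CREATE TABLE editgroup (
    id INTEGER PRIMARY KEY,
    editor_id TEXT NOT NULL
);
CREATE TABLE changelog (
    id INTEGER PRIMARY KEY,
    editgroup_id INTEGER NOT NULL REFERENCES editgroup(id)
);
CREATE TABLE release_edit (
    id INTEGER PRIMARY KEY,
    editgroup_id INTEGER NOT NULL REFERENCES editgroup(id),
    title TEXT NOT NULL,
    accepted INTEGER NOT NULL DEFAULT 0
);
"""


def batch_insert_automerge(conn: sqlite3.Connection, editor_id: str, titles: list) -> int:
    """Insert a batch as one editgroup, pre-accepted, in one transaction."""
    # The connection context manager commits if the block succeeds and
    # rolls back if it raises, so a bad row leaves no partial batch behind.
    with conn:
        cur = conn.execute(
            "INSERT INTO editgroup (editor_id) VALUES (?)", (editor_id,)
        )
        editgroup_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO release_edit (editgroup_id, title, accepted) VALUES (?, ?, 1)",
            [(editgroup_id, t) for t in titles],
        )
        conn.execute(
            "INSERT INTO changelog (editgroup_id) VALUES (?)", (editgroup_id,)
        )
    return editgroup_id


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    eg = batch_insert_automerge(conn, "importer-bot", ["paper one", "paper two"])
    print("accepted editgroup", eg)
```

The expected win is one commit per batch instead of one per entity, plus skipping the separate accept pass.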

## API

- hydrate entities in API
    ? "expand" query param
    ? "full entity" field
    ? refactor file_releases to have objects as type
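
One possible reading of the "expand" query param option above, as a sketch. It assumes entities are dicts that reference others through `<name>_id` fields and that a `fetch(kind, ident)` helper exists; `parse_expand`, `hydrate`, and `fetch` are hypothetical names.

```python
from typing import Callable, List, Optional


def parse_expand(param: Optional[str]) -> List[str]:
    """Split a comma-separated ?expand= value into field names."""
    if not param:
        return []
    return [p.strip() for p in param.split(",") if p.strip()]


def hydrate(entity: dict, expand: List[str], fetch: Callable[[str, str], dict]) -> dict:
    """Return a copy of entity with requested references filled in.

    For each name in expand, '<name>_id' is looked up via fetch() and the
    full object is attached under '<name>'. The id field is kept, and
    names without a matching id are ignored.
    """
    out = dict(entity)
    for name in expand:
        ident = out.get(name + "_id")
        if ident is not None:
            out[name] = fetch(name, ident)
    return out


if __name__ == "__main__":
    containers = {"c1": {"name": "Journal of Examples"}}
    release = {"title": "A Paper", "container_id": "c1"}
    full = hydrate(release, parse_expand("container"), lambda kind, i: containers[i])
    print(full)
```

Keeping hydration opt-in leaves the default responses small; a separate "full entity" field would be the alternative listed above.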

## Other

- schema.org metadata in webface
- bulk endpoint auto-merge mode (huge postgres speedup on import)
- elastic pipeline
- kong or oauth2_proxy for auth, rate-limit, etc
- "authn" microservice: https://keratin.tech/
- PUT for mid-edit revisions
- 'parent rev' for revisions (vs. container parent)
- "submit" status for editgroups?

review:
- what does openlibrary API look like?
x add a 'live' (or 'immutable') flag to revision tables

CSL:
- https://citationstyles.org/
- https://github.com/citation-style-language/documentation/blob/master/primer.txt
- https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html
- https://github.com/citation-style-language/schema/blob/master/csl-types.rnc
- perhaps a "create from CSL" endpoint?