summaryrefslogtreecommitdiffstats
path: root/rfc.md
blob: 1b0904436cd91952a3364e4c9af3f09437753e28 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154

fatcat is a half-baked idea to build an open, independent, collaboratively
editable bibliographic database of most written works.

## Technical Architecture

The canonical backend datastore would be a very large transactional SQL server.
A relatively simple and stable back-end daemon would expose an API (could be
REST, GraphQL, gRPC, etc). As little "application logic" as possible would be
embedded in this back-end; as much as possible would be pushed to bots which
could be authored and operated by anybody. A separate web interface project
would talk to the API backend and could be developed more rapidly.

## Editing Workflow and Bots

Both human editors and bots would have edits go through the same API, with
humans using either the default web interface or arbitrary integrations or
client software.

The usual workflow would be to create edits (or creations, merges, deletions)
to individual entities one at a time, all under a single "edit group" of
related edits (eg, correcting authorship info for multiple works related to a
single author). When ready, the editor would "submit" the edit group for
review. During the review period, humans could vote (or veto/approve if they
have higher permissions), and bots can perform automated checks. During this
period the editor can make tweaks if necessary. After some fixed time period
(72 hours?) with no changes and no blocking issues, the edit group would be
auto-accepted, if no auto-resolvable merge-conflicts have arisen. This process
balances editing labor (reviews are easy, but optional) against quality
(cool-down period makes it easier to detect and prevent spam or out-of-control
bots). Advanced permissions could allow some trusted human and bot editors to
push through edits more rapidly.

Bots would need to be tuned to have appropriate edit group sizes (eg, daily
batches, instead of millions of works in a single edit) to make human QA and
reverts possible.

Data progeny and citation would be left to the edit history. In the case of
importing external databases, the expectation would be that special-purpose
bot accounts would be used. Human editors would leave edit messages to clarify
their sources.

A style guide (wiki), chat room, and discussion forum would be hosted as
separate stand-alone services for editors to propose projects and debate
process or scope changes. It would be best if these could use federated account
authorization (oauth?) to have consistent account IDs across mediums.

## Itentifiers

A fixed number of first class "entities" would be definied, with common
behavior and schema layouts. These would all be semantic entities like "work",
"edition", "container", and "person".

fatcat identifiers would be semanticly meaningless fixed length random numbers,
usually represented in case-insensitive base32 format. Each entity type would
have it's own identifier namespace. Eg, 96 bit identifiers would have 26
characters and look like:

    fcwork_rzga5b9cd7efgh04iljk

As a URL:

    https://fatcat.org/work/rzga5b9cd7efgh04iljk

A 64 bit namespace is probably plenty though:

    fcwork_rzga5b9cd7efg
    https://fatcat.org/work/rzga5b9cd7efg

The idea would be to only have fatcat identifiers be used to interlink between
databases, *not* to supplant DOIs, ISBNs, handle, ARKs, and other "registered"
persistant identifiers.

## Entities and Internal Schema

Internally, identifiers would be lightweight pointers to actual metadata
objects, which can be thought of as "versions". The metadata objects themselves
would be immutable once commited; the edit process is one of creating new
objects and, if the edit is approved, pointing the identifier to the new
version. Entities would reference between themselves by identifier.

Edit objects represent a change to a single entity; edits get batched together
into edit groups (like "commits" and "pull requests" in git parlance).

SQL tables would probably look something like the following, though be specific
to each entity type (eg, there would be an actual `work_revision` table, but
not an actual `entity_revision` table):

    entity_id
        uuid
        current_revision

    entity_revision
        entity_id (bi-directional?)
        previous: entity_revision or none
        state: normal, redirect, deletion
        redirect_entity_id: optional
        extra: json blob
        edit_id

    edit
        mutable: boolean
        edit_group
        editor

    edit_group

Additional type-specific columns would hold actual metadata. Additional tables
(which would reference both `entity_revision` and `entity_id` foreign keys as
appropriate) would represent things like external identifiers, ordered
author/work relationships, citations between works, etc. Every revision of an
entity would require duplicating all of these associated rows, which could end
up being a large source of inefficiency, but is necessary to represent the full
history of an object.

## Scope

Want the "scholarly web": the graph of works that cite other works. Certainly
every work that is cited more than once and every work that both cites and is
cited; "leaf nodes" and small islands might not be in scope.

Focusing on written works, with some exceptions. Expect core media (for which we would pursue "completeness") to be:

    journal articles
    books
    conference proceedings
    technical memos
    dissertations

Probably in scope:

    reports
    magazine articles
    published poetry
    essays
    government documents
    conference
    presentations (slides, video)

Probably not:

    patents
    court cases and legal documents
    manuals
    datasheets
    courses

Definitely not:

    audio recordings
    tv show episodes
    musical scores
    advertisements