guide/src/data_model.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169

# Data Model

## Identifiers

A fixed number of first-class "entities" are defined, with common behavior and
schema layouts. These are all be semantic entities like "work", "release",
"container", and "creator".

fatcat identifiers are semantically meaningless fixed-length random numbers,
usually represented in case-insensitive base32 format. Each entity type has its
own identifier namespace.

128-bit (UUID size) identifiers encode as 26 characters (but note that not all
such strings decode to valid UUIDs), and in the backend can be serialized in
UUID columns:

    work_rzga5b9cd7efgh04iljk8f3jvz
    https://fatcat.wiki/work/rzga5b9cd7efgh04iljk8f3jvz

In comparison, 96-bit identifiers would have 20 characters and look like:

    work_rzga5b9cd7efgh04iljk
    https://fatcat.wiki/work/rzga5b9cd7efgh04iljk

A 64-bit namespace would probably be large enought, and would work with
database Integer columns:

    work_rzga5b9cd7efg
    https://fatcat.wiki/work/rzga5b9cd7efg

The idea would be to only have fatcat identifiers be used to interlink between
databases, *not* to supplant DOIs, ISBNs, handle, ARKs, and other "registered"
persistent identifiers.

## Entities and Internal Schema

Internally, identifiers would be lightweight pointers to "revisions" of an
entity. Revisions are stored in their complete form, not as a patch or
difference; if comparing to distributed version control systems, this is the
git model, not the mercurial model.

The entity revisions are immutable once accepted; the editting process involves
the creation of new entity revisions and, if the edit is approved, pointing the
identifier to the new revision. Entities cross-reference between themselves by
*identifier* not *revision number*. Identifier pointers also support
(versioned) deletion and redirects (for merging entities).

Edit objects represent a change to a single entity; edits get batched together
into edit groups (like "commits" and "pull requests" in git parlance).

SQL tables would probably look something like the (but specific to each entity
type, with tables like `work_revision` not `entity_revision`):

    entity_ident
        id (uuid)
        current_revision (entity_revision foreign key)
        redirect_id (optional; points to another entity_ident)

    entity_revision
        revision_id
        <entity-specific fields>
        extra: json blob for schema evolution

    entity_edit
        timestamp
        editgroup_id
        ident (entity_ident foreign key)
        new_revision (entity_revision foreign key)
        previous_revision (optional; points to entity_revision)
        extra: json blob for progeny metadata

    editgroup
        editor_id
        description
        extra: json blob for progeny metadata

Additional entity-specific columns would hold actual metadata. Additional
tables (which would reference both `entity_revision` and `entity_id` foreign
keys as appropriate) would represent things like authorship relationships
(creator/release), citations between works, etc. Every revision of an entity
would require duplicating all of these associated rows, which could end up
being a large source of inefficiency, but is necessary to represent the full
history of an object.

## Ontology

Loosely following FRBR (Functional Requirements for Bibliographic Records), but
removing the "manifestation" abstraction, and favoring files (digital
artifacts) over physical items, the primary entities are:

    work
        <a stub, for grouping releases>

    release (aka "edition", "variant")
        title
        volume/pages/issue/chapter
        media/formfactor
        publication/peer-review status
        language
        <published> date
        <variant-of> work
        <published-in> container
        <has-contributors> creator
        <citation-to> release
        <has> identifier

    file (aka "digital artifact")
        <instantiates> release
        hashes/checksums
        mimetype
        <found-at> URLs

    creator (aka "author")
        name
        identifiers
        aliases

    container (aka "venue", "serial", "title")
        name
        open-access policy
        peer-review policy
        <has> aliases, acronyms
        <about> subject/category
        <has> identifier
        <published-in> container
        <published-by> publisher

## Controlled Vocabularies

Some special namespace tables and enums would probably be helpful; these could
live in the database (not requiring a database migration to update), but should
have more controlled editing workflow... perhaps versioned in the codebase:

- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifers
  themselves)
- subject categorization
- license and open access status
- work "types" (article vs. book chapter vs. proceeding, etc)
- contributor types (author, translator, illustrator, etc)
- human languages
- file mimetypes

These could also be enforced by QA bots that review all editgroups.

## Entity States

    wip (not live; not redirect; has rev)
      activate
    active (live; not redirect; has rev)
      redirect
      delete
    redirect (live; redirect; rev or not)
      split
      delete
    deleted (live; not redirect; no rev)
      redirect
      activate

    "wip redirect" or "wip deleted" are invalid states

## Global Edit Changelog

As part of the process of "accepting" an edit group, a row would be written to
an immutable, append-only log table (which internally could be a SQL table)
documenting each identifier change. This changelog establishes a monotonically
increasing version number for the entire corpus, and should make interaction
with other systems easier (eg, search engines, replicated databases,
alternative storage backends, notification frameworks, etc.).