Diffstat (limited to 'guide/src/implementation.md')
-rw-r--r-- | guide/src/implementation.md | 96 |
1 files changed, 96 insertions, 0 deletions
diff --git a/guide/src/implementation.md b/guide/src/implementation.md
index 66ae7f6b..33a53c21 100644
--- a/guide/src/implementation.md
+++ b/guide/src/implementation.md
@@ -24,3 +24,99 @@ to a rigid third-party ontology or schema.
 Microservice daemons should be able to proxy between the primary API and
 standard protocols like ResourceSync and OAI-PMH, and third party bots could
 ingest or synchronize the database in those formats.
+
+### Fatcat Identifiers
+
+Fatcat identifiers are semantically meaningless fixed-length random numbers,
+usually represented in case-insensitive base32 format. Each entity type has its
+own identifier namespace.
+
+128-bit (UUID size) identifiers encode as 26 characters (but note that not all
+such strings decode to valid UUIDs), and in the backend can be serialized in
+UUID columns:
+
+    work_rzga5b9cd7efgh04iljk8f3jvz
+    https://fatcat.wiki/work/rzga5b9cd7efgh04iljk8f3jvz
+
+In comparison, 96-bit identifiers would have 20 characters and look like:
+
+    work_rzga5b9cd7efgh04iljk
+    https://fatcat.wiki/work/rzga5b9cd7efgh04iljk
+
+and 64-bit:
+
+    work_rzga5b9cd7efg
+    https://fatcat.wiki/work/rzga5b9cd7efg
+
+Fatcat identifiers can be used to interlink between databases, but are explicitly
+*not* intended to supplant DOIs, ISBNs, handles, ARKs, and other "registered"
+persistent identifiers for general use.
+
+### Internal Schema
+
+Internally, identifiers are lightweight pointers to "revisions" of an entity.
+Revisions are stored in their complete form, not as a patch or difference; if
+comparing to distributed version control systems (for managing changes to
+source code), this follows the git model, not the mercurial model.
+
+The entity revisions are immutable once accepted; the editing process involves
+the creation of new entity revisions and, if the edit is approved, pointing the
+identifier to the new revision. Entities cross-reference between themselves by
+*identifier* not *revision number*. Identifier pointers also support
+(versioned) deletion and redirects (for merging entities).
+
+Edit objects represent a change to a single entity; edits get batched together
+into edit groups (like "commits" and "pull requests" in git parlance).
+
+SQL tables look something like this (with separate tables for each entity type a la
+`work_revision` and `work_edit`):
+
+    entity_ident
+        id (uuid)
+        current_revision (entity_revision foreign key)
+        redirect_id (optional; points to another entity_ident)
+        is_live (boolean; whether newly created entity has been accepted)
+
+    entity_revision
+        revision_id
+        <all entity-style-specific fields>
+        extra: json blob for schema evolution
+
+    entity_edit
+        timestamp
+        editgroup_id (editgroup foreign key)
+        ident (entity_ident foreign key)
+        new_revision (entity_revision foreign key)
+        new_redirect (optional; points to entity_ident table)
+        previous_revision (optional; points to entity_revision)
+        extra: json blob for provenance metadata
+
+    editgroup
+        editor_id (editor table foreign key)
+        description
+        extra: json blob for provenance metadata
+
+An individual entity can be in the following "states", from which the given
+actions (transitions) can be made:
+
+- `wip` (not live; not redirect; has rev)
+    - activate (to `active`)
+- `active` (live; not redirect; has rev)
+    - redirect (to `redirect`)
+    - delete (to `deleted`)
+- `redirect` (live; redirect; rev or not)
+    - split (to `active`)
+    - delete (to `deleted`)
+- `deleted` (live; not redirect; no rev)
+    - redirect (to `redirect`)
+    - activate (to `active`)
+
+"WIP, redirect" or "WIP, deleted" are invalid states.
+
+Additional entity-specific columns hold actual metadata. Additional
+tables (which reference both `entity_revision` and `entity_id` foreign
+keys as appropriate) represent things like authorship relationships
+(creator/release), citations between works, etc. Every revision of an entity
+requires duplicating all of these associated rows, which could end up
+being a large source of inefficiency, but is necessary to represent the full
+history of an object.