Diffstat (limited to 'guide/src/data_model.md')
-rw-r--r-- | guide/src/data_model.md | 220 |
1 files changed, 124 insertions, 96 deletions
diff --git a/guide/src/data_model.md b/guide/src/data_model.md
index b2a02688..f3b9b35a 100644
--- a/guide/src/data_model.md
+++ b/guide/src/data_model.md
@@ -1,12 +1,73 @@
 # Data Model
 
-## Identifiers
-
-A fixed number of first-class "entities" are defined, with common behavior and
-schema layouts. These are all be semantic entities like "work", "release",
-"container", and "creator".
-
-fatcat identifiers are semantically meaningless fixed-length random numbers,
+## Entity Types and Ontology
+
+Loosely following "Functional Requirements for Bibliographic Records" (FRBR),
+but removing the "manifestation" abstraction, and favoring files (digital
+artifacts) over physical items, the primary bibliographic entity types are:
+
+- `work`: representing an abstract unit of creative output. Does not contain
+  any metadata itself; used only to group `release` entities. For example, a
+  journal article could be posted as a pre-print, published on a journal
+  website, translated into multiple languages, and then re-published (with
+  minimal changes) as a book chapter; these would all be variants of the same
+  `work`.
+- `release`: a specific "release" or "publicly published" (in a formal or
+  informal sense) version of a work. Contains traditional bibliographic
+  metadata (title, date of publiction, media type, language, etc). Has
+  relationships to other entities:
+    - "variant of" a single `work`
+    - "contributed to by" multiple `creators`
+    - "references to" (cites) multiple `releases`
+    - "published as part of" a single `container`
+- `file`: a single concrete, fixed ditigal artifact; a manifestation of one or
+  more `releases`. Machine-verifiable metadata includes file hashes, size, and
+  detected file format. Verified URLs link to locations on the open web where
+  this file can be found or has been archived. Has relationships:
+    - "manifestation of" multiple `releases` (though usually a single release)
+- `creator`: persona (pseudonym, group, or specific human name) that
+  contributions to `releases` have been attributed to. Not necessarily
+  one-to-one with a human person.
+- `container` (aka "venue", "serial", "title"): a grouping of releases from a
+  single publisher.
+
+Note that, compared to many similar bibliographic ontologies, the current one
+does not have entities to represent:
+
+- funding sources
+- publishing entities
+- "events at a time and place"
+- physical artifacts, either generically or specific copies
+- sets of files (eg, a dataset or webpage with media)
+
+Each entity type has it's own relations and fields (captured in a schema), but
+there are are also generic operations and fields common across all entities.
+The process of creating, updating, querying, and inspecting entities is roughly
+the same regardless of type.
+
+## Identifiers and Revisions
+
+A specific version of any entity in the catalog is called a "revision".
+Revisions are generally immutable (do not change and are not editable), and are
+not usually refered to directly by users. Instead, persistent identifiers can
+be created, which "point to" a specific revsiion at a time. This distinction
+means that entities refered to by an identifier can change over time (as
+metadata is corrected and expanded). Revision objects do not "point" back to
+specific identifiers, so they are not the same as a simple "version number" for
+an identifier.
+
+Identifiers also have the ability to be merged (by redirecting one identifier
+to another) and "deleted" (by pointing the identifier to no revision at all).
+All changes to identifiers are captured as an "edit" object. Edit history can
+be fetched and inspected on a per-identifier basis, and any changes can easily
+be reverted (even merges/redirects and "deletion").
+
+"Staged" or "proposed" changes are captured as edit objects without updating
+the identifers themselves.
+
+### Fatcat Identifiers
+
+Fatcat identifiers are semantically meaningless fixed-length random numbers,
 usually represented in case-insensitive base32 format. Each entity type has its
 own identifier namespace.
 
@@ -28,16 +89,18 @@ database Integer columns:
     work_rzga5b9cd7efg
     https://fatcat.wiki/work/rzga5b9cd7efg
 
-The idea would be to only have fatcat identifiers be used to interlink between
-databases, *not* to supplant DOIs, ISBNs, handle, ARKs, and other "registered"
+Fatcat identifiers can used to interlink between databases, but are explicitly
+*not* intended to supplant DOIs, ISBNs, handle, ARKs, and other "registered"
 persistent identifiers.
 
-## Entities and Internal Schema
+### Entity States
+
+### Internal Schema
 
-Internally, identifiers would be lightweight pointers to "revisions" of an
-entity. Revisions are stored in their complete form, not as a patch or
-difference; if comparing to distributed version control systems, this is the
-git model, not the mercurial model.
+Internally, identifiers are lightweight pointers to "revisions" of an entity.
+Revisions are stored in their complete form, not as a patch or difference; if
+comparing to distributed version control systems (for managing changes to
+source code), this follows the git model, not the mercurial model.
 
 The entity revisions are immutable once accepted; the editting process involves
 the creation of new entity revisions and, if the edit is approved, pointing the
@@ -48,122 +111,87 @@ identifier to the new revision. Entities cross-reference between themselves by
 Edit objects represent a change to a single entity; edits get batched together
 into edit groups (like "commits" and "pull requests" in git parlance).
 
-SQL tables would probably look something like the (but specific to each entity
-type, with tables like `work_revision` not `entity_revision`):
+SQL tables look something like this (with separate tables for entity type a la
+`work_revision` and `work_edit`):
 
     entity_ident
         id (uuid)
         current_revision (entity_revision foreign key)
         redirect_id (optional; points to another entity_ident)
+        is_live (boolean; whether newly created entity has been accepted)
 
     entity_revision
         revision_id
-        <entity-specific fields>
+        <all entity-tyle-specific fields>
         extra: json blob for schema evolution
 
     entity_edit
         timestamp
-        editgroup_id
+        editgroup_id (editgroup foreign key)
         ident (entity_ident foreign key)
         new_revision (entity_revision foreign key)
+        new_redirect (optional; points to entity_ident table)
         previous_revision (optional; points to entity_revision)
         extra: json blob for progeny metadata
 
     editgroup
-        editor_id
+        editor_id (editor table foreign key)
         description
         extra: json blob for progeny metadata
 
-Additional entity-specific columns would hold actual metadata. Additional
-tables (which would reference both `entity_revision` and `entity_id` foreign
-keys as appropriate) would represent things like authorship relationships
+An individual entity can be in the following "states", from which the given
+actions (transistion) can be made:
+
+- `wip` (not live; not redirect; has rev)
+    - activate (to `active`)
+- `active` (live; not redirect; has rev)
+    - redirect (to `redirect`)
+    - delete (to `deleted`)
+- `redirect` (live; redirect; rev or not)
+    - split (to `active`)
+    - delete (to `delete`)
+- `deleted` (live; not redirect; no rev)
+    - redirect (to `redirect`)
+    - activate (to `active`)
+
+"WIP, redirect" or "WIP, deleted" are invalid states.
+
+Additional entity-specific columns hold actual metadata. Additional
+tables (which reference both `entity_revision` and `entity_id` foreign
+keys as appropriate) represent things like authorship relationships
 (creator/release), citations between works, etc. Every revision of an entity
-would require duplicating all of these associated rows, which could end up
+requires duplicating all of these associated rows, which could end up
 being a large source of inefficiency, but is necessary to represent the full
 history of an object.
 
-## Ontology
-
-Loosely following FRBR (Functional Requirements for Bibliographic Records), but
-removing the "manifestation" abstraction, and favoring files (digital
-artifacts) over physical items, the primary entities are:
-
-    work
-        <a stub, for grouping releases>
-
-    release (aka "edition", "variant")
-        title
-        volume/pages/issue/chapter
-        media/formfactor
-        publication/peer-review status
-        language
-        <published> date
-        <variant-of> work
-        <published-in> container
-        <has-contributors> creator
-        <citation-to> release
-        <has> identifier
-
-    file (aka "digital artifact")
-        <instantiates> release
-        hashes/checksums
-        mimetype
-        <found-at> URLs
-
-    creator (aka "author")
-        name
-        identifiers
-        aliases
-
-    container (aka "venue", "serial", "title")
-        name
-        open-access policy
-        peer-review policy
-        <has> aliases, acronyms
-        <about> subject/category
-        <has> identifier
-        <published-in> container
-        <published-by> publisher
-
-## Controlled Vocabularies
-
-Some special namespace tables and enums would probably be helpful; these could
-live in the database (not requiring a database migration to update), but should
-have more controlled editing workflow... perhaps versioned in the codebase:
+## Controlled Vocabularies
+
+Some individual fields have additional contraints, either in the form of
+pattern validation ("values must be upper case, contain only certain
+characters"), or membership in a fixed set of values. These may include:
 
-- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifers
-  themselves)
 - subject categorization
 - license and open access status
 - work "types" (article vs. book chapter vs. proceeding, etc)
 - contributor types (author, translator, illustrator, etc)
 - human languages
-- file mimetypes
-
-These could also be enforced by QA bots that review all editgroups.
-
-## Entity States
+- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifers
+  themselves)
 
-    wip (not live; not redirect; has rev)
-        activate
-    active (live; not redirect; has rev)
-        redirect
-        delete
-    redirect (live; redirect; rev or not)
-        split
-        delete
-    deleted (live; not redirect; no rev)
-        redirect
-        activate
+Other fixed-set "vocabularies" become too large to easily maintain or express
+in code. These could be added to the backend databases, or be enforced by bots
+(instead of the core system itself). These mostly include externally-registered identifiers or types, such as:
 
-  "wip redirect" or "wip deleted" are invalid states
+- file mimetypes
+- identifiers themselves (DOI, ORCID, etc), by checking for registeration
+  against canonical APIs and databases
 
 ## Global Edit Changelog
 
-As part of the process of "accepting" an edit group, a row would be written to
-an immutable, append-only log table (which internally could be a SQL table)
-documenting each identifier change. This changelog establishes a monotonically
-increasing version number for the entire corpus, and should make interaction
-with other systems easier (eg, search engines, replicated databases,
-alternative storage backends, notification frameworks, etc.).
+As part of the process of "accepting" an edit group, a row is written to an
+immutable, append-only table (which internally is a SQL table) documenting each
+identifier change. This changelog establishes a monotonically increasing
+version number for the entire corpus, and should make interaction with other
+systems easier (eg, search engines, replicated databases, alternative storage
+backends, notification frameworks, etc.).
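
Reviewer's sketch (not part of the commit): the entity states added in this diff are described in terms of three properties of an identifier row (live or not, redirect or not, has a revision or not). The Python below is a minimal illustration of how a state could be derived from those properties and how the listed actions could be checked. The `Ident` class and the `entity_state` and `next_state` functions are hypothetical names, not fatcat code. The diff writes the target of "delete" from `redirect` as `delete`; this sketch assumes `deleted` was meant.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed actions per state, copied from the list in the diff.
# Assumption: "delete (to `delete`)" under `redirect` means `deleted`.
TRANSITIONS = {
    "wip": {"activate": "active"},
    "active": {"redirect": "redirect", "delete": "deleted"},
    "redirect": {"split": "active", "delete": "deleted"},
    "deleted": {"redirect": "redirect", "activate": "active"},
}


@dataclass
class Ident:
    """Hypothetical in-memory stand-in for one `entity_ident` row."""
    is_live: bool
    redirect_id: Optional[str] = None
    current_revision: Optional[str] = None


def entity_state(ident: Ident) -> str:
    """Derive the state name from the (live, redirect, rev) properties."""
    is_redirect = ident.redirect_id is not None
    has_rev = ident.current_revision is not None
    if not ident.is_live:
        # "WIP, redirect" and "WIP, deleted" are invalid states.
        if is_redirect or not has_rev:
            raise ValueError("invalid state: WIP entity must have a revision and no redirect")
        return "wip"
    if is_redirect:
        return "redirect"  # rev or not
    return "active" if has_rev else "deleted"


def next_state(state: str, action: str) -> str:
    """Return the state reached by `action`, or raise if it is not listed."""
    allowed = TRANSITIONS.get(state, {})
    if action not in allowed:
        raise ValueError(f"action {action!r} is not allowed from state {state!r}")
    return allowed[action]
```

For example, `entity_state(Ident(is_live=False, current_revision="rev1"))` gives `"wip"`, and `next_state("wip", "delete")` raises because the diff lists only `activate` for WIP entities.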