diff options
author | Bryan Newbold <bnewbold@robocracy.org> | 2019-02-14 15:23:34 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2019-02-14 15:23:34 -0800 |
commit | 2d18d90e8ebf3055e091781630076558c8c8cd50 (patch) | |
tree | be20ce80c4fb749bf40b6022971f5df851132245 | |
parent | 643d32f629ddd362e4e5075cfe0c399e6e6a0d84 (diff) | |
download | fatcat-2d18d90e8ebf3055e091781630076558c8c8cd50.tar.gz fatcat-2d18d90e8ebf3055e091781630076558c8c8cd50.zip |
review/update data model page
-rw-r--r-- | guide/src/data_model.md | 144 | ||||
-rw-r--r-- | guide/src/implementation.md | 96 |
2 files changed, 121 insertions, 119 deletions
diff --git a/guide/src/data_model.md b/guide/src/data_model.md index 21d265e1..6953e107 100644 --- a/guide/src/data_model.md +++ b/guide/src/data_model.md @@ -12,45 +12,50 @@ artifacts) over physical items, the primary bibliographic entity types are: website, translated into multiple languages, and then re-published (with minimal changes) as a book chapter; these would all be variants of the same `work`. -- `release`: a specific "release" or "publicly published" (in a formal or - informal sense) version of a work. Contains traditional bibliographic - metadata (title, date of publication, media type, language, etc). Has - relationships to other entities: - - "variant of" a single `work` - - "contributed to by" multiple `creators` - - "references to" (cites) multiple `releases` - - "published as part of" a single `container` +- `release`: a specific "release" or "publicly published" version of a work. + Contains traditional bibliographic metadata (title, date of publication, + media type, language, etc). Has relationships to other entities: + - child of a single `work` (required) + - multiple `creator` entities as "contributors" (authors, editors) + - outbound references to multiple other `release` entities + - member of a single `container`, for example a journal or book series - `file`: a single concrete, fixed digital artifact; a manifestation of one or more `releases`. Machine-verifiable metadata includes file hashes, size, and detected file format. Verified URLs link to locations on the open web where this file can be found or has been archived. Has relationships: - - "manifestation of" multiple `releases` (though usually a single release) + - multiple `release` entieis that this file is a complete manifestation of + (almost always a single release) +- `fileset`: a list of muliple concrete files, together forming complete + `release` manifestation. Primarily intended for datasets and supplementary + materials; could also contain a paper "package" (source file and figures). +- `webcapture`: a single snapshot (point in time) of a webpage or small website + (multiple pages) which are a complete manifestation of a `release`. Not a + landing page or page referencing the release. - `creator`: persona (pseudonym, group, or specific human name) that - contributions to `releases` have been attributed to. Not necessarily - one-to-one with a human person. + has contributed to one or more `release`. Not necessarily one-to-one with a + human person. - `container` (aka "venue", "serial", "title"): a grouping of releases from a single publisher. Note that, compared to many similar bibliographic ontologies, the current one does not have entities to represent: +- physical artifacts, either generically or specific copies - funding sources - publishing entities - "events at a time and place" -- physical artifacts, either generically or specific copies -- sets of files (eg, a dataset or webpage with media) Each entity type has it's own relations and fields (captured in a schema), but there are are also generic operations and fields common across all entities. -The process of creating, updating, querying, and inspecting entities is roughly +The API for creating, updating, querying, and inspecting entities is roughly the same regardless of type. ## Identifiers and Revisions A specific version of any entity in the catalog is called a "revision". Revisions are generally immutable (do not change and are not editable), and are -not usually referred to directly by users. Instead, persistent identifiers can -be created, which "point to" a specific revision at a time. This distinction +not normally referred to directly. Instead, persistent "fatcat identifiers" can +be created, which "point to" a single revision at a time. This distinction means that entities referred to by an identifier can change over time (as metadata is corrected and expanded). Revision objects do not "point" back to specific identifiers, so they are not the same as a simple "version number" for @@ -62,107 +67,8 @@ All changes to identifiers are captured as an "edit" object. Edit history can be fetched and inspected on a per-identifier basis, and any changes can easily be reverted (even merges/redirects and "deletion"). -"Staged" or "proposed" changes are captured as edit objects without updating -the identifiers themselves. - -### Fatcat Identifiers - -Fatcat identifiers are semantically meaningless fixed-length random numbers, -usually represented in case-insensitive base32 format. Each entity type has its -own identifier namespace. - -128-bit (UUID size) identifiers encode as 26 characters (but note that not all -such strings decode to valid UUIDs), and in the backend can be serialized in -UUID columns: - - work_rzga5b9cd7efgh04iljk8f3jvz - https://fatcat.wiki/work/rzga5b9cd7efgh04iljk8f3jvz - -In comparison, 96-bit identifiers would have 20 characters and look like: - - work_rzga5b9cd7efgh04iljk - https://fatcat.wiki/work/rzga5b9cd7efgh04iljk - -A 64-bit namespace would probably be large enough, and would work with -database Integer columns: - - work_rzga5b9cd7efg - https://fatcat.wiki/work/rzga5b9cd7efg - -Fatcat identifiers can used to interlink between databases, but are explicitly -*not* intended to supplant DOIs, ISBNs, handle, ARKs, and other "registered" -persistent identifiers. - -### Entity States - -### Internal Schema - -Internally, identifiers are lightweight pointers to "revisions" of an entity. -Revisions are stored in their complete form, not as a patch or difference; if -comparing to distributed version control systems (for managing changes to -source code), this follows the git model, not the mercurial model. - -The entity revisions are immutable once accepted; the editing process involves -the creation of new entity revisions and, if the edit is approved, pointing the -identifier to the new revision. Entities cross-reference between themselves by -*identifier* not *revision number*. Identifier pointers also support -(versioned) deletion and redirects (for merging entities). - -Edit objects represent a change to a single entity; edits get batched together -into edit groups (like "commits" and "pull requests" in git parlance). - -SQL tables look something like this (with separate tables for entity type a la -`work_revision` and `work_edit`): - - entity_ident - id (uuid) - current_revision (entity_revision foreign key) - redirect_id (optional; points to another entity_ident) - is_live (boolean; whether newly created entity has been accepted) - - entity_revision - revision_id - <all entity-style-specific fields> - extra: json blob for schema evolution - - entity_edit - timestamp - editgroup_id (editgroup foreign key) - ident (entity_ident foreign key) - new_revision (entity_revision foreign key) - new_redirect (optional; points to entity_ident table) - previous_revision (optional; points to entity_revision) - extra: json blob for provenance metadata - - editgroup - editor_id (editor table foreign key) - description - extra: json blob for provenance metadata - -An individual entity can be in the following "states", from which the given -actions (transition) can be made: - -- `wip` (not live; not redirect; has rev) - - activate (to `active`) -- `active` (live; not redirect; has rev) - - redirect (to `redirect`) - - delete (to `deleted`) -- `redirect` (live; redirect; rev or not) - - split (to `active`) - - delete (to `delete`) -- `deleted` (live; not redirect; no rev) - - redirect (to `redirect`) - - activate (to `active`) - -"WIP, redirect" or "WIP, deleted" are invalid states. - -Additional entity-specific columns hold actual metadata. Additional -tables (which reference both `entity_revision` and `entity_id` foreign -keys as appropriate) represent things like authorship relationships -(creator/release), citations between works, etc. Every revision of an entity -requires duplicating all of these associated rows, which could end up -being a large source of inefficiency, but is necessary to represent the full -history of an object. +"Work in progress" or "proposed" updates are staged as edit objects without +updating the identifiers themselves. ## Controlled Vocabularies @@ -170,7 +76,6 @@ Some individual fields have additional constraints, either in the form of pattern validation ("values must be upper case, contain only certain characters"), or membership in a fixed set of values. These may include: -- subject categorization - license and open access status - work "types" (article vs. book chapter vs. proceeding, etc) - contributor types (author, translator, illustrator, etc) @@ -180,7 +85,8 @@ characters"), or membership in a fixed set of values. These may include: Other fixed-set "vocabularies" become too large to easily maintain or express in code. These could be added to the backend databases, or be enforced by bots -(instead of the core system itself). These mostly include externally-registered identifiers or types, such as: +(instead of the system itself). These mostly include externally-registered +identifiers or types, such as: - file mimetypes - identifiers themselves (DOI, ORCID, etc), by checking for registration diff --git a/guide/src/implementation.md b/guide/src/implementation.md index 66ae7f6b..33a53c21 100644 --- a/guide/src/implementation.md +++ b/guide/src/implementation.md @@ -24,3 +24,99 @@ to a rigid third-party ontology or schema. Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third party bots could ingest or synchronize the database in those formats. + +### Fatcat Identifiers + +Fatcat identifiers are semantically meaningless fixed-length random numbers, +usually represented in case-insensitive base32 format. Each entity type has its +own identifier namespace. + +128-bit (UUID size) identifiers encode as 26 characters (but note that not all +such strings decode to valid UUIDs), and in the backend can be serialized in +UUID columns: + + work_rzga5b9cd7efgh04iljk8f3jvz + https://fatcat.wiki/work/rzga5b9cd7efgh04iljk8f3jvz + +In comparison, 96-bit identifiers would have 20 characters and look like: + + work_rzga5b9cd7efgh04iljk + https://fatcat.wiki/work/rzga5b9cd7efgh04iljk + +and 64-bit: + + work_rzga5b9cd7efg + https://fatcat.wiki/work/rzga5b9cd7efg + +Fatcat identifiers can used to interlink between databases, but are explicitly +*not* intended to supplant DOIs, ISBNs, handle, ARKs, and other "registered" +persistent identifiers for general use. + +### Internal Schema + +Internally, identifiers are lightweight pointers to "revisions" of an entity. +Revisions are stored in their complete form, not as a patch or difference; if +comparing to distributed version control systems (for managing changes to +source code), this follows the git model, not the mercurial model. + +The entity revisions are immutable once accepted; the editing process involves +the creation of new entity revisions and, if the edit is approved, pointing the +identifier to the new revision. Entities cross-reference between themselves by +*identifier* not *revision number*. Identifier pointers also support +(versioned) deletion and redirects (for merging entities). + +Edit objects represent a change to a single entity; edits get batched together +into edit groups (like "commits" and "pull requests" in git parlance). + +SQL tables look something like this (with separate tables for entity type a la +`work_revision` and `work_edit`): + + entity_ident + id (uuid) + current_revision (entity_revision foreign key) + redirect_id (optional; points to another entity_ident) + is_live (boolean; whether newly created entity has been accepted) + + entity_revision + revision_id + <all entity-style-specific fields> + extra: json blob for schema evolution + + entity_edit + timestamp + editgroup_id (editgroup foreign key) + ident (entity_ident foreign key) + new_revision (entity_revision foreign key) + new_redirect (optional; points to entity_ident table) + previous_revision (optional; points to entity_revision) + extra: json blob for provenance metadata + + editgroup + editor_id (editor table foreign key) + description + extra: json blob for provenance metadata + +An individual entity can be in the following "states", from which the given +actions (transition) can be made: + +- `wip` (not live; not redirect; has rev) + - activate (to `active`) +- `active` (live; not redirect; has rev) + - redirect (to `redirect`) + - delete (to `deleted`) +- `redirect` (live; redirect; rev or not) + - split (to `active`) + - delete (to `delete`) +- `deleted` (live; not redirect; no rev) + - redirect (to `redirect`) + - activate (to `active`) + +"WIP, redirect" or "WIP, deleted" are invalid states. + +Additional entity-specific columns hold actual metadata. Additional +tables (which reference both `entity_revision` and `entity_id` foreign +keys as appropriate) represent things like authorship relationships +(creator/release), citations between works, etc. Every revision of an entity +requires duplicating all of these associated rows, which could end up +being a large source of inefficiency, but is necessary to represent the full +history of an object. |