large progress on guide

Don't have enough time to complete and copy-edit this now though.
author: Bryan Newbold <bnewbold@robocracy.org> 2018-09-21 12:33:35 -0700
committer: Bryan Newbold <bnewbold@robocracy.org> 2018-09-21 12:34:06 -0700
commit: 1915c7b885641a34191efeee2cc8525a6ad27b9f (patch)
tree: c26b8a772d8e79689b0b7bf6498590d517717ece /guide/src/data_model.md
parent: a1e5acf125decc0f2af28beca43e91b4085cc3d9 (diff)
download: fatcat-1915c7b885641a34191efeee2cc8525a6ad27b9f.tar.gz
fatcat-1915c7b885641a34191efeee2cc8525a6ad27b9f.zip
1 files changed, 124 insertions, 96 deletions
diff --git a/guide/src/data_model.md b/guide/src/data_model.md
index b2a02688..f3b9b35a 100644
--- a/guide/src/data_model.md
+++ b/guide/src/data_model.md
@@ -1,12 +1,73 @@
 # Data Model
 
-## Identifiers
-
-A fixed number of first-class "entities" are defined, with common behavior and
-schema layouts. These are all be semantic entities like "work", "release",
-"container", and "creator".
-
-fatcat identifiers are semantically meaningless fixed-length random numbers,
+## Entity Types and Ontology
+
+Loosely following "Functional Requirements for Bibliographic Records" (FRBR),
+but removing the "manifestation" abstraction, and favoring files (digital
+artifacts) over physical items, the primary bibliographic entity types are:
+
+- `work`: representing an abstract unit of creative output. Does not contain
+  any metadata itself; used only to group `release` entities. For example, a
+  journal article could be posted as a pre-print, published on a journal
+  website, translated into multiple languages, and then re-published (with
+  minimal changes) as a book chapter; these would all be variants of the same
+  `work`.
+- `release`: a specific "release" or "publicly published" (in a formal or
+  informal sense) version of a work. Contains traditional bibliographic
+  metadata (title, date of publication, media type, language, etc). Has
+  relationships to other entities:
+    - "variant of" a single `work`
+    - "contributed to by" multiple `creators`
+    - "references to" (cites) multiple `releases`
+    - "published as part of" a single `container`
+- `file`: a single concrete, fixed digital artifact; a manifestation of one or
+  more `releases`. Machine-verifiable metadata includes file hashes, size, and
+  detected file format. Verified URLs link to locations on the open web where
+  this file can be found or has been archived. Has relationships:
+    - "manifestation of" multiple `releases` (though usually a single release)
+- `creator`: persona (pseudonym, group, or specific human name) that
+  contributions to `releases` have been attributed to. Not necessarily
+  one-to-one with a human person.
+- `container` (aka "venue", "serial", "title"): a grouping of releases from a
+  single publisher.
+
+Note that, compared to many similar bibliographic ontologies, the current one
+does not have entities to represent:
+
+- funding sources
+- publishing entities
+- "events at a time and place"
+- physical artifacts, either generically or specific copies
+- sets of files (eg, a dataset or webpage with media)
+
+Each entity type has its own relations and fields (captured in a schema), but
+there are also generic operations and fields common across all entities.
+The process of creating, updating, querying, and inspecting entities is roughly
+the same regardless of type.
+
+## Identifiers and Revisions
+
+A specific version of any entity in the catalog is called a "revision".
+Revisions are generally immutable (do not change and are not editable), and are
+not usually referred to directly by users. Instead, persistent identifiers can
+be created, which "point to" a specific revision at a time. This distinction
+means that entities referred to by an identifier can change over time (as
+metadata is corrected and expanded). Revision objects do not "point" back to
+specific identifiers, so they are not the same as a simple "version number" for
+an identifier.
+
+Identifiers also have the ability to be merged (by redirecting one identifier
+to another) and "deleted" (by pointing the identifier to no revision at all).
+All changes to identifiers are captured as an "edit" object. Edit history can
+be fetched and inspected on a per-identifier basis, and any changes can easily
+be reverted (even merges/redirects and "deletion").
+
+"Staged" or "proposed" changes are captured as edit objects without updating
+the identifiers themselves.
+
+### Fatcat Identifiers
+
+Fatcat identifiers are semantically meaningless fixed-length random numbers,
 usually represented in case-insensitive base32 format. Each entity type has its
 own identifier namespace.
 
@@ -28,16 +89,18 @@ database Integer columns:
     work_rzga5b9cd7efg
     https://fatcat.wiki/work/rzga5b9cd7efg
 
-The idea would be to only have fatcat identifiers be used to interlink between
-databases, *not* to supplant DOIs, ISBNs, handle, ARKs, and other "registered"
+Fatcat identifiers can be used to interlink between databases, but are explicitly
+*not* intended to supplant DOIs, ISBNs, handles, ARKs, and other "registered"
 persistent identifiers.
 
-## Entities and Internal Schema
+### Entity States
+
+### Internal Schema
 
-Internally, identifiers would be lightweight pointers to "revisions" of an
-entity. Revisions are stored in their complete form, not as a patch or
-difference; if comparing to distributed version control systems, this is the
-git model, not the mercurial model.
+Internally, identifiers are lightweight pointers to "revisions" of an entity.
+Revisions are stored in their complete form, not as a patch or difference; if
+comparing to distributed version control systems (for managing changes to
+source code), this follows the git model, not the mercurial model.
 
 The entity revisions are immutable once accepted; the editting process involves
 the creation of new entity revisions and, if the edit is approved, pointing the
@@ -48,122 +111,87 @@ identifier to the new revision. Entities cross-reference between themselves by
 Edit objects represent a change to a single entity; edits get batched together
 into edit groups (like "commits" and "pull requests" in git parlance).
 
-SQL tables would probably look something like the (but specific to each entity
-type, with tables like `work_revision` not `entity_revision`):
+SQL tables look something like this (with separate tables for each entity type,
+a la `work_revision` and `work_edit`):
 
     entity_ident
         id (uuid)
         current_revision (entity_revision foreign key)
         redirect_id (optional; points to another entity_ident)
+        is_live (boolean; whether newly created entity has been accepted)
 
     entity_revision
         revision_id
-        <entity-specific fields>
+        <all entity-type-specific fields>
         extra: json blob for schema evolution
 
     entity_edit
         timestamp
-        editgroup_id
+        editgroup_id (editgroup foreign key)
         ident (entity_ident foreign key)
         new_revision (entity_revision foreign key)
+        new_redirect (optional; points to entity_ident table)
         previous_revision (optional; points to entity_revision)
         extra: json blob for progeny metadata
 
     editgroup
-        editor_id
+        editor_id (editor table foreign key)
         description
         extra: json blob for progeny metadata
 
-Additional entity-specific columns would hold actual metadata. Additional
-tables (which would reference both `entity_revision` and `entity_id` foreign
-keys as appropriate) would represent things like authorship relationships
+An individual entity can be in the following "states", from which the given
+actions (transitions) can be made:
+
+- `wip` (not live; not redirect; has rev)
+    - activate (to `active`)
+- `active` (live; not redirect; has rev)
+    - redirect (to `redirect`)
+    - delete (to `deleted`)
+- `redirect` (live; redirect; rev or not)
+    - split (to `active`)
+    - delete (to `deleted`)
+- `deleted` (live; not redirect; no rev)
+    - redirect (to `redirect`)
+    - activate (to `active`)
+
+"WIP, redirect" or "WIP, deleted" are invalid states.
+
+Additional entity-specific columns hold actual metadata. Additional
+tables (which reference both `entity_revision` and `entity_id` foreign
+keys as appropriate) represent things like authorship relationships
 (creator/release), citations between works, etc. Every revision of an entity
-would require duplicating all of these associated rows, which could end up
+requires duplicating all of these associated rows, which could end up
 being a large source of inefficiency, but is necessary to represent the full
 history of an object.
 
-## Ontology
-
-Loosely following FRBR (Functional Requirements for Bibliographic Records), but
-removing the "manifestation" abstraction, and favoring files (digital
-artifacts) over physical items, the primary entities are:
-
-    work
-        <a stub, for grouping releases>
-
-    release (aka "edition", "variant")
-        title
-        volume/pages/issue/chapter
-        media/formfactor
-        publication/peer-review status
-        language
-        <published> date
-        <variant-of> work
-        <published-in> container
-        <has-contributors> creator
-        <citation-to> release
-        <has> identifier
-
-    file (aka "digital artifact")
-        <instantiates> release
-        hashes/checksums
-        mimetype
-        <found-at> URLs
-
-    creator (aka "author")
-        name
-        identifiers
-        aliases
-
-    container (aka "venue", "serial", "title")
-        name
-        open-access policy
-        peer-review policy
-        <has> aliases, acronyms
-        <about> subject/category
-        <has> identifier
-        <published-in> container
-        <published-by> publisher
-
-## Controlled Vocabularies
-
-Some special namespace tables and enums would probably be helpful; these could
-live in the database (not requiring a database migration to update), but should
-have more controlled editing workflow... perhaps versioned in the codebase:
+## Controlled Vocabularies
+
+Some individual fields have additional constraints, either in the form of
+pattern validation ("values must be upper case, contain only certain
+characters"), or membership in a fixed set of values. These may include:
 
-- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifers
-  themselves)
 - subject categorization
 - license and open access status
 - work "types" (article vs. book chapter vs. proceeding, etc)
 - contributor types (author, translator, illustrator, etc)
 - human languages
-- file mimetypes
-
-These could also be enforced by QA bots that review all editgroups.
-
-## Entity States
+- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifiers
+  themselves)
 
-    wip (not live; not redirect; has rev)
-      activate
-    active (live; not redirect; has rev)
-      redirect
-      delete
-    redirect (live; redirect; rev or not)
-      split
-      delete
-    deleted (live; not redirect; no rev)
-      redirect
-      activate
+Other fixed-set "vocabularies" become too large to easily maintain or express
+in code. These could be added to the backend databases, or be enforced by bots
+(instead of the core system itself). These mostly include externally-registered identifiers or types, such as:
 
-    "wip redirect" or "wip deleted" are invalid states
+- file mimetypes
+- identifiers themselves (DOI, ORCID, etc), by checking for registration
+  against canonical APIs and databases
 
 ## Global Edit Changelog
 
-As part of the process of "accepting" an edit group, a row would be written to
-an immutable, append-only log table (which internally could be a SQL table)
-documenting each identifier change. This changelog establishes a monotonically
-increasing version number for the entire corpus, and should make interaction
-with other systems easier (eg, search engines, replicated databases,
-alternative storage backends, notification frameworks, etc.).
+As part of the process of "accepting" an edit group, a row is written to an
+immutable, append-only table (which internally is a SQL table) documenting each
+identifier change. This changelog establishes a monotonically increasing
+version number for the entire corpus, and should make interaction with other
+systems easier (eg, search engines, replicated databases, alternative storage
+backends, notification frameworks, etc.).
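
Note on the entity states added in this change: the state list (`wip`, `active`, `redirect`, `deleted`) and its allowed actions can be read as a small state machine. The following is a minimal sketch of that reading, not fatcat's actual implementation. The names `TRANSITIONS`, `state_of`, and `apply_action` are hypothetical, and the three boolean flags are assumed to correspond to `is_live`, a non-null `redirect_id`, and a non-null `current_revision` on the ident row.

```python
# Hypothetical sketch of the entity states described in data_model.md.
# None of these names come from the fatcat codebase.

# Allowed actions per state, copied from the list in the guide.
TRANSITIONS = {
    "wip": {"activate": "active"},
    "active": {"redirect": "redirect", "delete": "deleted"},
    "redirect": {"split": "active", "delete": "deleted"},
    "deleted": {"redirect": "redirect", "activate": "active"},
}


def state_of(is_live: bool, is_redirect: bool, has_rev: bool) -> str:
    """Derive the state name from the three flags on an ident row (assumed)."""
    if not is_live:
        # "WIP, redirect" and "WIP, deleted" are invalid per the guide.
        if is_redirect or not has_rev:
            raise ValueError("invalid state: WIP entity is a redirect or has no revision")
        return "wip"
    if is_redirect:
        # A redirect may or may not carry a revision.
        return "redirect"
    return "active" if has_rev else "deleted"


def apply_action(state: str, action: str) -> str:
    """Return the state reached by taking `action` from `state`."""
    allowed = TRANSITIONS.get(state, {})
    if action not in allowed:
        raise ValueError(f"action {action!r} is not allowed from state {state!r}")
    return allowed[action]
```

For example, `apply_action(state_of(False, False, True), "activate")` moves a newly created entity from `wip` to `active`, while `apply_action("wip", "delete")` raises `ValueError` because the guide lists no such action for `wip`.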