From 76ac2a96a6bd3910f8f4af18f79b539b1d29edf9 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 14 Feb 2019 12:24:55 -0800
Subject: provenance, not progeny

---
 python/fatcat_web/templates/rfc.html | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/python/fatcat_web/templates/rfc.html b/python/fatcat_web/templates/rfc.html
index 85f100b7..c7e7149f 100644
--- a/python/fatcat_web/templates/rfc.html
+++ b/python/fatcat_web/templates/rfc.html
@@ -28,13 +28,13 @@

Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third-party bots could ingest or synchronize the database in those formats.

Licensing

The core fatcat database should only contain verifiable factual statements (which isn't to say that all statements are "true"), not creative or derived content.

-The goal is to have a very permissively licensed database: CC-0 (no rights reserved) if possible. Under US law, it should be possible to scrape and pull in factual data from other corpuses without adopting their licenses. The goal here isn't to avoid attribution (progeny information will be included, and a large sources and acknowledgments statement should be maintained and shipped with bulk exports), but trying to manage the intersection of all upstream source licenses seems untenable, and creates burdens for downstream users and developers.

+The goal is to have a very permissively licensed database: CC-0 (no rights reserved) if possible. Under US law, it should be possible to scrape and pull in factual data from other corpuses without adopting their licenses. The goal here isn't to avoid attribution (provenance information will be included, and a large sources and acknowledgments statement should be maintained and shipped with bulk exports), but trying to manage the intersection of all upstream source licenses seems untenable, and creates burdens for downstream users and developers.

Special care will need to be taken around copyright, "original work" by editors, and contributions that raise privacy concerns. If abstracts are stored at all, they should be in a partitioned database table to prevent copyright contamination. Likewise, even simple user-created content like lists, reviews, ratings, comments, discussion, documentation, etc., should live in separate services.

Basic Editing Workflow and Bots

Both human editors and bots should have edits go through the same API, with humans using either the default web interface, integrations, or client software.

The normal workflow is to create edits (or updates, merges, deletions) on individual entities. Individual changes are bundled into an "edit group" of related edits (eg, correcting authorship info for multiple works related to a single author). When ready, the editor would "submit" the edit group for review. During the review period, human editors vote and bots can perform automated checks. During this period the editor can make tweaks if necessary. After some fixed time period (72 hours?) with no changes and no blocking issues, the edit group would be auto-accepted if no merge conflicts have been created by other edits to the same entities. This process balances editing labor (reviews are easy, but optional) against quality (cool-down period makes it easier to detect and prevent spam or out-of-control bots). More sophisticated roles and permissions could allow certain humans and bots to push through edits more rapidly (eg, importing new works from a publisher API).
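The auto-accept rule described above can be sketched as a small predicate. This is only an illustration: the `EditGroup` class, its field names, and the 72-hour constant are assumptions made for this sketch (the text itself only floats "72 hours?" as a possibility), not part of any existing fatcat code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed cool-down period; the proposal leaves the exact value open.
REVIEW_WINDOW = timedelta(hours=72)


@dataclass
class EditGroup:
    # All field names here are hypothetical, chosen for this sketch.
    last_changed: datetime            # when the editor last tweaked the group
    blocking_issues: int = 0          # unresolved objections from reviewers or bots
    has_merge_conflict: bool = False  # other edits touched the same entities


def can_auto_accept(group: EditGroup, now: datetime) -> bool:
    """True once the group has sat unchanged for the full review window
    with no blocking issues and no merge conflicts."""
    quiet_long_enough = now - group.last_changed >= REVIEW_WINDOW
    return (
        quiet_long_enough
        and group.blocking_issues == 0
        and not group.has_merge_conflict
    )
```

Because the clock is measured from `last_changed`, any tweak by the editor restarts the cool-down, which matches the "no changes" condition in the text.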

Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and reverts manageable.

-Data progeny and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing external databases, the expectation is that special-purpose bot accounts are used and tag timestamps and external identifiers in the edit metadata. Human editors would leave edit messages to clarify their sources.

+Data provenance and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing external databases, the expectation is that special-purpose bot accounts are used and tag timestamps and external identifiers in the edit metadata. Human editors would leave edit messages to clarify their sources.
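As a rough illustration of what an import bot might record, the sketch below builds a JSON blob for the `extra` field of an edit. The key names (`source`, `external_id`, `fetched_at`) and the helper name are assumptions for this example; the proposal does not define a schema for the blob.

```python
import json
from datetime import datetime, timezone


def bot_edit_extra(source: str, external_id: str, fetched_at: datetime) -> str:
    """Serialize provenance metadata for one bot edit as a JSON string.

    The keys used here are hypothetical; they stand in for whatever
    timestamps and external identifiers a real import bot would tag.
    """
    return json.dumps(
        {
            "source": source,
            "external_id": external_id,
            "fetched_at": fetched_at.isoformat(),
        },
        sort_keys=True,
    )


# Example call with made-up values.
extra = bot_edit_extra(
    "example-publisher-api",
    "example-id-123",
    datetime(2019, 2, 14, 20, 0, tzinfo=timezone.utc),
)
```

Keeping this in the edit metadata, rather than on the entity, means the entity data model stays free of per-source bookkeeping.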

A style guide (wiki) and discussion forum would be hosted as separate stand-alone services for editors to propose projects and debate process or scope changes. These services should have unified accounts and logins (oauth?) to have consistent account IDs across all mediums.

Global Edit Changelog

As part of the process of "accepting" an edit group, a row would be written to an immutable, append-only log table (which internally could be a SQL table) documenting each identifier change. This changelog establishes a monotonically increasing version number for the entire corpus, and should make interaction with other systems easier (eg, search engines, replicated databases, alternative storage backends, notification frameworks, etc.).
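One way to picture the append-only changelog is a SQL table whose primary key doubles as the corpus version, with triggers that reject updates and deletes. The sketch below uses SQLite through Python's standard `sqlite3` module purely for illustration; the table name, column names, and the choice of one row per accepted edit group are assumptions, not a description of the actual implementation.

```python
import sqlite3


def open_changelog(conn: sqlite3.Connection) -> None:
    """Create a hypothetical append-only changelog table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS changelog ("
        " version INTEGER PRIMARY KEY AUTOINCREMENT,"
        " editgroup_id TEXT NOT NULL,"
        " accepted_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    # Triggers make the table immutable: rows can be added, never changed.
    for action in ("UPDATE", "DELETE"):
        conn.execute(
            f"CREATE TRIGGER IF NOT EXISTS changelog_no_{action.lower()} "
            f"BEFORE {action} ON changelog "
            "BEGIN SELECT RAISE(ABORT, 'changelog is append-only'); END"
        )


def accept_editgroup(conn: sqlite3.Connection, editgroup_id: str) -> int:
    """Append one row and return the new corpus version number."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO changelog (editgroup_id) VALUES (?)", (editgroup_id,)
        )
    return cur.lastrowid
```

Downstream consumers such as search indexers could then poll for rows with a `version` greater than the last one they processed.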

@@ -72,12 +72,12 @@ entity_edit
     ident (entity_ident foreign key)
     new_revision (entity_revision foreign key)
     previous_revision (optional; points to entity_revision)
-    extra: json blob for progeny metadata
+    extra: json blob for provenance metadata
 editgroup
     editor_id
     description
-    extra: json blob for progeny metadata
+    extra: json blob for provenance metadata

Additional entity-specific columns would hold actual metadata. Additional tables (which would reference both entity_revision and entity_id foreign keys as appropriate) would represent things like authorship relationships (creator/release), citations between works, etc. Every revision of an entity would require duplicating all of these associated rows, which could end up being a large source of inefficiency, but is necessary to represent the full history of an object.
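The cost of duplicating associated rows per revision can be made concrete with a small sketch. The `ContribRow` class and its fields are hypothetical stand-ins for a relationship table such as creator/release authorship; only the idea of copying every associated row to the new revision comes from the text.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class ContribRow:
    # Hypothetical relationship row tied to one entity revision.
    revision_id: int
    creator_ident: str
    role: str


def rows_for_new_revision(
    rows: List[ContribRow], old_rev: int, new_rev: int
) -> List[ContribRow]:
    """Copy every row attached to old_rev so that it points at new_rev.

    The originals are left untouched, which is what preserves the full
    history of the entity at the price of duplicated rows.
    """
    return [replace(r, revision_id=new_rev) for r in rows if r.revision_id == old_rev]
```

An entity with many authors or citations therefore pays for a full copy of those rows on every edit, even when the edit changes a single field.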

Scope

The goal is to capture the "scholarly web": the graph of written works that cite other works. Any work that is both cited more than once and cites more than one other work in the catalog is very likely to be in scope. "Leaf nodes" and small islands of intra-cited works may or may not be in scope.
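The scope heuristic can be sketched as a filter over a citation graph. The function name and the input shape (pairs of citing and cited identifiers) are assumptions for this example, and it counts distinct works rather than repeated citations between the same pair, which is one possible reading of "cited more than once".

```python
from collections import defaultdict
from typing import Hashable, Iterable, Set, Tuple


def likely_in_scope(citations: Iterable[Tuple[Hashable, Hashable]]) -> Set[Hashable]:
    """Return works that cite more than one other work and are cited by
    more than one other work. Input is (citing, cited) pairs."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for citing, cited in citations:
        if citing == cited:
            continue  # ignore self-citations in this sketch
        outgoing[citing].add(cited)
        incoming[cited].add(citing)
    return {
        work
        for work in outgoing
        if len(outgoing[work]) > 1 and len(incoming.get(work, ())) > 1
    }
```

Works outside the returned set are not thereby excluded; as the text says, leaf nodes and small islands may or may not be in scope.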

-- cgit v1.2.3