From 76ac2a96a6bd3910f8f4af18f79b539b1d29edf9 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 14 Feb 2019 12:24:55 -0800
Subject: provenance, not progeny

---
 fatcat-rfc.md                        |  8 ++--
 guide/src/container_extra.md         | 78 ++++++++++++++++++++++++++++++++++++
 guide/src/cookbook.md                |  2 +-
 guide/src/data_model.md              |  4 +-
 guide/src/policies.md                |  6 +--
 guide/src/workflow.md                |  2 +-
 python/fatcat_web/templates/rfc.html |  8 ++--
 7 files changed, 93 insertions(+), 15 deletions(-)
 create mode 100644 guide/src/container_extra.md

diff --git a/fatcat-rfc.md b/fatcat-rfc.md
index d79f682d..13466df2 100644
--- a/fatcat-rfc.md
+++ b/fatcat-rfc.md
@@ -74,7 +74,7 @@ content.
 The goal is to have a very permissively licensed database: CC-0 (no rights
 reserved) if possible. Under US law, it should be possible to scrape and pull
 in factual data from other corpuses without adopting their licenses. The goal
-here isn't to avoid attribution (progeny information will be included, and a
+here isn't to avoid attribution (provenance information will be included, and a
 large sources and acknowledgments statement should be maintained and shipped
 with bulk exports), but trying to manage the intersection of all upstream
 source licenses seems untenable, and creates burdens for downstream users and
@@ -111,7 +111,7 @@
 Bots need to be tuned to have appropriate edit group sizes (eg, daily batches,
 instead of millions of works in a single edit) to make human QA review and
 reverts managable.

-Data progeny and source references are captured in the edit metadata, instead
+Data provenance and source references are captured in the edit metadata, instead
 of being encoded in the entity data model itself. In the case of importing
 external databases, the expectation is that special-purpose bot accounts are be
 used, and tag timestamps and external identifiers in the edit metadata.

@@ -198,12 +198,12 @@ type, with tables like `work_revision` not `entity_revision`):
     ident (entity_ident foreign key)
     new_revision (entity_revision foreign key)
     previous_revision (optional; points to entity_revision)
-    extra: json blob for progeny metadata
+    extra: json blob for provenance metadata

 editgroup
     editor_id
     description
-    extra: json blob for progeny metadata
+    extra: json blob for provenance metadata

 Additional entity-specific columns would hold actual metadata. Additional
 tables (which would reference both `entity_revision` and `entity_id` foreign

diff --git a/guide/src/container_extra.md b/guide/src/container_extra.md
new file mode 100644
index 00000000..224b7e8a
--- /dev/null
+++ b/guide/src/container_extra.md
@@ -0,0 +1,78 @@
+
+'extra' fields:
+
+    doaj
+        as_of: datetime of most recent check; if not set, not actually in DOAJ
+        seal: bool
+        work_level: bool (are work-level publications deposited with DOAJ?)
+        archiving: array, can include 'library' or 'other'
+    road
+        as_of: datetime of most recent check; if not set, not actually in ROAD
+    pubmed (TODO: delete?)
+        as_of: datetime of most recent check; if not set, not actually indexed in pubmed
+    norwegian (TODO: drop this?)
+        as_of: datetime of most recent check; if not set, not actually indexed in pubmed
+        id (integer)
+        level (integer; 0-2)
+    kbart
+        lockss
+            year_rle
+            volume_rle
+        portico
+            ...
+        clockss
+            ...
+    sherpa_romeo
+        color
+    jstor
+        year_rle
+        volume_rle
+    scopus
+        id
+        TODO: print/electronic distinction?
+    wos
+        id
+    doi
+        crossref_doi: DOI of the title in crossref (if exists)
+        prefixes: array of strings (DOI prefixes, up to the '/'; any registrar, not just Crossref)
+    ia
+        sim
+            nap_id
+            year_rle
+            volume_rle
+        longtail: boolean
+        homepage
+            as_of: datetime of last attempt
+            url
+            status: HTTP/heritrix status of homepage crawl
+
+    issnp: string
+    issne: string
+    coden: string
+    abbrev: string
+    oclc_id: string (TODO: lookup?)
+    lccn_id: string (TODO: lookup?)
+    dblb_id: string
+    default_license: slug
+    original_name: native name (if name is translated)
+    platform: hosting platform: OJS, wordpress, scielo, etc
+    mimetypes: array of strings (eg, 'application/pdf', 'text/html')
+    first_year: year (integer)
+    last_year: if publishing has stopped
+    primary_language: single ISO code, or 'mixed'
+    languages: array of ISO codes
+    region: TODO: continent/world-region
+    nation: shortcode of nation
+    discipline: TODO: highest-level subject; "life science", "humanities", etc
+    field: TODO: narrower description of field
+    subjects: TODO?
+    url: homepage
+    is_oa: boolean. If true, can assume all releases under this container are "Open Access"
+    TODO: domains, if exclusive?
+    TODO: fulltext_regex, if a known pattern?
+
+For KBART, etc:
+    We "over-count" on the assumption that "in-progress" status works will soon actually be preserved.
+    year and volume spans are run-length-encoded arrays, using integers:
+    - if an integer, means that year is preserved
+    - if an array of length 2, means everything between the two numbers (inclusive) is preserved

diff --git a/guide/src/cookbook.md b/guide/src/cookbook.md
index 74bffe59..03c2981a 100644
--- a/guide/src/cookbook.md
+++ b/guide/src/cookbook.md
@@ -33,5 +33,5 @@
 When bootstrapping a blank catalog, we need to insert 10s or 100s of millions
 of entities as fast as possible.

-1. Create (POST) a new editgroup, with progeny information included
+1. Create (POST) a new editgroup, with provenance information included
 2. Batch create (POST) entities

diff --git a/guide/src/data_model.md b/guide/src/data_model.md
index 2d6f7287..21d265e1 100644
--- a/guide/src/data_model.md
+++ b/guide/src/data_model.md
@@ -132,12 +132,12 @@ SQL tables look something like this (with separate tables for entity type a la
     new_revision (entity_revision foreign key)
     new_redirect (optional; points to entity_ident table)
     previous_revision (optional; points to entity_revision)
-    extra: json blob for progeny metadata
+    extra: json blob for provenance metadata

 editgroup
     editor_id (editor table foreign key)
     description
-    extra: json blob for progeny metadata
+    extra: json blob for provenance metadata

 An individual entity can be in the following "states", from which the given
 actions (transition) can be made:

diff --git a/guide/src/policies.md b/guide/src/policies.md
index 03e5e526..e61984be 100644
--- a/guide/src/policies.md
+++ b/guide/src/policies.md
@@ -16,13 +16,13 @@
 The Fatcat catalog content license is the Creative Commons Zero ("CC-0")
 license, which is effectively a public domain grant. This applies to the
 catalog metadata itself (titles, entity relationships, citation metadata, URLs,
 hashes, identifiers), as well as "meta-meta-data" provided by editors (edit
-descriptions, progeny metadata, etc).
+descriptions, provenance metadata, etc).

 The core catalog is designed to contain only factual information: "this work,
 known by this title and with these third-party identifiers, is believed to be
 represented by these files and published under such-and-such venue". As a norm,
-sourcing metadata (for attribution and progeny) is retained for each edit made
-to the catalog.
+sourcing metadata (for attribution and provenance) is retained for each edit
+made to the catalog.

 A notable exception to this policy are abstracts, for which no copyright
 claims or license is made. Abstract content is kept separate from core catalog

diff --git a/guide/src/workflow.md b/guide/src/workflow.md
index 996fb24c..94842e54 100644
--- a/guide/src/workflow.md
+++ b/guide/src/workflow.md
@@ -24,7 +24,7 @@
 Bots need to be tuned to have appropriate edit group sizes (eg, daily batches,
 instead of millions of works in a single edit) to make human QA review and
 reverts manageable.

-Data progeny and source references are captured in the edit metadata, instead
+Data provenance and source references are captured in the edit metadata, instead
 of being encoded in the entity data model itself. In the case of importing
 external databases, the expectation is that special-purpose bot accounts are be
 used, and tag timestamps and external identifiers in the edit metadata.

diff --git a/python/fatcat_web/templates/rfc.html b/python/fatcat_web/templates/rfc.html
index 85f100b7..c7e7149f 100644
--- a/python/fatcat_web/templates/rfc.html
+++ b/python/fatcat_web/templates/rfc.html
@@ -28,13 +28,13 @@

Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third-party bots could ingest or synchronize the database in those formats.

Licensing

The core fatcat database should only contain verifiable factual statements (which isn't to say that all statements are "true"), not creative or derived content.

-The goal is to have a very permissively licensed database: CC-0 (no rights reserved) if possible. Under US law, it should be possible to scrape and pull in factual data from other corpuses without adopting their licenses. The goal here isn't to avoid attribution (progeny information will be included, and a large sources and acknowledgments statement should be maintained and shipped with bulk exports), but trying to manage the intersection of all upstream source licenses seems untenable, and creates burdens for downstream users and developers.
+The goal is to have a very permissively licensed database: CC-0 (no rights reserved) if possible. Under US law, it should be possible to scrape and pull in factual data from other corpuses without adopting their licenses. The goal here isn't to avoid attribution (provenance information will be included, and a large sources and acknowledgments statement should be maintained and shipped with bulk exports), but trying to manage the intersection of all upstream source licenses seems untenable, and creates burdens for downstream users and developers.

Special care will need to be taken around copyright, "original work" by editors, and contributions that raise privacy concerns. If abstracts are stored at all, they should be in a partitioned database table to prevent copyright contamination. Likewise, even simple user-created content like lists, reviews, ratings, comments, discussion, documentation, etc., should live in separate services.

Basic Editing Workflow and Bots

Both human editors and bots should have edits go through the same API, with humans using either the default web interface, integrations, or client software.

The normal workflow is to create edits (or updates, merges, deletions) on individual entities. Individual changes are bundled into an "edit group" of related edits (eg, correcting authorship info for multiple works related to a single author). When ready, the editor would "submit" the edit group for review. During the review period, human editors vote and bots can perform automated checks. During this period the editor can make tweaks if necessary. After some fixed time period (72 hours?) with no changes and no blocking issues, the edit group would be auto-accepted if no merge conflicts have been created by other edits to the same entities. This process balances editing labor (reviews are easy, but optional) against quality (cool-down period makes it easier to detect and prevent spam or out-of-control bots). More sophisticated roles and permissions could allow certain humans and bots to push through edits more rapidly (eg, importing new works from a publisher API).
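The lifecycle above can be sketched as a small state model. This is an illustrative toy, not the actual fatcat implementation; the class, field names, and the 72-hour constant come straight from the workflow described in this paragraph.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

# Cool-down period before auto-accept, per the "(72 hours?)" proposal above.
REVIEW_PERIOD = timedelta(hours=72)

@dataclass
class EditGroup:
    editor_id: str
    edits: list = field(default_factory=list)  # individual entity edits
    submitted_at: Optional[datetime] = None
    blocking_issues: int = 0                   # raised by reviewers/bot checks

    def submit(self, now: datetime) -> None:
        """Editor submits the group for review; starts the cool-down clock."""
        self.submitted_at = now

    def can_auto_accept(self, now: datetime) -> bool:
        """True once the review period has elapsed with no blocking issues.

        (Merge-conflict detection against concurrent edits is elided here.)
        """
        return (
            self.submitted_at is not None
            and self.blocking_issues == 0
            and now - self.submitted_at >= REVIEW_PERIOD
        )

eg = EditGroup(editor_id="some-bot")
eg.edits.append({"entity": "release", "action": "update"})
t0 = datetime(2019, 2, 14, 12, 0)
eg.submit(t0)
assert not eg.can_auto_accept(t0 + timedelta(hours=24))  # still in review
assert eg.can_auto_accept(t0 + timedelta(hours=73))      # past cool-down
```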

Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and reverts manageable.

-Data progeny and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing external databases, the expectation is that special-purpose bot accounts are used, and tag timestamps and external identifiers in the edit metadata. Human editors would leave edit messages to clarify their sources.
+Data provenance and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing external databases, the expectation is that special-purpose bot accounts are used, and tag timestamps and external identifiers in the edit metadata. Human editors would leave edit messages to clarify their sources.
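For instance, a bot import might tag its editgroup with a provenance blob like the following. The key names here are hypothetical, the RFC does not fix a schema for the `extra` field; it only says timestamps and external identifiers should be tagged.

```python
import json
from datetime import datetime, timezone

# Hypothetical provenance metadata attached to an editgroup's `extra` field
# by a special-purpose import bot (key names are illustrative only).
extra = {
    "agent": "example-import-bot",
    "source": "some-upstream-database",
    "source_timestamp": datetime(2019, 2, 14, tzinfo=timezone.utc).isoformat(),
    "external_ids": {"batch": "2019-02-14-daily"},
}
blob = json.dumps(extra, sort_keys=True)
assert json.loads(blob)["agent"] == "example-import-bot"
```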

A style guide (wiki) and discussion forum would be hosted as separate stand-alone services for editors to propose projects and debate process or scope changes. These services should have unified accounts and logins (oauth?) to have consistent account IDs across all mediums.

Global Edit Changelog

As part of the process of "accepting" an edit group, a row would be written to an immutable, append-only log table (which internally could be a SQL table) documenting each identifier change. This changelog establishes a monotonically increasing version number for the entire corpus, and should make interaction with other systems easier (eg, search engines, replicated databases, alternative storage backends, notification frameworks, etc.).
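The changelog semantics described above (append-only, one monotonically increasing version per accepted edit group, replayable by followers) can be sketched in a few lines. This is an in-memory toy for illustration, not the actual fatcat code:

```python
class Changelog:
    """Append-only log; each accepted editgroup gets the next version number."""

    def __init__(self):
        self._entries = []  # never mutated after append; index + 1 == version

    def accept(self, editgroup_id: str, ident_changes: list) -> int:
        """Record an accepted editgroup and return the new corpus version."""
        version = len(self._entries) + 1  # monotonically increasing
        self._entries.append({
            "version": version,
            "editgroup": editgroup_id,
            "changes": ident_changes,
        })
        return version

    def since(self, version: int) -> list:
        """Entries after `version`, e.g. for a search index or replica to replay."""
        return self._entries[version:]

log = Changelog()
assert log.accept("eg-1", ["ident-a"]) == 1
assert log.accept("eg-2", ["ident-b", "ident-c"]) == 2
assert [e["editgroup"] for e in log.since(1)] == ["eg-2"]
```

The `since()` method is the hook that makes downstream synchronization (search engines, replicated databases, notification frameworks) simple: a follower only needs to remember the last version it processed.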

@@ -72,12 +72,12 @@ entity_edit
     ident (entity_ident foreign key)
     new_revision (entity_revision foreign key)
     previous_revision (optional; points to entity_revision)
-    extra: json blob for progeny metadata
+    extra: json blob for provenance metadata

 editgroup
     editor_id
     description
-    extra: json blob for progeny metadata
+    extra: json blob for provenance metadata

Additional entity-specific columns would hold actual metadata. Additional tables (which would reference both entity_revision and entity_id foreign keys as appropriate) would represent things like authorship relationships (creator/release), citations between works, etc. Every revision of an entity would require duplicating all of these associated rows, which could end up being a large source of inefficiency, but is necessary to represent the full history of an object.
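As a concrete (and much simplified) rendering of the tables sketched above, the following uses SQLite for illustration. Column types and constraints are guesses for demonstration purposes; the real schema would differ in detail:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE entity_revision (
    id INTEGER PRIMARY KEY,
    title TEXT  -- stand-in for entity-specific metadata columns
);
CREATE TABLE entity_ident (
    id INTEGER PRIMARY KEY,
    revision_id INTEGER REFERENCES entity_revision(id),
    redirect_id INTEGER REFERENCES entity_ident(id)  -- NULL unless redirected
);
CREATE TABLE entity_edit (
    id INTEGER PRIMARY KEY,
    ident INTEGER REFERENCES entity_ident(id),
    new_revision INTEGER REFERENCES entity_revision(id),
    previous_revision INTEGER,  -- optional; NULL for a creation
    extra TEXT                  -- json blob for provenance metadata
);
""")

# Create revision 1, point an ident at it, and record the edit that did so.
db.execute("INSERT INTO entity_revision (id, title) VALUES (1, 'A Work')")
db.execute("INSERT INTO entity_ident (id, revision_id) VALUES (1, 1)")
db.execute(
    "INSERT INTO entity_edit (ident, new_revision, previous_revision, extra) "
    "VALUES (1, 1, NULL, '{\"source\": \"some-import\"}')"
)

# Resolving an ident to its current metadata is a join through the revision.
row = db.execute(
    "SELECT r.title FROM entity_ident i "
    "JOIN entity_revision r ON i.revision_id = r.id WHERE i.id = 1"
).fetchone()
assert row == ("A Work",)
```

Updating an entity would insert a new `entity_revision` row, repoint `entity_ident.revision_id`, and record an `entity_edit` whose `previous_revision` preserves the history.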

Scope

The goal is to capture the "scholarly web": the graph of written works that cite other works. Any work that is both cited more than once and cites more than one other work in the catalog is very likely to be in scope. "Leaf nodes" and small islands of intra-cited works may or may not be in scope.
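The scope heuristic above (cited more than once, and citing more than one other in-catalog work) is essentially a degree filter on the citation graph. A minimal sketch, with a made-up example graph:

```python
def likely_in_scope(citations: dict) -> set:
    """citations maps each work to the set of in-catalog works it cites.

    Per the heuristic: a work is very likely in scope if it is cited more
    than once AND cites more than one other work in the catalog.
    """
    cited_by_count = {}
    for src, targets in citations.items():
        for tgt in targets:
            cited_by_count[tgt] = cited_by_count.get(tgt, 0) + 1
    return {
        w for w, targets in citations.items()
        if len(targets) > 1 and cited_by_count.get(w, 0) > 1
    }

graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},  # a "leaf node": cites only one work, cited by none
}
assert likely_in_scope(graph) == {"a", "b", "c"}
```

Works like `"d"` fall into the "leaf nodes and small islands" category that may or may not be in scope.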

-- cgit v1.2.3