From 76ac2a96a6bd3910f8f4af18f79b539b1d29edf9 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 14 Feb 2019 12:24:55 -0800
Subject: provenance, not progeny

---
 guide/src/container_extra.md | 78 ++++++++++++++++++++++++++++++++++++++++++++
 guide/src/cookbook.md        |  2 +-
 guide/src/data_model.md      |  4 +--
 guide/src/policies.md        |  6 ++--
 guide/src/workflow.md        |  2 +-
 5 files changed, 85 insertions(+), 7 deletions(-)
 create mode 100644 guide/src/container_extra.md

(limited to 'guide')

diff --git a/guide/src/container_extra.md b/guide/src/container_extra.md
new file mode 100644
index 00000000..224b7e8a
--- /dev/null
+++ b/guide/src/container_extra.md
@@ -0,0 +1,78 @@
+
+'extra' fields:
+
+    doaj
+        as_of: datetime of most recent check; if not set, not actually in DOAJ
+        seal: bool
+        work_level: bool (are work-level publications deposited with DOAJ?)
+        archiving: array, can include 'library' or 'other'
+    road
+        as_of: datetime of most recent check; if not set, not actually in ROAD
+    pubmed (TODO: delete?)
+        as_of: datetime of most recent check; if not set, not actually indexed in pubmed
+    norwegian (TODO: drop this?)
+        as_of: datetime of most recent check; if not set, not actually indexed in pubmed
+        id (integer)
+        level (integer; 0-2)
+    kbart
+        lockss
+            year_rle
+            volume_rle
+        portico
+            ...
+        clockss
+            ...
+    sherpa_romeo
+        color
+    jstor
+        year_rle
+        volume_rle
+    scopus
+        id
+        TODO: print/electronic distinction?
+    wos
+        id
+    doi
+        crossref_doi: DOI of the title in crossref (if exists)
+        prefixes: array of strings (DOI prefixes, up to the '/'; any registrar, not just Crossref)
+    ia
+        sim
+            nap_id
+            year_rle
+            volume_rle
+        longtail: boolean
+        homepage
+            as_of: datetime of last attempt
+            url
+            status: HTTP/heritrix status of homepage crawl
+
+    issnp: string
+    issne: string
+    coden: string
+    abbrev: string
+    oclc_id: string (TODO: lookup?)
+    lccn_id: string (TODO: lookup?)
+    dblb_id: string
+    default_license: slug
+    original_name: native name (if name is translated)
+    platform: hosting platform: OJS, wordpress, scielo, etc
+    mimetypes: array of strings (eg, 'application/pdf', 'text/html')
+    first_year: year (integer)
+    last_year: if publishing has stopped
+    primary_language: single ISO code, or 'mixed'
+    languages: array of ISO codes
+    region: TODO: continent/world-region
+    nation: shortcode of nation
+    discipline: TODO: highest-level subject; "life science", "humanities", etc
+    field: TODO: narrower description of field
+    subjects: TODO?
+    url: homepage
+    is_oa: boolean. If true, can assume all releases under this container are "Open Access"
+    TODO: domains, if exclusive?
+    TODO: fulltext_regex, if a known pattern?
+
+For KBART, etc:
+    We "over-count" on the assumption that "in-progress" status works will soon actually be preserved.
+    year and volume spans are run-length-encoded arrays, using integers:
+      - if an integer, means that year is preserved
+      - if an array of length 2, means everything between the two numbers (inclusive) is preserved
diff --git a/guide/src/cookbook.md b/guide/src/cookbook.md
index 74bffe59..03c2981a 100644
--- a/guide/src/cookbook.md
+++ b/guide/src/cookbook.md
@@ -33,5 +33,5 @@ When bootstrapping a blank catalog, we need to insert 10s or 100s of millions of
 entities as fast as possible.
 
 
-1. Create (POST) a new editgroup, with progeny information included
+1. Create (POST) a new editgroup, with provenance information included
 2. Batch create (POST) entities
diff --git a/guide/src/data_model.md b/guide/src/data_model.md
index 2d6f7287..21d265e1 100644
--- a/guide/src/data_model.md
+++ b/guide/src/data_model.md
@@ -132,12 +132,12 @@ SQL tables look something like this (with separate tables for entity type a la
         new_revision (entity_revision foreign key)
         new_redirect (optional; points to entity_ident table)
         previous_revision (optional; points to entity_revision)
-        extra: json blob for progeny metadata
+        extra: json blob for provenance metadata
 
     editgroup
         editor_id (editor table foreign key)
         description
-        extra: json blob for progeny metadata
+        extra: json blob for provenance metadata
 
 An individual entity can be in the following "states", from which the given
 actions (transition) can be made:
diff --git a/guide/src/policies.md b/guide/src/policies.md
index 03e5e526..e61984be 100644
--- a/guide/src/policies.md
+++ b/guide/src/policies.md
@@ -16,13 +16,13 @@ The Fatcat catalog content license is the Creative Commons Zero ("CC-0")
 license, which is effectively a public domain grant. This applies to the
 catalog metadata itself (titles, entity relationships, citation metadata, URLs,
 hashes, identifiers), as well as "meta-meta-data" provided by editors (edit
-descriptions, progeny metadata, etc).
+descriptions, provenance metadata, etc).
 
 The core catalog is designed to contain only factual information: "this work,
 known by this title and with these third-party identifiers, is believed to be
 represented by these files and published under such-and-such venue". As a norm,
-sourcing metadata (for attribution and progeny) is retained for each edit made
-to the catalog.
+sourcing metadata (for attribution and provenance) is retained for each edit
+made to the catalog.
 
 A notable exception to this policy are abstracts, for which no copyright claims
 or license is made. Abstract content is kept separate from core catalog
diff --git a/guide/src/workflow.md b/guide/src/workflow.md
index 996fb24c..94842e54 100644
--- a/guide/src/workflow.md
+++ b/guide/src/workflow.md
@@ -24,7 +24,7 @@ Bots need to be tuned to have appropriate edit group sizes (eg, daily batches,
 instead of millions of works in a single edit) to make human QA review and
 reverts manageable.
 
-Data progeny and source references are captured in the edit metadata, instead
+Data provenance and source references are captured in the edit metadata, instead
 of being encoded in the entity data model itself. In the case of importing
 external databases, the expectation is that special-purpose bot accounts are
 be used, and tag timestamps and external identifiers in the edit metadata.
-- 
cgit v1.2.3
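
The run-length encoding that container_extra.md describes for `year_rle` and `volume_rle` (a bare integer marks one preserved year or volume; a two-element array marks an inclusive range) could be decoded roughly as follows. This is a minimal sketch, not code from the fatcat repository: the helper names `expand_rle` and `is_preserved` are hypothetical, and the sample data is invented for illustration.

```python
from typing import Iterable, List, Union

# One entry of a year_rle / volume_rle array: either a single integer,
# or a two-element list [start, end] covering an inclusive range.
Span = Union[int, List[int]]


def expand_rle(spans: Iterable[Span]) -> List[int]:
    """Expand a run-length-encoded span list into individual integers.

    Hypothetical helper; follows the two rules stated in the guide text.
    """
    values: List[int] = []
    for span in spans:
        if isinstance(span, int):
            # A bare integer means that single year/volume is preserved.
            values.append(span)
        elif len(span) == 2:
            # A length-2 array means everything between the two numbers,
            # inclusive on both ends, is preserved.
            start, end = span
            values.extend(range(start, end + 1))
        else:
            raise ValueError(f"unexpected span entry: {span!r}")
    return values


def is_preserved(spans: Iterable[Span], value: int) -> bool:
    """Return True if `value` falls inside any encoded span."""
    return value in expand_rle(spans)


# Invented sample data, shaped like a `year_rle` field.
sample_year_rle = [1990, [1995, 1998], 2003]
print(expand_rle(sample_year_rle))
```

Expanding to a flat list keeps the sketch short; for very long ranges, comparing against the span bounds directly would avoid materializing every year.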