From a2f2a9b250bb845f12d34d6892f1d7d0a50c3b7b Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 24 Nov 2021 15:44:02 -0800
Subject: codespell fixes in web interface templates
---
 python/fatcat_web/templates/rfc.html | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

(limited to 'python/fatcat_web/templates/rfc.html')

diff --git a/python/fatcat_web/templates/rfc.html b/python/fatcat_web/templates/rfc.html
index c7e7149f..fba6eff3 100644
--- a/python/fatcat_web/templates/rfc.html
+++ b/python/fatcat_web/templates/rfc.html
@@ -25,7 +25,7 @@

As little "application logic" as possible should be embedded in this back-end; as much as possible would be pushed to bots which could be authored and operated by anybody. A separate web interface project talks to the API backend and can be developed more rapidly with less concern about data loss or corruption.

A cronjob will create periodic database dumps, both in "full" form (all tables and all edit history, removing only authentication credentials) and "flattened" form (with only the most recent version of each entity).
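
A minimal sketch of that dump job, assuming a PostgreSQL backing store; the database name, the excluded auth_credential table, and the work_ident/work_revision join below are illustrative assumptions, not the actual schema:

    # Rough sketch of the two dump flavors described above; database and
    # table names are assumptions for illustration.
    import datetime
    import subprocess

    def full_dump(dbname="fatcat"):
        # "Full" dump: every table and the complete edit history, excluding
        # the (assumed) credentials table.
        out = f"fatcat-full-{datetime.date.today().isoformat()}.sql"
        subprocess.run(
            ["pg_dump", "--no-owner", "--exclude-table=auth_credential",
             "--file", out, dbname],
            check=True,
        )
        return out

    def flattened_dump(dbname="fatcat"):
        # "Flattened" dump: only the most recent revision of each entity,
        # here by copying a join from idents to their current revisions.
        out = f"fatcat-flattened-{datetime.date.today().isoformat()}.tsv"
        query = ("SELECT * FROM work_ident "
                 "JOIN work_revision ON work_ident.rev_id = work_revision.id")
        subprocess.run(
            ["psql", dbname, "-c", f"\\copy ({query}) TO '{out}'"],
            check=True,
        )
        return out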

A goal is to be linked-data/RDF/JSON-LD/semantic-web "compatible", but not necessarily "first". It should be possible to export the database in a relatively clean RDF form, and to fetch data in a variety of formats, but internally fatcat will not be backed by a triple-store, and will not be bound to a rigid third-party ontology or schema.

-Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third party bots could ingest or synchronize the databse in those formats.
+Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third party bots could ingest or synchronize the database in those formats.
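
As a rough illustration of such a proxy (not an actual fatcat component), a daemon could map entity metadata fetched from the API into OAI-PMH-style Dublin Core records; the "oai:fatcat.wiki" identifier prefix and the field choices here are assumptions:

    # Sketch of wrapping a work entity in an OAI-PMH-style record fragment.
    # A full GetRecord response would also need the protocol envelope,
    # request echo, and error handling; only the record element is shown.
    import datetime
    from xml.sax.saxutils import escape

    def oai_dc_record(ident: str, title: str, datestamp: datetime.date) -> str:
        return (
            "<record>"
            f"<header><identifier>oai:fatcat.wiki:work/{escape(ident)}</identifier>"
            f"<datestamp>{datestamp.isoformat()}</datestamp></header>"
            "<metadata><oai_dc:dc"
            ' xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"'
            ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
            f"<dc:title>{escape(title)}</dc:title>"
            f"<dc:identifier>https://fatcat.wiki/work/{escape(ident)}</dc:identifier>"
            "</oai_dc:dc></metadata></record>"
        )

    print(oai_dc_record("rzga5b9cd7efg", "An Example Work", datetime.date(2021, 11, 24)))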

Licensing

The core fatcat database should only contain verifiable factual statements (which isn't to say that all statements are "true"), not creative or derived content.

The goal is to have a very permissively licensed database: CC-0 (no rights reserved) if possible. Under US law, it should be possible to scrape and pull in factual data from other corpuses without adopting their licenses. The goal here isn't to avoid attribution (provenance information will be included, and a large sources and acknowledgments statement should be maintained and shipped with bulk exports), but trying to manage the intersection of all upstream source licenses seems untenable, and creates burdens for downstream users and developers.

@@ -33,7 +33,7 @@

Basic Editing Workflow and Bots

Both human editors and bots should have edits go through the same API, with humans using either the default web interface, integrations, or client software.

The normal workflow is to create edits (or updates, merges, deletions) on individual entities. Individual changes are bundled into an "edit group" of related edits (eg, correcting authorship info for multiple works related to a single author). When ready, the editor would "submit" the edit group for review. During the review period, human editors vote and bots can perform automated checks. During this period the editor can make tweaks if necessary. After some fixed time period (72 hours?) with no changes and no blocking issues, the edit group would be auto-accepted if no merge conflicts have been created by other edits to the same entities. This process balances editing labor (reviews are easy, but optional) against quality (cool-down period makes it easier to detect and prevent spam or out-of-control bots). More sophisticated roles and permissions could allow certain humans and bots to push through edits more rapidly (eg, importing new works from a publisher API).
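
A minimal sketch of that submit/review/auto-accept flow; the class, method names, and the 72-hour window below are assumptions drawn from the prose above, not a real fatcat client library:

    # Sketch of the edit group life cycle: collect edits, submit, then
    # auto-accept only after the cool-down with no conflicts or blockers.
    import datetime

    REVIEW_WINDOW = datetime.timedelta(hours=72)  # assumed cool-down period

    class EditGroup:
        def __init__(self, editor_id, description):
            self.editor_id = editor_id
            self.description = description
            self.edits = []          # individual entity edits
            self.submitted_at = None
            self.accepted = False

        def add_edit(self, entity_ident, new_revision):
            self.edits.append((entity_ident, new_revision))

        def submit(self, now):
            self.submitted_at = now  # starts the review period

        def try_auto_accept(self, now, has_conflicts, has_blocking_issues):
            # Accepted only after the cool-down, with no merge conflicts and
            # no blocking issues raised by reviewers or bots.
            if self.submitted_at is None or has_conflicts or has_blocking_issues:
                return False
            if now - self.submitted_at >= REVIEW_WINDOW:
                self.accepted = True
            return self.accepted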

-Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and reverts managable.
+Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and reverts manageable.
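
One way a bot could keep edit groups reviewable is to split a large import into bounded batches, each becoming its own edit group; the batch size here is an arbitrary illustration, not a recommended value:

    # Sketch: chunk a stream of edits into reviewable batches.
    from itertools import islice

    def batched(edits, batch_size=10_000):
        it = iter(edits)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                return
            yield batch  # each batch would become its own edit group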

Data provenance and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing external databases, the expectation is that special-purpose bot accounts would be used, tagging timestamps and external identifiers in the edit metadata. Human editors would leave edit messages to clarify their sources.
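
An illustrative shape for such edit metadata; the field names and values are hypothetical, not the actual fatcat schema:

    # Hypothetical edit-group metadata carrying provenance for a bot import.
    edit_metadata = {
        "agent": "crossref-import-bot",           # special-purpose bot account
        "source": "crossref",                     # upstream corpus
        "source_timestamp": "2021-11-24T00:00:00Z",
        "external_ids": {"doi": "10.1000/example.doi"},
        "description": "daily metadata import batch",
    }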

A style guide (wiki) and discussion forum would be hosted as separate stand-alone services for editors to propose projects and debate process or scope changes. These services should have unified accounts and logins (oauth?) to have consistent account IDs across all mediums.

Global Edit Changelog

@@ -47,13 +47,13 @@ https://fatcat.wiki/work/rzga5b9cd7efgh04iljk8f3jvz

In comparison, 96-bit identifiers would have 20 characters and look like:

work_rzga5b9cd7efgh04iljk
 https://fatcat.wiki/work/rzga5b9cd7efgh04iljk
-A 64-bit namespace would probably be large enought, and would work with database Integer columns:
+A 64-bit namespace would probably be large enough, and would work with database Integer columns:

work_rzga5b9cd7efg
 https://fatcat.wiki/work/rzga5b9cd7efg
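
The character counts above are consistent with a base32-style encoding at 5 bits per character, so the identifier length is just ceil(bits / 5); a quick check:

    # 5 bits per character => ceil(bits / 5) characters per identifier.
    import math

    for bits in (128, 96, 64):
        print(bits, "bits ->", math.ceil(bits / 5), "characters")
    # 128 bits -> 26 characters
    #  96 bits -> 20 characters
    #  64 bits -> 13 characters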

The idea would be to only have fatcat identifiers be used to interlink between databases, not to supplant DOIs, ISBNs, Handles, ARKs, and other "registered" persistent identifiers.

Entities and Internal Schema

Internally, identifiers would be lightweight pointers to "revisions" of an entity. Revisions are stored in their complete form, not as a patch or difference; if comparing to distributed version control systems, this is the git model, not the mercurial model.

-The entity revisions are immutable once accepted; the editting process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by identifier not revision number. Identifier pointers also support (versioned) deletion and redirects (for merging entities).
+The entity revisions are immutable once accepted; the editing process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by identifier not revision number. Identifier pointers also support (versioned) deletion and redirects (for merging entities).
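
A thin in-memory analogue of that identifier-as-pointer model, as a sketch only; the field and function names are assumptions, not the ident/revision tables themselves:

    # Sketch: identifiers are mutable pointers to immutable revisions.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class EntityIdent:
        ident: str
        rev_id: Optional[int] = None       # None => deleted
        redirect_to: Optional[str] = None  # set when merged into another ident

    def accept_edit(ident: EntityIdent, new_rev_id: int) -> None:
        # Revisions are never modified; accepting an edit only moves the
        # pointer to the newly created revision.
        ident.redirect_to = None
        ident.rev_id = new_rev_id

    def merge(duplicate: EntityIdent, canonical: EntityIdent) -> None:
        # Merging entities keeps the duplicate ident resolvable as a redirect.
        duplicate.rev_id = None
        duplicate.redirect_to = canonical.ident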

Edit objects represent a change to a single entity; edits get batched together into edit groups (like "commits" and "pull requests" in git parlance).

SQL tables would probably look something like the following (but specific to each entity type, with tables like work_revision, not entity_revision):

entity_ident
@@ -158,7 +158,7 @@ container (aka "venue", "serial", "title")
 

Controlled Vocabularies

Some special namespace tables and enums would probably be helpful; these could live in the database (not requiring a database migration to update), but should have more controlled editing workflow... perhaps versioned in the codebase:

-  • identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifers themselves)
+  • identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifiers themselves)
  • subject categorization
  • license and open access status
  • work "types" (article vs. book chapter vs. proceeding, etc)
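
As a sketch of the "versioned in the codebase" option, the identifier namespaces could live as a small enum checked into the repository and changed through normal code review rather than database migrations; this is an assumed illustration, not the actual implementation:

    # Hypothetical controlled vocabulary versioned in the codebase.
    from enum import Enum

    class IdentifierNamespace(Enum):
        DOI = "doi"
        ISBN = "isbn"
        ISSN = "issn"
        ORCID = "orcid"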