author     Bryan Newbold <bnewbold@robocracy.org>  2018-06-17 13:27:59 -0700
committer  Bryan Newbold <bnewbold@robocracy.org>  2018-06-17 13:27:59 -0700
commit     c6dd01e26b74e066437821575ca6afd3ab9b07fc (patch)
tree       19a825aa1900c469c31c8698ef029420943e7507 /rfc.md
parent     7a00025afb5dbcef28f34dec04965437353723c8 (diff)
rename RFC document
Diffstat (limited to 'rfc.md')
-rw-r--r--  rfc.md  363
1 file changed, 0 insertions(+), 363 deletions(-)
diff --git a/rfc.md b/rfc.md
deleted file mode 100644
index 21495f6d..00000000
--- a/rfc.md
+++ /dev/null
@@ -1,363 +0,0 @@
-fatcat is a half-baked idea to build an open, independent, collaboratively
-editable bibliographic database of most written works, with a focus on
-published research outputs like journal articles, pre-prints, and conference
-proceedings.
-
-## Technical Architecture
-
-The canonical backend datastore would be a very large transactional SQL server.
-A relatively simple and stable back-end daemon would expose an API (could be
-REST, GraphQL, gRPC, etc). As little "application logic" as possible would be
-embedded in this back-end; as much as possible would be pushed to bots which
-could be authored and operated by anybody. A separate web interface project
-would talk to the API backend and could be developed more rapidly.
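-
-As an illustration of that division of labor, a client or bot might read
-entities through the API along these lines (the base URL, endpoint shape, and
-field names here are hypothetical placeholders, not a designed interface):
-
-    import requests
-
-    API = "https://api.fatcat.example/v0"   # hypothetical endpoint
-
-    # The backend daemon only stores and versions entities; interpretation,
-    # enrichment, and bulk fixes would live in clients and bots like this one.
-    work = requests.get(f"{API}/work/rzga5b9cd7efg").json()
-    print(work.get("title"), work.get("current_revision"))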
-
-A cronjob would make periodic database dumps, both in "full" form (all tables
-and all edit history, removing only authentication credentials) and "flat" form
-(with only the most recent version of each entity, using only persistent IDs
-between entities).
-
-A goal is to be linked-data/RDF/JSON-LD/semantic-web "compatible", but not
-necessarily "first". It should be possible to export the database in a
-relatively clean RDF form, and to fetch data in a variety of formats, but
-internally fatcat would not be backed by a triple-store, and would not be
-bound to a specific third-party ontology or schema.
-
-Microservice daemons should be able to proxy between the primary API and
-standard protocols like ResourceSync and OAI-PMH, and bots could consume
-external databases in those formats.
-
-## Licensing
-
-The core fatcat database should only contain verifiable factual statements
-(which isn't to say that all statements are "true"), not creative or derived
-content.
-
-The goal is to have a very permissively licensed database: CC-0 (no rights
-reserved) if possible. Under US law, it should be possible to scrape and pull
-in factual data from other corpuses without adopting their licenses. The goal
-here isn't to avoid all attribution (provenance information will be included, and a
-large sources and acknowledgments statement should be maintained), but trying
-to manage the intersection of all upstream source licenses seems untenable, and
-creates burdens for downstream users.
-
-Special care will need to be taken around copyright and original works. I would
-propose either not accepting abstracts at all, or including them in a
-partitioned database to prevent copyright contamination. Likewise, even simple
-user-created content like lists, reviews, ratings, comments, discussion,
-documentation, etc., should go in separate services.
-
-## Basic Editing Workflow and Bots
-
-Both human editors and bots would submit edits through the same API, with
-humans using either the default web interface, arbitrary integrations, or
-client software.
-
-The usual workflow would be to create edits (or creations, merges, deletions)
-to individual entities one at a time, all under a single "edit group" of
-related edits (eg, correcting authorship info for multiple works related to a
-single author). When ready, the editor would "submit" the edit group for
-review. During the review period, humans could vote (or veto/approve if they
-have higher permissions), and bots can perform automated checks. During this
-period the editor can make tweaks if necessary. After some fixed time period
-(72 hours?) with no changes and no blocking issues, the edit group would be
-auto-accepted, unless merge conflicts have arisen that cannot be resolved
-automatically. This process
-balances editing labor (reviews are easy, but optional) against quality
-(cool-down period makes it easier to detect and prevent spam or out-of-control
-bots). Advanced permissions could allow some trusted human and bot editors to
-push through edits more rapidly.
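-
-As a concrete sketch, a bot working through this lifecycle might look roughly
-like the following (the endpoints, payload fields, and token handling are all
-hypothetical placeholders for whatever API gets defined):
-
-    import requests
-
-    API = "https://api.fatcat.example/v0"    # hypothetical base URL
-    AUTH = {"Authorization": "Bearer BOT-TOKEN"}
-
-    # Open an edit group to hold a batch of related edits.
-    eg = requests.post(f"{API}/editgroup",
-                       json={"description": "fix authorship for J. Doe works"},
-                       headers=AUTH).json()
-
-    # Stage one edit per entity under that edit group.
-    for work_id in ["rzga5b9cd7efg", "rzga5b9cd7efh"]:
-        requests.post(f"{API}/work/{work_id}/edit",
-                      json={"editgroup": eg["id"],
-                            "contribs": [{"name": "Jane Doe", "role": "author"}]},
-                      headers=AUTH)
-
-    # Submit for review; with no changes and no blocking issues, the group
-    # would be auto-accepted after the cool-down period described above.
-    requests.put(f"{API}/editgroup/{eg['id']}/submit", headers=AUTH)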
-
-Bots would need to be tuned to have appropriate edit group sizes (eg, daily
-batches, instead of millions of works in a single edit) to make human QA and
-reverts possible.
-
-Data provenance and citation would be left to the edit history. In the case of
-importing external databases, the expectation would be that special-purpose
-bot accounts would be used. Human editors would leave edit messages to clarify
-their sources.
-
-A style guide (wiki), chat room, and discussion forum would be hosted as
-separate stand-alone services for editors to propose projects and debate
-process or scope changes. It would be best if these could use federated account
-authorization (OAuth?) to keep account IDs consistent across services.
-
-## Edit Log
-
-As part of the process of "accepting" an edit group, a row would be written to
-an immutable, append-only log table (which internally could be a SQL table)
-documenting each identifier change. This log establishes a monotonically
-increasing version number for the entire corpus, and should make interaction
-with other systems easier (eg, search engines, replicated databases,
-alternative storage backends, notification frameworks, etc.).
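-
-A downstream consumer (eg, a search index updater) might tail such a log along
-these lines; the `/changelog` endpoint and entry fields are assumptions for
-illustration only:
-
-    import time
-    import requests
-
-    API = "https://api.fatcat.example/v0"   # hypothetical base URL
-
-    def reindex(entry):
-        """Placeholder for downstream work, eg updating a search index."""
-        print(entry["index"], entry["editgroup_id"])
-
-    cursor = 0  # highest changelog index already processed
-    while True:
-        entries = requests.get(f"{API}/changelog",
-                               params={"since": cursor}).json()
-        for entry in entries:   # entries are strictly ordered by index
-            reindex(entry)
-            cursor = entry["index"]
-        time.sleep(60)  # poll; a push/notification channel could replace this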
-
-## Identifiers
-
-A fixed number of first-class "entities" would be defined, with common
-behavior and schema layouts. These would all be semantic entities like "work",
-"release", "container", and "person".
-
-fatcat identifiers would be semantically meaningless fixed-length random numbers,
-usually represented in case-insensitive base32 format. Each entity type would
-have its own identifier namespace. Eg, 96-bit identifiers would have 20
-characters and look like:
-
-    fcwork_rzga5b9cd7efgh04iljk
-    https://fatcat.org/work/rzga5b9cd7efgh04iljk
-
-128-bit (UUID size) would have 26 characters:
-
-    fcwork_rzga5b9cd7efgh04iljk8f3jvz
-    https://fatcat.org/work/rzga5b9cd7efgh04iljk8f3jvz
-
-A 64-bit namespace is probably plenty though, and would work with most database
-Integer columns:
-
-    fcwork_rzga5b9cd7efg
-    https://fatcat.org/work/rzga5b9cd7efg
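-
-A minimal sketch of minting such identifiers. This uses Python's standard RFC
-4648 base32 alphabet (a-z, 2-7); the examples above suggest a variant alphabet
-(eg, Crockford base32) might be preferred in practice:
-
-    import base64
-    import secrets
-
-    def generate_ident(entity_type="work", nbits=96):
-        """Mint a random fixed-length identifier: 64 bits -> 13 base32
-        characters, 96 -> 20, 128 -> 26 (matching the examples above)."""
-        raw = secrets.token_bytes(nbits // 8)
-        b32 = base64.b32encode(raw).decode("ascii").rstrip("=").lower()
-        return f"fc{entity_type}_{b32}"
-
-    print(generate_ident())           # fcwork_ plus 20 base32 characters
-    print(generate_ident(nbits=64))   # fcwork_ plus 13 base32 characters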
-
-The idea would be for fatcat identifiers to be used only to interlink between
-databases, *not* to supplant DOIs, ISBNs, handles, ARKs, and other "registered"
-persistent identifiers.
-
-## Entities and Internal Schema
-
-Internally, identifiers would be lightweight pointers to actual metadata
-objects, which can be thought of as "versions". The metadata objects themselves
-would be immutable once committed; the edit process is one of creating new
-objects and, if the edit is approved, pointing the identifier to the new
-version. Entities would reference between themselves by identifier.
-
-Edit objects represent a change to a single entity; edits get batched together
-into edit groups (like "commits" and "pull requests" in git parlance).
-
-SQL tables would probably look something like the following, though they would
-be specific to each entity type (eg, there would be an actual `work_revision`
-table, but not an actual `entity_revision` table):
-
-    entity_id
-        uuid
-        current_revision
-
-    entity_revision
-        entity_id (bi-directional?)
-        previous: entity_revision or none
-        state: normal, redirect, deletion
-        redirect_entity_id: optional
-        extra: json blob
-        edit_id
-
-    edit
-        mutable: boolean
-        edit_group
-        editor
-
-    edit_group
-
-Additional type-specific columns would hold actual metadata. Additional tables
-(which would reference both `entity_revision` and `entity_id` foreign keys as
-appropriate) would represent things like external identifiers, ordered
-author/work relationships, citations between works, etc. Every revision of an
-entity would require duplicating all of these associated rows, which could end
-up being a large source of inefficiency, but is necessary to represent the full
-history of an object.
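-
-To make the "identifiers as lightweight pointers to immutable revisions" model
-concrete, accepting a single edit could reduce to a pointer swap like this
-(a sketch only: it assumes a sqlite3-style connection and the provisional
-table and column names from the sketch above):
-
-    def accept_edit(conn, entity_type, ident_uuid, new_revision_id):
-        """Point a (mutable) identifier row at a new (immutable) revision."""
-        with conn:  # in practice, one transaction per accepted edit group
-            conn.execute(
-                f"UPDATE {entity_type}_id"
-                " SET current_revision = ? WHERE uuid = ?",
-                (new_revision_id, ident_uuid),
-            )
-
-Accepting a whole edit group would wrap one such update per edit, plus the
-append-only log row described above, in a single transaction.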
-
-## Scope
-
-The target is the "scholarly web": the graph of works that cite other works.
-Certainly every work that is cited more than once, and every work that both
-cites and is cited, is in scope; "leaf nodes" and small islands might not be.
-
-The focus is on written works, with some exceptions. Core media (for which we
-would pursue "completeness") are expected to be:
-
-    journal articles
-    books
-    conference proceedings
-    technical memos
-    dissertations
-
-Probably in scope:
-
-    reports
-    magazine articles
-    published poetry
-    essays
-    government documents
-    conference presentations (slides, video)
-    datasets
-
-Probably not:
-
-    patents
-    court cases and legal documents
-    manuals
-    datasheets
-    courses
-
-Definitely not:
-
-    audio recordings
-    tv show episodes
-    musical scores
-    advertisements
-
-Author, citation, and work disambiguation would be core tasks. Linking
-pre-prints to final publication is in scope.
-
-I'm much less interested in altmetrics, funding, and grant relationships than
-most existing databases in this space.
-
-fatcat would not include any fulltext content itself, even for cleanly licensed
-(open access) works, but would have "strong" (verified) links to fulltext
-content, and would include file-level metadata (like hashes and fingerprints)
-to help with discovery and to identify content from any source. Typed file-level
-links should let both humans and machines reach fulltext content of a given
-mimetype more quickly than existing redirect or landing-page systems.
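-
-The file-level metadata that such links depend on could be computed along
-these lines (the exact set of hashes and fields here is an assumption; the
-real schema would pin down which digests and fingerprints are required):
-
-    import hashlib
-
-    def file_metadata(path, url=None):
-        """Compute the hash metadata a 'file' entity might carry."""
-        sha1, sha256, md5 = hashlib.sha1(), hashlib.sha256(), hashlib.md5()
-        size = 0
-        with open(path, "rb") as f:
-            for chunk in iter(lambda: f.read(1 << 20), b""):
-                size += len(chunk)
-                sha1.update(chunk)
-                sha256.update(chunk)
-                md5.update(chunk)
-        return {
-            "size_bytes": size,
-            "sha1": sha1.hexdigest(),
-            "sha256": sha256.hexdigest(),
-            "md5": md5.hexdigest(),
-            "urls": [url] if url else [],
-            "mimetype": "application/pdf",  # placeholder; detect in practice
-        }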
-
-## Ontology
-
-Loosely following FRBR, but removing the "manifestation" abstraction, and
-favoring files (digital artifacts) over physical items, the primary entities
-are:
-
-    work
-        type
-        <has> contributors
-        <about> subject/category
-        <has-primary> release
-
-    release (aka "edition", "variant")
-        title
-        volume/pages/issue/chapter
-        open-access status
-        <published> date
-        <of a> work
-        <published-by> publisher
-        <published in> container
-        <has> contributors
-        <citation> citetext <to> release
-        <has> identifier
-
-    file (aka "digital artifact")
-        <of a> release
-        <has> hashes
-        <found at> URLs
-        <held-at> institution <with> accession
-
-    contributor
-        name
-        <has> aliases
-        <has> affiliation <for> date span
-        <has> identifier
-
-    container
-        name
-        open-access policy
-        peer-review policy
-        <has> aliases, acronyms
-        <about> subject/category
-        <has> identifier
-        <published in> container
-        <published-by> publisher
-
-    publisher
-        name
-        <has> aliases, acronyms
-        <has> identifier
-
-## Controlled Vocabularies
-
-Some special namespace tables and enums would probably be helpful; these should
-live in the database (not requiring a database migration to update), but should
-have a more controlled editing workflow... perhaps versioned in the codebase
-(see the sketch after this list):
-
-- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc)
-- subject categorization
-- license and open access status
-- work "types" (article vs. book chapter vs. proceeding, etc)
-- contributor types (author, translator, illustrator, etc)
-- human languages
-- file mimetypes
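-
-A sketch of what "versioned in the codebase" could mean in practice; the
-specific entries are illustrative, not a proposed vocabulary:
-
-    from enum import Enum
-
-    class ContribType(str, Enum):
-        """Controlled vocabulary for contributor roles."""
-        AUTHOR = "author"
-        TRANSLATOR = "translator"
-        ILLUSTRATOR = "illustrator"
-        EDITOR = "editor"
-
-    # Identifier namespaces could be seeded into a small table from a list:
-    IDENTIFIER_NAMESPACES = ["doi", "isbn13", "issn", "orcid", "pmid", "arxiv"]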
-
-## Unresolved Questions
-
-How to handle translations of, eg, titles and author names? To be clear, not
-translations of works (which are just separate releases).
-
-Are bi-directional links a schema anti-pattern? Eg, should "work" point to a
-primary "release" (which itself points back to the work), or should "release"
-have an "is-primary" flag?
-
-Should `identifier` and `citation` be their own entities, referencing other
-entities by UUID instead of by revision? This could save a ton of database
-space and churn.
-
-Should contributor/author contact information be retained? It could be very
-useful for disambiguation, but we don't want to build a huge database for
-spammers or "innovative" start-up marketing.
-
-Would general-purpose SQL databases like Postgres or MySQL scale well enough
-to hold several tables with billions of entries? Right from the start there
-are hundreds of millions of works and releases, many of which have dozens of
-citations, many authors, and many identifiers, and then we'll have potentially
-dozens of edits for each of these, which multiply out to `1e8 * 2e1 * 2e1 =
-4e10`, or 40 billion rows in the citation table. If each row was 32 bytes on
-average (uncompressed, not including index size), that would be about 1.3 TB on
-its own, larger than a typical SSD. I think a transactional SQL datastore is
-the right answer. In my experience locking and index rebuild times are usually
-the biggest scaling challenges; the largely-immutable architecture here should
-mitigate locking. Hopefully few indexes would be needed in the primary
-database, as user interfaces could rely on secondary read-only search engines
-for more complex queries and views.
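-
-The back-of-envelope arithmetic above, written out (the fan-out factors and
-per-row size are rough guesses, as noted):
-
-    works          = 1e8   # hundreds of millions of works/releases to start
-    citations_each = 20    # "dozens" of citations per work
-    edits_each     = 20    # "dozens" of revisions over time
-    row_bytes      = 32    # average citation row size, uncompressed, no index
-
-    rows = works * citations_each * edits_each      # 4e10 rows
-    print(f"{rows:.1e} citation rows")              # 4.0e+10
-    print(f"{rows * row_bytes / 1e12:.2f} TB raw")  # ~1.28 TB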
-
-I see a tension between focus and scope creep. If a central database like
-fatcat doesn't support enough fields and metadata, then it will not be possible
-to completely import other corpuses, and this becomes "yet another" partial
-bibliographic database. On the other hand, accepting arbitrary data leads to
-other problems: sparseness increases (we have more "partial" data), potential
-for redundancy is high, humans will start editing content that might be
-bulk-replaced, etc.
-
-There might be a need to support "stub" references between entities. Eg, when
-adding citations from PDF extraction, the cited works are likely to be
-ambiguous. Could create "stub" works to be merged/resolved later, or could
-leave the citation hanging. Same with authors, containers (journals), etc.
-
-## References and Previous Work
-
-The closest overall analog of fatcat is [MusicBrainz][mb], a collaboratively
-edited music database. [Open Library][ol] is a very similar existing service,
-which exclusively contains book metadata.
-
-[Wikidata][wd] seems to be the most successful and actively edited/developed
-open bibliographic database at this time (early 2018), including the
-[wikicite][wikicite] conference and related Wikimedia/Wikipedia projects.
-Wikidata is a general purpose semantic database of entities, facts, and
-relationships; bibliographic metadata has become a large fraction of all
-content in recent years. The focus there seems to be linking knowledge
-(statements) to specific sources unambiguously. Potential advantages for fatcat
-would be a focus on a specific scope (not a general-purpose database
-of entities) and a goal of completeness (capturing as many works and
-relationships as rapidly as possible). However, it might be better to just
-pitch in to the Wikidata efforts.
-
-The technical design of fatcat is loosely inspired by the git
-branch/tag/commit/tree architecture, and specifically inspired by Oliver
-Charles' "New Edit System" [blog posts][nes-blog] from 2012.
-
-There are a whole bunch of proprietary, for-profit bibliographic databases,
-including Web of Science, Google Scholar, Microsoft Academic Graph, aminer,
-Scopus, and Dimensions. There are excellent field-limited databases like dblp,
-MEDLINE, and Semantic Scholar. There are some large general-purpose databases
-that are not directly user-editable, including the OpenCitation corpus, CORE,
-BASE, and CrossRef. I don't know of any large (more than 60 million works),
-open (bulk-downloadable with permissive or no license), field agnostic,
-user-editable corpus of scholarly publication bibliographic metadata.
-
-[nes-blog]: https://ocharles.org.uk/blog/posts/2012-07-10-nes-does-it-better-1.html
-[mb]: https://musicbrainz.org
-[ol]: https://openlibrary.org
-[wd]: https://wikidata.org
-[wikicite]: https://meta.wikimedia.org/wiki/WikiCite_2017
-