fatcat is a half-baked idea to build an open, independent, collaboratively
editable bibliographic database of most written works, with a focus on
published research outputs like journal articles, pre-prints, and conference
proceedings.

## Technical Architecture

The canonical backend datastore would be a very large transactional SQL
server. A relatively simple and stable back-end daemon would expose an API
(could be REST, GraphQL, gRPC, etc). As little "application logic" as possible
would be embedded in this back-end; as much as possible would be pushed to
bots, which could be authored and operated by anybody. A separate web
interface project would talk to the API backend and could be developed more
rapidly.

A cronjob would make periodic database dumps, both in "full" form (all tables
and all edit history, removing only authentication credentials) and "flat"
form (with only the most recent version of each entity, using only persistent
IDs between entities).

A goal is to be linked-data/RDF/JSON-LD/semantic-web "compatible", but not
necessarily "first". It should be possible to export the database in a
relatively clean RDF form, and to fetch data in a variety of formats, but
internally fatcat would not be backed by a triple-store, and would not be
bound to a specific third-party ontology or schema.

Microservice daemons should be able to proxy between the primary API and
standard protocols like ResourceSync and OAI-PMH, and bots could consume
external databases in those formats.

## Licensing

The core fatcat database should only contain verifiable factual statements
(which isn't to say that all statements are "true"), not creative or derived
content.

The goal is to have a very permissively licensed database: CC-0 (no rights
reserved) if possible. Under US law, it should be possible to scrape and pull
in factual data from other corpuses without adopting their licenses. The goal
here isn't to avoid all attribution (provenance information will be included,
and a large sources and acknowledgments statement should be maintained), but
trying to manage the intersection of all upstream source licenses seems
untenable, and creates burdens for downstream users.

Special care will need to be taken around copyright and original works. I
would propose either not accepting abstracts at all, or including them in a
partitioned database to prevent copyright contamination. Likewise, even simple
user-created content like lists, reviews, ratings, comments, discussion,
documentation, etc., should go in separate services.

## Basic Editing Workflow and Bots

Both human editors and bots would have edits go through the same API, with
humans using either the default web interface, arbitrary integrations, or
client software.

The usual workflow would be to create edits (or creations, merges, deletions)
to individual entities one at a time, all under a single "edit group" of
related edits (eg, correcting authorship info for multiple works related to a
single author). When ready, the editor would "submit" the edit group for
review. During the review period, humans could vote (or veto/approve if they
have higher permissions), and bots could perform automated checks. During this
period the editor can make tweaks if necessary. After some fixed time period
(72 hours?) with no changes and no blocking issues, the edit group would be
auto-accepted, assuming no merge conflicts have arisen that can't be
automatically resolved. This process balances editing labor (reviews are easy,
but optional) against quality (the cool-down period makes it easier to detect
and prevent spam or out-of-control bots). Advanced permissions could allow
some trusted human and bot editors to push through edits more rapidly.
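To make the workflow concrete, something like the following sketch shows how a
bot or client-side script might drive it. The endpoint paths, field names, and
token handling are all placeholders, not a settled API:

    # Hypothetical sketch of the edit group lifecycle; endpoints and field
    # names are placeholders, not a settled API design.
    import requests

    API = "https://api.fatcat.example/v0"
    session = requests.Session()
    session.headers["Authorization"] = "Bearer EDITOR_TOKEN"  # placeholder credential

    # 1. Open an edit group to batch a set of related edits together.
    editgroup = session.post(f"{API}/editgroup", json={
        "description": "Correct authorship info for works by J. Doe",
    }).json()

    # 2. Attach individual entity edits (updates, creations, merges, deletions).
    for work_id in ["rzga5b9cd7efg", "mzg7ab0cd3efg"]:   # made-up identifiers
        session.put(f"{API}/work/{work_id}", json={
            "editgroup_id": editgroup["id"],
            "contribs": [{"raw_name": "Jane Doe", "role": "author"}],
        })

    # 3. Submit the group for review; it would be auto-accepted after the
    #    cool-down period (72 hours?) if no blocking issues are raised.
    session.post(f"{API}/editgroup/{editgroup['id']}/submit")

A human editor clicking through the web interface would drive the same API
underneath, just interactively.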
Bots would need to be tuned to have appropriate edit group sizes (eg, daily
batches, instead of millions of works in a single edit) to make human QA and
reverts possible.

Data provenance and citation would be left to the edit history. In the case of
importing external databases, the expectation would be that special-purpose
bot accounts would be used. Human editors would leave edit messages to clarify
their sources.

A style guide (wiki), chat room, and discussion forum would be hosted as
separate stand-alone services for editors to propose projects and debate
process or scope changes. It would be best if these could use federated
account authorization (OAuth?) to have consistent account IDs across media.

## Edit Log

As part of the process of "accepting" an edit group, a row would be written to
an immutable, append-only log table (which internally could be a SQL table)
documenting each identifier change. This log establishes a monotonically
increasing version number for the entire corpus, and should make interaction
with other systems easier (eg, search engines, replicated databases,
alternative storage backends, notification frameworks, etc.).

## Identifiers

A fixed number of first-class "entities" would be defined, with common
behavior and schema layouts. These would all be semantic entities like "work",
"release", "container", and "person".

fatcat identifiers would be semantically meaningless fixed-length random
numbers, usually represented in case-insensitive base32 format. Each entity
type would have its own identifier namespace. Eg, 96-bit identifiers would
have 20 characters and look like:

    fcwork_rzga5b9cd7efgh04iljk
    https://fatcat.org/work/rzga5b9cd7efgh04iljk

128-bit (UUID size) would have 26 characters:

    fcwork_rzga5b9cd7efgh04iljk8f3jvz
    https://fatcat.org/work/rzga5b9cd7efgh04iljk8f3jvz

A 64-bit namespace is probably plenty though, would fit in 13 characters, and
would work with most database integer columns:

    fcwork_rzga5b9cd7efg
    https://fatcat.org/work/rzga5b9cd7efg

The idea would be to only have fatcat identifiers be used to interlink between
databases, *not* to supplant DOIs, ISBNs, handles, ARKs, and other
"registered" persistent identifiers.
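As a rough sketch, minting such an identifier is just base32-encoding a
cryptographically random bit string. This assumes a plain RFC 4648 alphabet,
lowercased; the exact alphabet (and any check characters) is an open design
choice:

    # Sketch: mint a fixed-length, semantically meaningless identifier.
    # Assumes lowercased RFC 4648 base32; the real alphabet is an open choice.
    import base64
    import secrets

    def generate_ident(bits: int = 96) -> str:
        raw = secrets.token_bytes(bits // 8)              # eg, 12 random bytes for 96 bits
        b32 = base64.b32encode(raw).decode("ascii").lower()
        return b32.rstrip("=")                            # drop base32 padding

    print("fcwork_" + generate_ident(96))    # 20 characters
    print("fcwork_" + generate_ident(128))   # 26 characters (UUID-sized)
    print("fcwork_" + generate_ident(64))    # 13 characters, fits an integer column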
## Entities and Internal Schema

Internally, identifiers would be lightweight pointers to actual metadata
objects, which can be thought of as "versions". The metadata objects
themselves would be immutable once committed; the edit process is one of
creating new objects and, if the edit is approved, pointing the identifier to
the new version. Entities would reference between themselves by identifier.

Edit objects represent a change to a single entity; edits get batched together
into edit groups (like "commits" and "pull requests" in git parlance).

SQL tables would probably look something like the following, though they would
be specific to each entity type (eg, there would be an actual `work_revision`
table, but not an actual `entity_revision` table):

    entity_id
        uuid
        current_revision

    entity_revision
        entity_id (bi-directional?)
        previous: entity_revision or none
        state: normal, redirect, deletion
        redirect_entity_id: optional
        extra: json blob
        edit_id

    edit
        mutable: boolean
        edit_group
        editor

    edit_group

Additional type-specific columns would hold actual metadata. Additional tables
(which would reference both `entity_revision` and `entity_id` foreign keys as
appropriate) would represent things like external identifiers, ordered
author/work relationships, citations between works, etc. Every revision of an
entity would require duplicating all of these associated rows, which could end
up being a large source of inefficiency, but is necessary to represent the
full history of an object.
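A very rough sketch of how this split could look for a single entity type,
written as SQLAlchemy-style declarations (table and column names here are
illustrative only, not a worked-out schema):

    # Rough sketch of the ident/revision/edit split for the "work" entity
    # type. Table and column names are illustrative only.
    from sqlalchemy import Column, Enum, ForeignKey, Integer, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class WorkIdent(Base):
        """Stable public identifier; a lightweight pointer to the current revision."""
        __tablename__ = "work_ident"
        id = Column(String(26), primary_key=True)   # fixed-length random base32 id
        current_revision = Column(Integer, ForeignKey("work_revision.id"), nullable=True)
        redirect_id = Column(String(26), ForeignKey("work_ident.id"), nullable=True)

    class WorkRevision(Base):
        """Immutable metadata object; every accepted edit creates a new row."""
        __tablename__ = "work_revision"
        id = Column(Integer, primary_key=True)
        previous_id = Column(Integer, ForeignKey("work_revision.id"), nullable=True)
        state = Column(Enum("normal", "redirect", "deletion", name="revision_state"))
        title = Column(String)                      # example type-specific column
        extra_json = Column(String)                 # catch-all metadata blob

    class WorkEdit(Base):
        """One proposed change to one identifier, batched into an edit group."""
        __tablename__ = "work_edit"
        id = Column(Integer, primary_key=True)
        ident_id = Column(String(26), ForeignKey("work_ident.id"))
        revision_id = Column(Integer, ForeignKey("work_revision.id"))
        editgroup_id = Column(Integer, ForeignKey("editgroup.id"))

    class EditGroup(Base):
        """Batch of related edits submitted and reviewed as a unit."""
        __tablename__ = "editgroup"
        id = Column(Integer, primary_key=True)
        editor = Column(String)
        description = Column(String)

The key property is that revision rows are never mutated after an edit group
is accepted; accepting an edit just repoints `current_revision` on the
identifier row.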
## Scope

Want the "scholarly web": the graph of works that cite other works. Certainly
every work that is cited more than once and every work that both cites and is
cited; "leaf nodes" and small islands might not be in scope.

Focusing on written works, with some exceptions. Expect core media (for which
we would pursue "completeness") to be:

    journal articles
    books
    conference proceedings
    technical memos
    dissertations

Probably in scope:

    reports
    magazine articles
    published poetry
    essays
    government documents
    conference presentations (slides, video)
    datasets

Probably not:

    patents
    court cases and legal documents
    manuals
    datasheets
    courses

Definitely not:

    audio recordings
    tv show episodes
    musical scores
    advertisements

Author, citation, and work disambiguation would be core tasks. Linking
pre-prints to final publication is in scope.

I'm much less interested in altmetrics, funding, and grant relationships than
most existing databases in this space.

fatcat would not include any fulltext content itself, even for cleanly
licensed (open access) works, but would have "strong" (verified) links to
fulltext content, and would include file-level metadata (like hashes and
fingerprints) to help discovery and identify content from any source. Typed
file-level links should make it faster for both humans and machines to reach
fulltext content of a given mimetype than existing redirect or landing-page
systems.

## Ontology

Loosely following FRBR, but removing the "manifestation" abstraction, and
favoring files (digital artifacts) over physical items, the primary entities
are:

    work
        type
        contributors
        subject/category
        release

    release (aka "edition", "variant")
        title
        volume/pages/issue/chapter
        open-access status
        date
        work
        publisher
        container
        contributors
        cite-text release
        identifier

    file (aka "digital artifact")
        release
        hashes
        URLs
        institution accession

    contributor
        name
        aliases
        affiliation date span
        identifier

    container
        name
        open-access policy
        peer-review policy
        aliases, acronyms
        subject/category
        identifier
        container
        publisher

    publisher
        name
        aliases, acronyms
        identifier
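For a concrete (and entirely made-up) example of how these entities could hang
together, the record for a single article might decompose roughly like this;
the field names follow the listing above and the identifiers, DOI, and hash
are all fabricated placeholders:

    # Entirely illustrative example of one article decomposed into
    # work / release / file entities; not a settled schema.
    work = {
        "ident": "fcwork_rzga5b9cd7efg",
        "type": "journal-article",
        "subject": ["oceanography"],
        "release": "fcrelease_mzg7ab0cd3efg",
    }

    release = {
        "ident": "fcrelease_mzg7ab0cd3efg",
        "work": "fcwork_rzga5b9cd7efg",
        "title": "An Example Article Title",
        "date": "2017-04-01",
        "volume": "12", "issue": "3", "pages": "100-110",
        "open_access": True,
        "container": "fccontainer_xk4pq8r2bdefg",
        "contributors": [{"contributor": "fccontributor_ab3de9f1cdefg", "role": "author"}],
        "identifiers": {"doi": "10.1234/example.5678"},
    }

    file = {
        "sha1": "f013d66c7f1817394ab3a8ba9bb77e4f7d3880cd",
        "url": "https://example.org/fulltext.pdf",
        "mimetype": "application/pdf",
        "releases": ["fcrelease_mzg7ab0cd3efg"],
    }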
## Controlled Vocabularies

Some special namespace tables and enums would probably be helpful; these
should live in the database (not requiring a database migration to update),
but should have a more controlled editing workflow... perhaps versioned in the
codebase:

- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc)
- subject categorization
- license and open access status
- work "types" (article vs. book chapter vs. proceeding, etc)
- contributor types (author, translator, illustrator, etc)
- human languages
- file mimetypes

## Unresolved Questions

How to handle translations of, eg, titles and author names? To be clear, not
translations of works (which are just separate releases).

Are bi-directional links a schema anti-pattern? Eg, should "work" point to a
primary "release" (which itself points back to the work), or should "release"
have an "is-primary" flag?

Should `identifier` and `citation` be their own entities, referencing other
entities by UUID instead of by revision? This could save a ton of database
space and churn.

Should contributor/author contact information be retained? It could be very
useful for disambiguation, but we don't want to build a huge database for
spammers or "innovative" start-up marketing.

Would general-purpose SQL databases like Postgres or MySQL scale well enough
to hold several tables with billions of entries? Right from the start there
are hundreds of millions of works and releases, many of which have dozens of
citations, many authors, and many identifiers, and then we'll have potentially
dozens of edits for each of these, which multiplies out to
`1e8 * 2e1 * 2e1 = 4e10`, or 40 billion rows in the citation table. If each
row were 32 bytes on average (uncompressed, not including index size), that
would be 1.3 TBytes on its own, larger than a common SSD disk. I think a
transactional SQL datastore is the right answer. In my experience, locking and
index rebuild times are usually the biggest scaling challenges; the
largely-immutable architecture here should mitigate locking. Hopefully few
indexes would be needed in the primary database, as user interfaces could rely
on secondary read-only search engines for more complex queries and views.

I see a tension between focus and scope creep. If a central database like
fatcat doesn't support enough fields and metadata, then it will not be
possible to completely import other corpuses, and it becomes "yet another"
partial bibliographic database. On the other hand, accepting arbitrary data
leads to other problems: sparseness increases (we have more "partial" data),
potential for redundancy is high, humans will start editing content that might
be bulk-replaced, etc.

There might be a need to support "stub" references between entities. Eg, when
adding citations from PDF extraction, the cited works are likely to be
ambiguous. We could create "stub" works to be merged/resolved later, or could
leave the citation hanging. Same with authors, containers (journals), etc.

## References and Previous Work

The closest overall analog of fatcat is [MusicBrainz][mb], a collaboratively
edited music database. [Open Library][ol] is a very similar existing service,
which exclusively contains book metadata.

[Wikidata][wd] seems to be the most successful and actively edited/developed
open bibliographic database at this time (early 2018), including the
[wikicite][wikicite] conference and related Wikimedia/Wikipedia projects.
Wikidata is a general-purpose semantic database of entities, facts, and
relationships; bibliographic metadata has become a large fraction of all
content in recent years. The focus there seems to be linking knowledge
(statements) to specific sources unambiguously. Potential advantages fatcat
would have are a focus on a specific scope (not a general-purpose database of
entities) and a goal of completeness (capturing as many works and
relationships as rapidly as possible). However, it might be better to just
pitch in to the Wikidata efforts.

The technical design of fatcat is loosely inspired by the git
branch/tag/commit/tree architecture, and specifically inspired by Oliver
Charles' "New Edit System" [blog posts][nes-blog] from 2012.

There are a whole bunch of proprietary, for-profit bibliographic databases,
including Web of Science, Google Scholar, Microsoft Academic Graph, AMiner,
Scopus, and Dimensions. There are excellent field-limited databases like dblp,
MEDLINE, and Semantic Scholar. There are some large general-purpose databases
that are not directly user-editable, including the OpenCitations corpus, CORE,
BASE, and CrossRef. I don't know of any large (more than 60 million works),
open (bulk-downloadable with a permissive or no license), field-agnostic,
user-editable corpus of scholarly publication bibliographic metadata.

[nes-blog]: https://ocharles.org.uk/blog/posts/2012-07-10-nes-does-it-better-1.html
[mb]: https://musicbrainz.org
[ol]: https://openlibrary.org
[wd]: https://wikidata.org
[wikicite]: https://meta.wikimedia.org/wiki/WikiCite_2017