rfc updates

author: Bryan Newbold <bnewbold@robocracy.org> 2018-01-17 14:42:55 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2018-01-17 14:42:55 -0800
commit: f1338cf0aaaacda1305c76cfac6b43b55aae4fc8 (patch)
tree: b2fc4bef722b16a71cc09d39263f97426180f810
parent: 027f1639ecf29f9e8d5e9b605e1b3ecb4e65139a (diff)
download: fatcat-f1338cf0aaaacda1305c76cfac6b43b55aae4fc8.tar.gz
fatcat-f1338cf0aaaacda1305c76cfac6b43b55aae4fc8.zip
1 files changed, 200 insertions, 4 deletions
diff --git a/rfc.md b/rfc.md
index 1b090443..fd9397ad 100644
--- a/rfc.md
+++ b/rfc.md
@@ -1,6 +1,8 @@
 
 fatcat is a half-baked idea to build an open, independent, collaboratively
-editable bibliographic database of most written works.
+editable bibliographic database of most written works, with a focus on
+published research outputs like journal articles, pre-prints, and conference
+proceedings.
 
 ## Technical Architecture
 
@@ -11,7 +13,42 @@ embedded in this back-end; as much as possible would be pushed to bots which
 could be authored and operated by anybody. A separate web interface project
 would talk to the API backend and could be developed more rapidly.
 
-## Editing Workflow and Bots
+A cronjob would make periodic database dumps, both in "full" form (all tables
+and all edit history, removing only authentication credentials) and "flat" form
+(with only the most recent version of each entity, using only persistent IDs
+between entities).
+
+A goal is to be linked-data/RDF/JSON-LD/semantic-web "compatible", but not
+necessarily "first". It should be possible to export the database in a
+relatively clean RDF form, and to fetch data in a variety of formats, but
+internally fatcat would not be backed by a triple-store, and would not be
+bound to a specific third party ontology or schema.
+
+Microservice daemons should be able to proxy between the primary API and
+standard protocols like ResourceSync and OAI-PMH, and bots could consume
+external databases in those formats.
+
+## Licensing
+
+The core fatcat database should only contain verifiable factual statements
+(which isn't to say that all statements are "true"), not creative or derived
+content.
+
+The goal is to have a very permissively licensed database: CC-0 (no rights
+reserved) if possible. Under US law, it should be possible to scrape and pull
+in factual data from other corpuses without adopting their licenses. The goal
+here isn't to avoid all attribution (provenance information will be included, and a
+large sources and acknowledgements statement should be maintained), but trying
+to manage the intersection of all upstream source licenses seems untenable, and
+creates burdens for downstream users.
+
+Special care will need to be taken around copyright and original works. I would
+propose either not accepting abstracts at all, or including them in a
+partitioned database to prevent copyright contamination. Likewise, even simple
+user-created content like lists, reviews, ratings, comments, discussion,
+documentation, etc should go in separate services.
+
+## Basic Editing Workflow and Bots
 
 Both human editors and bots would have edits go through the same API, with
 humans using either the default web interface or arbitrary integrations or
@@ -45,11 +82,20 @@ separate stand-alone services for editors to propose projects and debate
 process or scope changes. It would be best if these could use federated account
 authorization (oauth?) to have consistent account IDs across mediums.
 
+## Edit Log
+
+As part of the process of "accepting" an edit group, a row would be written to
+an immutable, append-only log table (which internally could be a SQL table)
+documenting each identifier change. This log establishes a monotonically
+increasing version number for the entire corpus, and should make interaction
+with other systems easier (eg, search engines, replicated databases,
+alternative storage backends, notification frameworks, etc).
+
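A minimal sketch of the edit log described above, using SQLite from the Python standard library. The table name `changelog`, its columns, and the `accept_editgroup` helper are all hypothetical. Writing one row per identifier change is an assumption about how "accepting" an edit group would work.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS changelog (
    version      INTEGER PRIMARY KEY AUTOINCREMENT,  -- corpus-wide version number
    editgroup_id INTEGER NOT NULL,
    ident        TEXT NOT NULL,   -- entity identifier that changed
    old_rev      TEXT,            -- NULL when the entity is newly created
    new_rev      TEXT             -- NULL when the entity is deleted
);
-- Reject UPDATE and DELETE so the table stays append-only.
CREATE TRIGGER IF NOT EXISTS changelog_no_update
BEFORE UPDATE ON changelog
BEGIN SELECT RAISE(ABORT, 'changelog is append-only'); END;
CREATE TRIGGER IF NOT EXISTS changelog_no_delete
BEFORE DELETE ON changelog
BEGIN SELECT RAISE(ABORT, 'changelog is append-only'); END;
"""


def open_log(path=":memory:"):
    """Open (or create) the hypothetical changelog database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def accept_editgroup(conn, editgroup_id, changes):
    """Append one log row per (ident, old_rev, new_rev) change.

    All rows for an edit group are written in a single transaction.
    Returns the highest version number written, or None if there
    were no changes.
    """
    version = None
    with conn:  # commits on success, rolls back on error
        for ident, old_rev, new_rev in changes:
            cur = conn.execute(
                "INSERT INTO changelog (editgroup_id, ident, old_rev, new_rev)"
                " VALUES (?, ?, ?, ?)",
                (editgroup_id, ident, old_rev, new_rev),
            )
            version = cur.lastrowid
    return version
```

A downstream consumer (search index, replica) could then remember the last `version` it processed and ask only for rows with a larger one.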
 ## Identifiers
 
 A fixed number of first class "entities" would be defined, with common
 behavior and schema layouts. These would all be semantic entities like "work",
-"edition", "container", and "person".
+"release", "container", and "person".
 
 fatcat identifiers would be semantically meaningless fixed length random numbers,
 usually represented in case-insensitive base32 format. Each entity type would
@@ -62,7 +108,8 @@ As a URL:
 
     https://fatcat.org/work/rzga5b9cd7efgh04iljk
 
-A 64 bit namespace is probably plenty though:
+A 64 bit namespace is probably plenty though, and would work with most database
+integer columns:
 
     fcwork_rzga5b9cd7efg
     https://fatcat.org/work/rzga5b9cd7efg
@@ -136,6 +183,7 @@ Probably in scope:
     government documents
     conference
     presentations (slides, video)
+    datasets
 
 Probably not:
 
@@ -152,3 +200,151 @@ Definitely not:
     musical scores
     advertisements
 
+Author, citation, and work disambiguation would be core tasks. Linking
+pre-prints to final publication is in scope.
+
+I'm much less interested in altmetrics, funding, and grant relationships than
+most existing databases in this space.
+
+fatcat would not include any fulltext content itself, even for cleanly licensed
+(open access) works, but would have "strong" (verified) links to fulltext
+content, and would include file-level metadata (like hashes and fingerprints)
+to help discovery and identify content from any source. Typed file-level links
+should make fatcat more useful for both humans and machines to quickly access
+fulltext content of a given mimetype than existing redirect or landing page
+systems.
+
+## Ontology
+
+Loosely following FRBR, but removing the "manifestation" abstraction, and
+favoring files (digital artifacts) over physical items, the primary entities
+are:
+
+    work
+        type
+        <has> contributors
+        <about> subject/category
+        <has-primary> release
+
+    release (aka "edition", "variant")
+        title
+        volume/pages/issue/chapter
+        open-access status
+        <published> date
+        <of a> work
+        <published-by> publisher
+        <published in> container
+        <has> contributors
+        <citation> citetext <to> release
+        <has> identifier
+
+    file (aka "digital artifact")
+        <of a> release
+        <has> hashes
+        <found at> URLs
+        <held-at> institution <with> accession
+
+    creator
+        name
+        <has> aliases
+        <has> affiliation <for> date span
+        <has> identifier
+
+    container
+        name
+        open-access policy
+        peer-review policy
+        <has> identifier
+        <published in> container
+        <published-by> publisher
+
+    publisher
+        name
+        <has> identifier
+
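The entity outline above could be sketched as plain Python dataclasses. This is only one possible reading: every field name and type here is an assumption, several listed attributes are omitted for brevity, and it keeps the bi-directional work/release link as drawn.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Publisher:
    name: str
    identifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Container:
    name: str
    publisher: Optional[Publisher] = None
    parent: Optional["Container"] = None  # <published in> container
    identifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Creator:
    name: str
    aliases: List[str] = field(default_factory=list)
    identifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Work:
    work_type: str
    contributors: List[Creator] = field(default_factory=list)
    primary_release: Optional["Release"] = None  # <has-primary> release


@dataclass
class Release:
    title: str
    work: Work  # <of a> work
    container: Optional[Container] = None
    contributors: List[Creator] = field(default_factory=list)
    identifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class File:
    release: Release  # <of a> release
    hashes: Dict[str, str] = field(default_factory=dict)
    urls: List[str] = field(default_factory=list)


# Usage: one work with a pre-print and a final published release.
author = Creator(name="Example Author")
journal = Container(name="Example Journal", publisher=Publisher(name="Example Press"))
work = Work(work_type="journal-article", contributors=[author])
preprint = Release(title="An Example Paper (pre-print)", work=work)
final = Release(title="An Example Paper", work=work, container=journal)
work.primary_release = final
pdf = File(release=final, hashes={"sha1": "(placeholder)"}, urls=["https://example.org/paper.pdf"])
```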
+## Controlled Vocabularies
+
+Some special namespace tables and enums would probably be helpful; these should
+live in the database (not requiring a database migration to update), but should
+have a more controlled editing workflow... perhaps versioned in the codebase:
+
+- identifier namespaces (DOI, ISBN, ISSN, ORCID, etc)
+- subject categorization
+- license and open access status
+- work types
+- contributor types (author, translator, illustrator, etc)
+
+## Unresolved Questions
+
+How to handle translations of, eg, titles and author names? To be clear, not
+translations of works (which are just separate releases).
+
+Are bi-directional links a schema anti-pattern? Eg, should "work" point to a
+primary "release" (which itself points back to the work), or should "release"
+have an "is-primary" flag?
+
+Should `identifier` and `citation` be their own entities, referencing other
+entities by UUID instead of by revision? This could save a ton of database
+space and churn.
+
+Should creator/author contact information be retained? It could be very useful
+for disambiguation, but we don't want to build a huge database for spammers or
+"innovative" start-up marketing.
+
+Would general purpose SQL databases like Postgres or MySQL scale well enough
+to hold several tables with billions of entries? Right from the start there
+are hundreds of millions of works and releases, many of which have dozens of
+citations, many authors, and many identifiers, and then we'll have potentially
+dozens of edits for each of these, which multiply out to `1e8 * 2e1 * 2e1 =
+4e10`, or 40 billion rows in the citation table. If each row was 32 bytes on
+average (uncompressed, not including index size), that would be 1.3 TByte on
+its own, larger than a common SSD disk. I think a transactional SQL datastore is
+the right answer. In my experience locking and index rebuild times are usually
+the biggest scaling challenges; the largely-immutable architecture here should
+mitigate locking. Hopefully few indexes would be needed in the primary
+database, as user interfaces could rely on secondary read-only search engines
+for more complex queries and views.
+
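The back-of-envelope estimate above can be reproduced directly. The constants are the rough figures from the paragraph (order-of-magnitude guesses, not measurements), and the variable names are only illustrative.

```python
# Rough inputs taken from the estimate above.
ENTITIES = 100_000_000       # 1e8 works and releases
CITATIONS_PER_ENTITY = 20    # "dozens of citations", taken as 2e1
EDITS_PER_ENTITY = 20        # "dozens of edits", taken as 2e1
BYTES_PER_ROW = 32           # uncompressed, not including index size

rows = ENTITIES * CITATIONS_PER_ENTITY * EDITS_PER_ENTITY
total_bytes = rows * BYTES_PER_ROW

print(f"{rows:.1e} rows, {total_bytes / 1e12:.2f} TB")
# prints "4.0e+10 rows, 1.28 TB"
```

Halving either the per-entity citation count or the edit count halves the total, so the estimate is very sensitive to those two guesses.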
+I see a tension between focus and scope creep. If a central database like
+fatcat doesn't support enough fields and metadata, then it will not be possible
+to completely import other corpuses, and this becomes "yet another" partial
+bibliographic database. On the other hand, accepting arbitrary data leads to
+other problems:
+
+## References and Previous Work
+
+The closest overall analog of fatcat is [MusicBrainz][mb], a collaboratively
+edited music database. [Open Library][ol] is a very similar existing service,
+which exclusively contains book metadata.
+
+[Wikidata][wd] seems to be the most successful and actively edited/developed
+open bibliographic database at this time (early 2018), including the
+[wikicite][wikicite] conference and related Wikimedia/Wikipedia projects.
+Wikidata is a general purpose semantic database of entities, facts, and
+relationships; bibliographic metadata has become a large fraction of all
+content in recent years. The focus there seems to be linking knowledge
+(statements) to specific sources unambiguously. Potential advantages fatcat
+would have would be a focus on a specific scope (not a general purpose database
+of entities) and a goal of completeness (capturing as many works and
+relationships as rapidly as possible). However, it might be better to just
+pitch in to the wikidata efforts.
+
+The technical design of fatcat is loosely inspired by the git
+branch/tag/commit/tree architecture, and specifically inspired by Oliver
+Charles' "New Edit System" [blog posts][nes-blog] from 2012.
+
+There are a whole bunch of proprietary, for-profit bibliographic databases,
+including Web of Science, Google Scholar, Microsoft Academic Graph, aminer,
+Scopus, and Dimensions. There are excellent field-limited databases like dblp,
+MEDLINE, and Semantic Scholar. There are some large general-purpose databases
+that are not directly user-editable, including the OpenCitations Corpus, CORE,
+BASE, and CrossRef. I don't know of any large (more than 60 million works),
+open (bulk-downloadable with permissive or no license), field agnostic,
+user-editable corpus of scholarly publication bibliographic metadata.
+
+[nes-blog]: https://ocharles.org.uk/blog/posts/2012-07-10-nes-does-it-better-1.html
+[mb]: https://musicbrainz.org
+[ol]: https://openlibrary.org
+[wd]: https://wikidata.org
+[wikicite]: https://meta.wikimedia.org/wiki/WikiCite_2017
+