summaryrefslogtreecommitdiffstats
path: root/python
diff options
context:
space:
mode:
Diffstat (limited to 'python')
-rw-r--r--python/fatcat_web/routes.py10
-rw-r--r--python/fatcat_web/templates/about.html275
-rw-r--r--python/fatcat_web/templates/rfc.html190
3 files changed, 291 insertions, 184 deletions
diff --git a/python/fatcat_web/routes.py b/python/fatcat_web/routes.py
index 9a52382e..7d9b708f 100644
--- a/python/fatcat_web/routes.py
+++ b/python/fatcat_web/routes.py
@@ -455,15 +455,19 @@ def page_server_down(e):
return render_template('503.html'), 503
@app.route('/', methods=['GET'])
-def homepage():
+def page_home():
return render_template('home.html')
@app.route('/about', methods=['GET'])
-def aboutpage():
+def page_about():
return render_template('about.html')
+@app.route('/rfc', methods=['GET'])
+def page_rfc():
+ return render_template('rfc.html')
+
@app.route('/search', methods=['GET'])
-def search_redirect():
+def page_search_redirect():
return redirect("/release/search")
@app.route('/robots.txt', methods=['GET'])
diff --git a/python/fatcat_web/templates/about.html b/python/fatcat_web/templates/about.html
index 85f100b7..89c07859 100644
--- a/python/fatcat_web/templates/about.html
+++ b/python/fatcat_web/templates/about.html
@@ -1,190 +1,103 @@
{% extends "base.html" %}
{% block body %}
-<h2 id="fatcat-design-document-rfc">fatcat Design Document (RFC)</h2>
-<p><em>Contact: Bryan Newbold <a href="mailto:bnewbold@archive.org">bnewbold@archive.org</a>. Last updated 2018-08-10</em></p>
-<p>fatcat is a proposed open bibliographic catalog of written works. The scope of works is somewhat flexible, with a focus on published research outputs like journal articles, pre-prints, and conference proceedings. Records are collaboratively editable, versioned, available in bulk form, and include URL-agnostic file-level metadata.</p>
-<p>fatcat is currently used internally at the Internet Archive, but interested folks are welcome to contribute to design and development.</p>
-<h2 id="goals-and-ecosystem-niche">Goals and Ecosystem Niche</h2>
-<p>For the Internet Archive use case, fatcat has two primary use cases:</p>
-<ul>
-<li>Track the &quot;completeness&quot; of our holdings against all known published works. In particular, allow us to monitor and prioritize further collection work.</li>
-<li>Be a public-facing catalog and access mechanism for our open access holdings.</li>
-</ul>
-<p>In the larger ecosystem, fatcat could also provide:</p>
-<ul>
-<li>A work-level (as opposed to title-level) archival dashboard: what fraction of all published works are preserved in archives? KBART, CLOCKSS, Portico, and other preservations don't provide granular metadata</li>
-<li>A collaborative, independent, non-commercial, fully-open, field-agnostic, &quot;completeness&quot;-oriented catalog of scholarly metadata</li>
-<li>Unified (centralized) foundation for discovery and access across repositories and archives: discovery projects can focus on user experience instead of building their own catalog from scratch</li>
-<li>Research corpus for meta-science, with an emphasis on availability and reproducibility (metadata corpus itself is open access, and file-level hashes control for content drift)</li>
-<li>Foundational infrastructure for distributed digital preservation</li>
-<li>On-ramp for non-traditional digital works (&quot;grey literature&quot;) into the scholarly web</li>
-</ul>
-<h2 id="technical-architecture">Technical Architecture</h2>
-<p>The canonical backend datastore exposes a microservice-like HTTP API, which could be extended with gRPC or GraphQL interfaces. The initial datastore is a transactional SQL database, but this implementation detail is abstracted by the API.</p>
-<p>As little &quot;application logic&quot; as possible should be embedded in this back-end; as much as possible would be pushed to bots which could be authored and operated by anybody. A separate web interface project talks to the API backend and can be developed more rapidly with less concern about data loss or corruption.</p>
-<p>A cronjob will creae periodic database dumps, both in &quot;full&quot; form (all tables and all edit history, removing only authentication credentials) and &quot;flattened&quot; form (with only the most recent version of each entity).</p>
-<p>A goal is to be linked-data/RDF/JSON-LD/semantic-web &quot;compatible&quot;, but not necessarily &quot;first&quot;. It should be possible to export the database in a relatively clean RDF form, and to fetch data in a variety of formats, but internally fatcat will not be backed by a triple-store, and will not be bound to a rigid third-party ontology or schema.</p>
-<p>Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third party bots could ingest or synchronize the databse in those formats.</p>
-<h2 id="licensing">Licensing</h2>
-<p>The core fatcat database should only contain verifiable factual statements (which isn't to say that all statements are &quot;true&quot;), not creative or derived content.</p>
-<p>The goal is to have a very permissively licensed database: CC-0 (no rights reserved) if possible. Under US law, it should be possible to scrape and pull in factual data from other corpuses without adopting their licenses. The goal here isn't to avoid attribution (progeny information will be included, and a large sources and acknowledgments statement should be maintained and shipped with bulk exports), but trying to manage the intersection of all upstream source licenses seems untenable, and creates burdens for downstream users and developers.</p>
-<p>Special care will need to be taken around copyright, &quot;original work&quot; by editors, and contributions that raise privacy concerns. If abstracts are stored at all, they should be in a partitioned database table to prevent copyright contamination. Likewise, even simple user-created content like lists, reviews, ratings, comments, discussion, documentation, etc., should live in separate services.</p>
-<h2 id="basic-editing-workflow-and-bots">Basic Editing Workflow and Bots</h2>
-<p>Both human editors and bots should have edits go through the same API, with humans using either the default web interface, integrations, or client software.</p>
-<p>The normal workflow is to create edits (or updates, merges, deletions) on individual entities. Individual changes are bundled into an &quot;edit group&quot; of related edits (eg, correcting authorship info for multiple works related to a single author). When ready, the editor would &quot;submit&quot; the edit group for review. During the review period, human editors vote and bots can perform automated checks. During this period the editor can make tweaks if necessary. After some fixed time period (72 hours?) with no changes and no blocking issues, the edit group would be auto-accepted if no merge conflicts have be created by other edits to the same entities. This process balances editing labor (reviews are easy, but optional) against quality (cool-down period makes it easier to detect and prevent spam or out-of-control bots). More sophisticated roles and permissions could allow some certain humans and bots to push through edits more rapidly (eg, importing new works from a publisher API).</p>
-<p>Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and reverts managable.</p>
-<p>Data progeny and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing external databases, the expectation is that special-purpose bot accounts are be used, and tag timestamps and external identifiers in the edit metadata. Human editors would leave edit messages to clarify their sources.</p>
-<p>A style guide (wiki) and discussion forum would be hosted as separate stand-alone services for editors to propose projects and debate process or scope changes. These services should have unified accounts and logins (oauth?) to have consistent account IDs across all mediums.</p>
-<h2 id="global-edit-changelog">Global Edit Changelog</h2>
-<p>As part of the process of &quot;accepting&quot; an edit group, a row would be written to an immutable, append-only log table (which internally could be a SQL table) documenting each identifier change. This changelog establishes a monotonically increasing version number for the entire corpus, and should make interaction with other systems easier (eg, search engines, replicated databases, alternative storage backends, notification frameworks, etc.).</p>
-<h2 id="identifiers">Identifiers</h2>
-<p>A fixed number of first-class &quot;entities&quot; are defined, with common behavior and schema layouts. These are all be semantic entities like &quot;work&quot;, &quot;release&quot;, &quot;container&quot;, and &quot;creator&quot;.</p>
-<p>fatcat identifiers are semantically meaningless fixed-length random numbers, usually represented in case-insensitive base32 format. Each entity type has its own identifier namespace.</p>
-<p>128-bit (UUID size) identifiers encode as 26 characters (but note that not all such strings decode to valid UUIDs), and in the backend can be serialized in UUID columns:</p>
-<pre><code>work_rzga5b9cd7efgh04iljk8f3jvz
-https://fatcat.wiki/work/rzga5b9cd7efgh04iljk8f3jvz</code></pre>
-<p>In comparison, 96-bit identifiers would have 20 characters and look like:</p>
-<pre><code>work_rzga5b9cd7efgh04iljk
-https://fatcat.wiki/work/rzga5b9cd7efgh04iljk</code></pre>
-<p>A 64-bit namespace would probably be large enought, and would work with database Integer columns:</p>
-<pre><code>work_rzga5b9cd7efg
-https://fatcat.wiki/work/rzga5b9cd7efg</code></pre>
-<p>The idea would be to only have fatcat identifiers be used to interlink between databases, <em>not</em> to supplant DOIs, ISBNs, handle, ARKs, and other &quot;registered&quot; persistent identifiers.</p>
-<h2 id="entities-and-internal-schema">Entities and Internal Schema</h2>
-<p>Internally, identifiers would be lightweight pointers to &quot;revisions&quot; of an entity. Revisions are stored in their complete form, not as a patch or difference; if comparing to distributed version control systems, this is the git model, not the mercurial model.</p>
-<p>The entity revisions are immutable once accepted; the editting process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by <em>identifier</em> not <em>revision number</em>. Identifier pointers also support (versioned) deletion and redirects (for merging entities).</p>
-<p>Edit objects represent a change to a single entity; edits get batched together into edit groups (like &quot;commits&quot; and &quot;pull requests&quot; in git parlance).</p>
-<p>SQL tables would probably look something like the (but specific to each entity type, with tables like <code>work_revision</code> not <code>entity_revision</code>):</p>
-<pre><code>entity_ident
- id (uuid)
- current_revision (entity_revision foreign key)
- redirect_id (optional; points to another entity_ident)
-
-entity_revision
- revision_id
- &lt;entity-specific fields&gt;
- extra: json blob for schema evolution
-
-entity_edit
- timestamp
- editgroup_id
- ident (entity_ident foreign key)
- new_revision (entity_revision foreign key)
- previous_revision (optional; points to entity_revision)
- extra: json blob for progeny metadata
-
-editgroup
- editor_id
- description
- extra: json blob for progeny metadata</code></pre>
-<p>Additional entity-specific columns would hold actual metadata. Additional tables (which would reference both <code>entity_revision</code> and <code>entity_id</code> foreign keys as appropriate) would represent things like authorship relationships (creator/release), citations between works, etc. Every revision of an entity would require duplicating all of these associated rows, which could end up being a large source of inefficiency, but is necessary to represent the full history of an object.</p>
-<h2 id="scope">Scope</h2>
-<p>The goal is to capture the &quot;scholarly web&quot;: the graph of written works that cite other works. Any work that is both cited more than once and cites more than one other work in the catalog is very likely to be in scope. &quot;Leaf nodes&quot; and small islands of intra-cited works may or may not be in scope.</p>
-<p>Overall focus is on written works, with some exceptions. The expected core focus (for which we would pursue &quot;completeness&quot;) is:</p>
-<pre><code>journal articles
-academic books
-conference proceedings
-technical memos
-dissertations
-monographs
-well-researched blog posts
-web pages (that have citations)
-&quot;white papers&quot;</code></pre>
-<p>Possibly in scope:</p>
-<pre><code>reports
-magazine articles
-essays
-notable mailing list postings
-government documents
-presentations (slides, video)
-datasets
-well-researched wiki pages
-patents</code></pre>
-<p>Probably not:</p>
-<pre><code>court cases and legal documents
-newspaper articles
-social media
-manuals
-datasheets
-courses
-published poetry</code></pre>
-<p>Definitely not:</p>
-<pre><code>audio recordings
-tv show episodes
-musical scores
-advertisements</code></pre>
-<p>Author, citation, and work disambiguation would be core tasks. Linking pre-prints to final publication is in scope.</p>
-<p>I'm much less interested in altmetrics, funding, and grant relationships than most existing databases in this space.</p>
-<p>fatcat would not include any fulltext content itself, even for cleanly licensed (open access) works, but would have &quot;strong&quot; (verified) links to fulltext content, and would include file-level metadata (like hashes and fingerprints) to help discovery and identify content from any source. File-level URLs with context (&quot;repository&quot;, &quot;author-homepage&quot;, &quot;web-archive&quot;) should make fatcat more useful for both humans and machines to quickly access fulltext content of a given mimetype than existing redirect or landing page systems. So another factor in deciding scope is whether a work has &quot;digital fixity&quot; and can be contained in a single immutable file.</p>
-<h2 id="ontology">Ontology</h2>
-<p>Loosely following FRBR (Functional Requirements for Bibliographic Records), but removing the &quot;manifestation&quot; abstraction, and favoring files (digital artifacts) over physical items, the primary entities are:</p>
-<pre><code>work
- &lt;a stub, for grouping releases&gt;
-
-release (aka &quot;edition&quot;, &quot;variant&quot;)
- title
- volume/pages/issue/chapter
- media/formfactor
- publication/peer-review status
- language
- &lt;published&gt; date
- &lt;variant-of&gt; work
- &lt;published-in&gt; container
- &lt;has-contributors&gt; creator
- &lt;citation-to&gt; release
- &lt;has&gt; identifier
-
-file (aka &quot;digital artifact&quot;)
- &lt;instantiates&gt; release
- hashes/checksums
- mimetype
- &lt;found-at&gt; URLs
-
-creator (aka &quot;author&quot;)
- name
- identifiers
- aliases
-
-container (aka &quot;venue&quot;, &quot;serial&quot;, &quot;title&quot;)
- name
- open-access policy
- peer-review policy
- &lt;has&gt; aliases, acronyms
- &lt;about&gt; subject/category
- &lt;has&gt; identifier
- &lt;published-in&gt; container
- &lt;published-by&gt; publisher</code></pre>
-<h2 id="controlled-vocabularies">Controlled Vocabularies</h2>
-<p>Some special namespace tables and enums would probably be helpful; these could live in the database (not requiring a database migration to update), but should have more controlled editing workflow... perhaps versioned in the codebase:</p>
+<h1>About Fatcat</h1>
+
+<p>This is versioned, public-editable catalog of research publications: journal
+articles, conference proceedings, pre-prints, blog posts, and so forth. The
+goal is to improve the state of preservation and access to these works by
+providing a manifest of file content versions and locations. This service does
+not directly contain full-text content itself, but provides basic access for human
+and machine readers through links to copies in web archives, institutional and
+other repositories, and the public web.
+
+<p>A few features set fatcat apart from similar indexing and discovery services
+in the scholarly communications space:
+
<ul>
-<li>identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifers themselves)</li>
-<li>subject categorization</li>
-<li>license and open access status</li>
-<li>work &quot;types&quot; (article vs. book chapter vs. proceeding, etc)</li>
-<li>contributor types (author, translator, illustrator, etc)</li>
-<li>human languages</li>
-<li>file mimetypes</li>
+ <li>inclusion of archival file-level metadata (content digests) in addition
+ to URLs, which allows automated verification ("do I have the right copy"),
+ reveals content-drift over time, and enables efficient distribution of
+ content through the ecosystem
+ <li>native support for "post-PDF" digital media, including archival web
+ captures and datasets, as well as content stored on the distributed web
+ <li>data model that captures the work/edition (aka, "release") distinction,
+ grouping pre-print, post-review, published, re-published, and updated
+ versions of a work together
+ <li>public editing interface, allowing metadata corrections and improvements
+ from individuals and bots in addition to automated imports from authoritative
+ sources
+ <li>focus on providing a stable API and corpus (making integration with
+ diverse user-facing applications simple), while enabling full replication and
+ mirroring of the corpus to reduce the risks of centralized control
</ul>
-<p>These could also be enforced by QA bots that review all editgroups.</p>
-<h2 id="unresolved-questions">Unresolved Questions</h2>
-<p>How to handle translations of, eg, titles and author names? To be clear, not translations of works (which are just separate releases), these are more like aliases or &quot;originally known as&quot;.</p>
-<p>Are bi-directional links a schema anti-pattern? Eg, should &quot;work&quot; point to a &quot;primary release&quot; (which itself points back to the work)?</p>
-<p>Should <code>identifier</code> and <code>citation</code> be their own entities, referencing other entities by UUID instead of by revision? Not sure if this would increase or decrease database resource utilization.</p>
-<p>Should contributor/author affiliation and contact information be retained? It could be very useful for disambiguation, but we don't want to build a huge database for spammers or &quot;innovative&quot; start-up marketing.</p>
-<p>Can general-purpose SQL databases like Postgres or MySQL scale well enough to hold several tables with billions of entity revisions? Right from the start there are hundreds of millions of works and releases, many of which having dozens of citations, many authors, and many identifiers, and then we'll have potentially dozens of edits for each of these, which multiply out to <code>1e8 * 2e1 * 2e1 = 4e10</code>, or 40 billion rows in the citation table. If each row was 32 bytes on average (uncompressed, not including index size), that would be 1.3 TByte on its own, larger than common SSD disks. I do think a transactional SQL datastore is the right answer. In my experience locking and index rebuild times are usually the biggest scaling challenges; the largely-immutable architecture here should mitigate locking. Hopefully few indexes would be needed in the primary database, as user interfaces could rely on secondary read-only search engines for more complex queries and views.</p>
-<p>I see a tension between focus and scope creep. If a central database like fatcat doesn't support enough fields and metadata, then it will not be possible to completely import other corpuses, and this becomes &quot;yet another&quot; partial bibliographic database. On the other hand, accepting arbitrary data leads to other problems: sparseness increases (we have more &quot;partial&quot; data), potential for redundancy is high, humans will start editing content that might be bulk-replaced, etc.</p>
-<p>There might be a need to support &quot;stub&quot; references between entities. Eg, when adding citations from PDF extraction, the cited works are likely to be ambiguous. Could create &quot;stub&quot; works to be merged/resolved later, or could leave the citation hanging. Same with authors, containers (journals), etc.</p>
-<h2 id="references-and-previous-work">References and Previous Work</h2>
-<p>The closest overall analog of fatcat is <a href="https://musicbrainz.org">MusicBrainz</a>, a collaboratively edited music database. <a href="https://openlibrary.org">Open Library</a> is a very similar existing service, which exclusively contains book metadata.</p>
-<p><a href="https://wikidata.org">Wikidata</a> seems to be the most successful and actively edited/developed open bibliographic database at this time (early 2018), including the <a href="https://meta.wikimedia.org/wiki/WikiCite_2017">wikicite</a> conference and related Wikimedia/Wikipedia projects. Wikidata is a general purpose semantic database of entities, facts, and relationships; bibliographic metadata has become a large fraction of all content in recent years. The focus there seems to be linking knowledge (statements) to specific sources unambiguously. Potential advantages fatcat would have would be a focus on a specific scope (not a general-purpose database of entities) and a goal of completeness (capturing as many works and relationships as rapidly as possible). However, it might be better to just pitch in to the wikidata efforts.</p>
-<p>The technical design of fatcat is loosely inspired by the git branch/tag/commit/tree architecture, and specifically inspired by Oliver Charles' &quot;New Edit System&quot; <a href="https://ocharles.org.uk/blog/posts/2012-07-10-nes-does-it-better-1.html">blog posts</a> from 2012.</p>
-<p>There are a whole bunch of proprietary, for-profit bibliographic databases, including Web of Science, Google Scholar, Microsoft Academic Graph, aminer, Scopus, and Dimensions. There are excellent field-limited databases like dblp, MEDLINE, and Semantic Scholar. There are some large general-purpose databases that are not directly user-editable, including the OpenCitation corpus, CORE, BASE, and CrossRef. I don't know of any large (more than 60 million works), open (bulk-downloadable with permissive or no license), field agnostic, user-editable corpus of scholarly publication bibliographic metadata.</p>
-<h2 id="rfc-changelog">RFC Changelog</h2>
+
+<p>This service aspires to be a piece of sustainable, long-term, non-profit,
+free-software, collaborative, open digital infrastructure. It is primarily
+designed to support the <i>archival</i> and <i>dissemination</i> (in terms of
+access) roles of scholarly communication. It may also support the
+<i>registration</i> role (establishing precedence and authorship), but
+explicitly does not aid with <i>certification</i> of content (particularly
+curation; this service is "universal" and happily includes retracted and
+"predatory" content), and is not intended to be used for <i>evaluation</i> of
+individuals, institutions, or venues.
+
+<h3>Sources of Metadata</h3>
+
+The source of all bibliographic information is recorded in edit history
+metadata, which allows the progeny of all fields to be reconstructed. A few
+major sources are worth highlighting here:
+
<ul>
-<li><strong>2017-12-16</strong>: early notes</li>
-<li><strong>2018-01-17</strong>: initial RFC document</li>
-<li><strong>2018-08-10</strong>: updates from implementation work</li>
+ <li>Release metadata from <b>Crossref</b> (a major non-profit DOI registrar), via their public
+ <a href="https://github.com/CrossRef/rest-api-doc">REST API</a>
+ <li>Release metadata and linked fulltext content from NIH <b>Pubmed</b> (a US national repository) and <b><a href="https://arxiv.org">arXiv.org</a></b> (a large pre-print repository hosted at Cornell University)
+ <li>Release metadata and linked public domain fulltext content the <b>JSTOR</b> Early Journal Content collection
+ <li>Creator (author) names and de-duplication from <b>ORCID</b> (an author identifier service), via their annual public data releases
+ <li>Journal title metadata from <b>DOAJ</b>, <b>ISSN ROAD</b>, and <b>SHERPA/RoMEO</b>
+ <li>Full-text URL lists from <b><a href="https://core.ac.uk">CORE</a></b>,
+ <b><a href="http://unpaywall.org">Unpaywall</a></b>,
+ <b><a href="https://www.semanticscholar.org">Semantic Scholar</a></b>,
+ <b><a href="https://citeseerx.ist.psu.edu">CiteseerX</a></b>,
+ and <b><a href="https://www.microsoft.com/en-us/research/project/academic">Microsoft Academic Graph</a></b>.
+ <li>The <a href="https://guide.{{ config.FATCAT_DOMAIN }}/sources.html">guide</a> lists more major sources
</ul>
+Many thanks for the hard work of all these projects, institutions, and
+individuals!
+
+
+<h3>Support and Acknowledgments</h3>
+
+<p>Fatcat is a project of the <b><a href="https://archive.org">Internet Archive</a></b>,
+a US-based non-profit digital library, well known for it's
+<a href="https://web.archive.org">Wayback Machine</a> web archive and
+<a href="https://openlibrary.org">Open Library</a> book digitization and
+lending service. All fatcat databases and services run on Internet Archive
+servers in California, and a copy of most fulltext content is stored on the
+Archive's collections and/or web archives.
+
+<p>Development of fatcat and related web harvesting, indexing, and preservation
+efforts at the Archive have been partially funded (for the 2018-2019 period) by
+a generous grant from the <b>Mellon Foundation</b>
+(<a href="https://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/">"Long-tail Open Access Journal Preservation"</a>).
+Fatcat supports this work both by tracking which open access works are not
+getting preserved in any known archive, and providing minimum-viable indexing
+and access mechanisms for long-tail works which otherwise would lack them.
+
+<p>The service would not technically be possible without hundreds of free
+software components and the efforts of their individual and organizational
+maintainers, more than can be listed here (but see the source code for full
+lists). A few major components include the PostgreSQL database, Elasticsearch
+search engine, Flask python web framework, Rust programming language, Diesel
+database library, Swagger/OpenAPI code generators, Kafka distributed log,
+Ansible configuration management tool, and Ubuntu GNU/Linux operating system
+distribution.
+
+<p>The front-page photo of a large feline with a cup of coffee is CC-0
+licensed. The name "fat cat" can be interpreted as short for "large catalog",
+as the service aspires to be a <i>universal</i> (complete) catalog of the
+digital scholarly record.
+<!-- TODO: attribution anyways -->
+
{% endblock %}
diff --git a/python/fatcat_web/templates/rfc.html b/python/fatcat_web/templates/rfc.html
new file mode 100644
index 00000000..85f100b7
--- /dev/null
+++ b/python/fatcat_web/templates/rfc.html
@@ -0,0 +1,190 @@
+{% extends "base.html" %}
+{% block body %}
+
+<h2 id="fatcat-design-document-rfc">fatcat Design Document (RFC)</h2>
+<p><em>Contact: Bryan Newbold <a href="mailto:bnewbold@archive.org">bnewbold@archive.org</a>. Last updated 2018-08-10</em></p>
+<p>fatcat is a proposed open bibliographic catalog of written works. The scope of works is somewhat flexible, with a focus on published research outputs like journal articles, pre-prints, and conference proceedings. Records are collaboratively editable, versioned, available in bulk form, and include URL-agnostic file-level metadata.</p>
+<p>fatcat is currently used internally at the Internet Archive, but interested folks are welcome to contribute to design and development.</p>
+<h2 id="goals-and-ecosystem-niche">Goals and Ecosystem Niche</h2>
+<p>For the Internet Archive use case, fatcat has two primary use cases:</p>
+<ul>
+<li>Track the &quot;completeness&quot; of our holdings against all known published works. In particular, allow us to monitor and prioritize further collection work.</li>
+<li>Be a public-facing catalog and access mechanism for our open access holdings.</li>
+</ul>
+<p>In the larger ecosystem, fatcat could also provide:</p>
+<ul>
+<li>A work-level (as opposed to title-level) archival dashboard: what fraction of all published works are preserved in archives? KBART, CLOCKSS, Portico, and other preservations don't provide granular metadata</li>
+<li>A collaborative, independent, non-commercial, fully-open, field-agnostic, &quot;completeness&quot;-oriented catalog of scholarly metadata</li>
+<li>Unified (centralized) foundation for discovery and access across repositories and archives: discovery projects can focus on user experience instead of building their own catalog from scratch</li>
+<li>Research corpus for meta-science, with an emphasis on availability and reproducibility (metadata corpus itself is open access, and file-level hashes control for content drift)</li>
+<li>Foundational infrastructure for distributed digital preservation</li>
+<li>On-ramp for non-traditional digital works (&quot;grey literature&quot;) into the scholarly web</li>
+</ul>
+<h2 id="technical-architecture">Technical Architecture</h2>
+<p>The canonical backend datastore exposes a microservice-like HTTP API, which could be extended with gRPC or GraphQL interfaces. The initial datastore is a transactional SQL database, but this implementation detail is abstracted by the API.</p>
+<p>As little &quot;application logic&quot; as possible should be embedded in this back-end; as much as possible would be pushed to bots which could be authored and operated by anybody. A separate web interface project talks to the API backend and can be developed more rapidly with less concern about data loss or corruption.</p>
+<p>A cronjob will creae periodic database dumps, both in &quot;full&quot; form (all tables and all edit history, removing only authentication credentials) and &quot;flattened&quot; form (with only the most recent version of each entity).</p>
+<p>A goal is to be linked-data/RDF/JSON-LD/semantic-web &quot;compatible&quot;, but not necessarily &quot;first&quot;. It should be possible to export the database in a relatively clean RDF form, and to fetch data in a variety of formats, but internally fatcat will not be backed by a triple-store, and will not be bound to a rigid third-party ontology or schema.</p>
+<p>Microservice daemons should be able to proxy between the primary API and standard protocols like ResourceSync and OAI-PMH, and third party bots could ingest or synchronize the databse in those formats.</p>
+<h2 id="licensing">Licensing</h2>
+<p>The core fatcat database should only contain verifiable factual statements (which isn't to say that all statements are &quot;true&quot;), not creative or derived content.</p>
+<p>The goal is to have a very permissively licensed database: CC-0 (no rights reserved) if possible. Under US law, it should be possible to scrape and pull in factual data from other corpuses without adopting their licenses. The goal here isn't to avoid attribution (progeny information will be included, and a large sources and acknowledgments statement should be maintained and shipped with bulk exports), but trying to manage the intersection of all upstream source licenses seems untenable, and creates burdens for downstream users and developers.</p>
+<p>Special care will need to be taken around copyright, &quot;original work&quot; by editors, and contributions that raise privacy concerns. If abstracts are stored at all, they should be in a partitioned database table to prevent copyright contamination. Likewise, even simple user-created content like lists, reviews, ratings, comments, discussion, documentation, etc., should live in separate services.</p>
+<h2 id="basic-editing-workflow-and-bots">Basic Editing Workflow and Bots</h2>
+<p>Both human editors and bots should have edits go through the same API, with humans using either the default web interface, integrations, or client software.</p>
+<p>The normal workflow is to create edits (or updates, merges, deletions) on individual entities. Individual changes are bundled into an &quot;edit group&quot; of related edits (eg, correcting authorship info for multiple works related to a single author). When ready, the editor would &quot;submit&quot; the edit group for review. During the review period, human editors vote and bots can perform automated checks. During this period the editor can make tweaks if necessary. After some fixed time period (72 hours?) with no changes and no blocking issues, the edit group would be auto-accepted if no merge conflicts have be created by other edits to the same entities. This process balances editing labor (reviews are easy, but optional) against quality (cool-down period makes it easier to detect and prevent spam or out-of-control bots). More sophisticated roles and permissions could allow some certain humans and bots to push through edits more rapidly (eg, importing new works from a publisher API).</p>
+<p>Bots need to be tuned to have appropriate edit group sizes (eg, daily batches, instead of millions of works in a single edit) to make human QA review and reverts managable.</p>
+<p>Data progeny and source references are captured in the edit metadata, instead of being encoded in the entity data model itself. In the case of importing external databases, the expectation is that special-purpose bot accounts are be used, and tag timestamps and external identifiers in the edit metadata. Human editors would leave edit messages to clarify their sources.</p>
+<p>A style guide (wiki) and discussion forum would be hosted as separate stand-alone services for editors to propose projects and debate process or scope changes. These services should have unified accounts and logins (oauth?) to have consistent account IDs across all mediums.</p>
+<h2 id="global-edit-changelog">Global Edit Changelog</h2>
+<p>As part of the process of &quot;accepting&quot; an edit group, a row would be written to an immutable, append-only log table (which internally could be a SQL table) documenting each identifier change. This changelog establishes a monotonically increasing version number for the entire corpus, and should make interaction with other systems easier (eg, search engines, replicated databases, alternative storage backends, notification frameworks, etc.).</p>
+<h2 id="identifiers">Identifiers</h2>
+<p>A fixed number of first-class &quot;entities&quot; are defined, with common behavior and schema layouts. These are all be semantic entities like &quot;work&quot;, &quot;release&quot;, &quot;container&quot;, and &quot;creator&quot;.</p>
+<p>fatcat identifiers are semantically meaningless fixed-length random numbers, usually represented in case-insensitive base32 format. Each entity type has its own identifier namespace.</p>
+<p>128-bit (UUID size) identifiers encode as 26 characters (but note that not all such strings decode to valid UUIDs), and in the backend can be serialized in UUID columns:</p>
+<pre><code>work_rzga5b9cd7efgh04iljk8f3jvz
+https://fatcat.wiki/work/rzga5b9cd7efgh04iljk8f3jvz</code></pre>
+<p>In comparison, 96-bit identifiers would have 20 characters and look like:</p>
+<pre><code>work_rzga5b9cd7efgh04iljk
+https://fatcat.wiki/work/rzga5b9cd7efgh04iljk</code></pre>
+<p>A 64-bit namespace would probably be large enought, and would work with database Integer columns:</p>
+<pre><code>work_rzga5b9cd7efg
+https://fatcat.wiki/work/rzga5b9cd7efg</code></pre>
+<p>The idea would be to only have fatcat identifiers be used to interlink between databases, <em>not</em> to supplant DOIs, ISBNs, handle, ARKs, and other &quot;registered&quot; persistent identifiers.</p>
+<h2 id="entities-and-internal-schema">Entities and Internal Schema</h2>
+<p>Internally, identifiers would be lightweight pointers to &quot;revisions&quot; of an entity. Revisions are stored in their complete form, not as a patch or difference; if comparing to distributed version control systems, this is the git model, not the mercurial model.</p>
+<p>The entity revisions are immutable once accepted; the editting process involves the creation of new entity revisions and, if the edit is approved, pointing the identifier to the new revision. Entities cross-reference between themselves by <em>identifier</em> not <em>revision number</em>. Identifier pointers also support (versioned) deletion and redirects (for merging entities).</p>
+<p>Edit objects represent a change to a single entity; edits get batched together into edit groups (like &quot;commits&quot; and &quot;pull requests&quot; in git parlance).</p>
+<p>SQL tables would probably look something like the (but specific to each entity type, with tables like <code>work_revision</code> not <code>entity_revision</code>):</p>
+<pre><code>entity_ident
+ id (uuid)
+ current_revision (entity_revision foreign key)
+ redirect_id (optional; points to another entity_ident)
+
+entity_revision
+ revision_id
+ &lt;entity-specific fields&gt;
+ extra: json blob for schema evolution
+
+entity_edit
+ timestamp
+ editgroup_id
+ ident (entity_ident foreign key)
+ new_revision (entity_revision foreign key)
+ previous_revision (optional; points to entity_revision)
+ extra: json blob for progeny metadata
+
+editgroup
+ editor_id
+ description
+ extra: json blob for progeny metadata</code></pre>
+<p>Additional entity-specific columns would hold actual metadata. Additional tables (which would reference both <code>entity_revision</code> and <code>entity_id</code> foreign keys as appropriate) would represent things like authorship relationships (creator/release), citations between works, etc. Every revision of an entity would require duplicating all of these associated rows, which could end up being a large source of inefficiency, but is necessary to represent the full history of an object.</p>
+<h2 id="scope">Scope</h2>
+<p>The goal is to capture the &quot;scholarly web&quot;: the graph of written works that cite other works. Any work that is both cited more than once and cites more than one other work in the catalog is very likely to be in scope. &quot;Leaf nodes&quot; and small islands of intra-cited works may or may not be in scope.</p>
+<p>Overall focus is on written works, with some exceptions. The expected core focus (for which we would pursue &quot;completeness&quot;) is:</p>
+<pre><code>journal articles
+academic books
+conference proceedings
+technical memos
+dissertations
+monographs
+well-researched blog posts
+web pages (that have citations)
+&quot;white papers&quot;</code></pre>
+<p>Possibly in scope:</p>
+<pre><code>reports
+magazine articles
+essays
+notable mailing list postings
+government documents
+presentations (slides, video)
+datasets
+well-researched wiki pages
+patents</code></pre>
+<p>Probably not:</p>
+<pre><code>court cases and legal documents
+newspaper articles
+social media
+manuals
+datasheets
+courses
+published poetry</code></pre>
+<p>Definitely not:</p>
+<pre><code>audio recordings
+tv show episodes
+musical scores
+advertisements</code></pre>
+<p>Author, citation, and work disambiguation would be core tasks. Linking pre-prints to final publication is in scope.</p>
+<p>I'm much less interested in altmetrics, funding, and grant relationships than most existing databases in this space.</p>
+<p>fatcat would not include any fulltext content itself, even for cleanly licensed (open access) works, but would have &quot;strong&quot; (verified) links to fulltext content, and would include file-level metadata (like hashes and fingerprints) to help discovery and identify content from any source. File-level URLs with context (&quot;repository&quot;, &quot;author-homepage&quot;, &quot;web-archive&quot;) should make fatcat more useful for both humans and machines to quickly access fulltext content of a given mimetype than existing redirect or landing page systems. So another factor in deciding scope is whether a work has &quot;digital fixity&quot; and can be contained in a single immutable file.</p>
+<h2 id="ontology">Ontology</h2>
+<p>Loosely following FRBR (Functional Requirements for Bibliographic Records), but removing the &quot;manifestation&quot; abstraction, and favoring files (digital artifacts) over physical items, the primary entities are:</p>
+<pre><code>work
+ &lt;a stub, for grouping releases&gt;
+
+release (aka &quot;edition&quot;, &quot;variant&quot;)
+ title
+ volume/pages/issue/chapter
+ media/formfactor
+ publication/peer-review status
+ language
+ &lt;published&gt; date
+ &lt;variant-of&gt; work
+ &lt;published-in&gt; container
+ &lt;has-contributors&gt; creator
+ &lt;citation-to&gt; release
+ &lt;has&gt; identifier
+
+file (aka &quot;digital artifact&quot;)
+ &lt;instantiates&gt; release
+ hashes/checksums
+ mimetype
+ &lt;found-at&gt; URLs
+
+creator (aka &quot;author&quot;)
+ name
+ identifiers
+ aliases
+
+container (aka &quot;venue&quot;, &quot;serial&quot;, &quot;title&quot;)
+ name
+ open-access policy
+ peer-review policy
+ &lt;has&gt; aliases, acronyms
+ &lt;about&gt; subject/category
+ &lt;has&gt; identifier
+ &lt;published-in&gt; container
+ &lt;published-by&gt; publisher</code></pre>
+<h2 id="controlled-vocabularies">Controlled Vocabularies</h2>
+<p>Some special namespace tables and enums would probably be helpful; these could live in the database (not requiring a database migration to update), but should have more controlled editing workflow... perhaps versioned in the codebase:</p>
+<ul>
+<li>identifier namespaces (DOI, ISBN, ISSN, ORCID, etc; but not the identifers themselves)</li>
+<li>subject categorization</li>
+<li>license and open access status</li>
+<li>work &quot;types&quot; (article vs. book chapter vs. proceeding, etc)</li>
+<li>contributor types (author, translator, illustrator, etc)</li>
+<li>human languages</li>
+<li>file mimetypes</li>
+</ul>
+<p>These could also be enforced by QA bots that review all editgroups.</p>
+<h2 id="unresolved-questions">Unresolved Questions</h2>
+<p>How to handle translations of, eg, titles and author names? To be clear, not translations of works (which are just separate releases), these are more like aliases or &quot;originally known as&quot;.</p>
+<p>Are bi-directional links a schema anti-pattern? Eg, should &quot;work&quot; point to a &quot;primary release&quot; (which itself points back to the work)?</p>
+<p>Should <code>identifier</code> and <code>citation</code> be their own entities, referencing other entities by UUID instead of by revision? Not sure if this would increase or decrease database resource utilization.</p>
+<p>Should contributor/author affiliation and contact information be retained? It could be very useful for disambiguation, but we don't want to build a huge database for spammers or &quot;innovative&quot; start-up marketing.</p>
+<p>Can general-purpose SQL databases like Postgres or MySQL scale well enough to hold several tables with billions of entity revisions? Right from the start there are hundreds of millions of works and releases, many of which having dozens of citations, many authors, and many identifiers, and then we'll have potentially dozens of edits for each of these, which multiply out to <code>1e8 * 2e1 * 2e1 = 4e10</code>, or 40 billion rows in the citation table. If each row was 32 bytes on average (uncompressed, not including index size), that would be 1.3 TByte on its own, larger than common SSD disks. I do think a transactional SQL datastore is the right answer. In my experience locking and index rebuild times are usually the biggest scaling challenges; the largely-immutable architecture here should mitigate locking. Hopefully few indexes would be needed in the primary database, as user interfaces could rely on secondary read-only search engines for more complex queries and views.</p>
+<p>I see a tension between focus and scope creep. If a central database like fatcat doesn't support enough fields and metadata, then it will not be possible to completely import other corpuses, and this becomes &quot;yet another&quot; partial bibliographic database. On the other hand, accepting arbitrary data leads to other problems: sparseness increases (we have more &quot;partial&quot; data), potential for redundancy is high, humans will start editing content that might be bulk-replaced, etc.</p>
+<p>There might be a need to support &quot;stub&quot; references between entities. Eg, when adding citations from PDF extraction, the cited works are likely to be ambiguous. Could create &quot;stub&quot; works to be merged/resolved later, or could leave the citation hanging. Same with authors, containers (journals), etc.</p>
+<h2 id="references-and-previous-work">References and Previous Work</h2>
+<p>The closest overall analog of fatcat is <a href="https://musicbrainz.org">MusicBrainz</a>, a collaboratively edited music database. <a href="https://openlibrary.org">Open Library</a> is a very similar existing service, which exclusively contains book metadata.</p>
+<p><a href="https://wikidata.org">Wikidata</a> seems to be the most successful and actively edited/developed open bibliographic database at this time (early 2018), including the <a href="https://meta.wikimedia.org/wiki/WikiCite_2017">wikicite</a> conference and related Wikimedia/Wikipedia projects. Wikidata is a general purpose semantic database of entities, facts, and relationships; bibliographic metadata has become a large fraction of all content in recent years. The focus there seems to be linking knowledge (statements) to specific sources unambiguously. Potential advantages fatcat would have would be a focus on a specific scope (not a general-purpose database of entities) and a goal of completeness (capturing as many works and relationships as rapidly as possible). However, it might be better to just pitch in to the wikidata efforts.</p>
+<p>The technical design of fatcat is loosely inspired by the git branch/tag/commit/tree architecture, and specifically inspired by Oliver Charles' &quot;New Edit System&quot; <a href="https://ocharles.org.uk/blog/posts/2012-07-10-nes-does-it-better-1.html">blog posts</a> from 2012.</p>
+<p>There are a whole bunch of proprietary, for-profit bibliographic databases, including Web of Science, Google Scholar, Microsoft Academic Graph, aminer, Scopus, and Dimensions. There are excellent field-limited databases like dblp, MEDLINE, and Semantic Scholar. There are some large general-purpose databases that are not directly user-editable, including the OpenCitation corpus, CORE, BASE, and CrossRef. I don't know of any large (more than 60 million works), open (bulk-downloadable with permissive or no license), field agnostic, user-editable corpus of scholarly publication bibliographic metadata.</p>
+<h2 id="rfc-changelog">RFC Changelog</h2>
+<ul>
+<li><strong>2017-12-16</strong>: early notes</li>
+<li><strong>2018-01-17</strong>: initial RFC document</li>
+<li><strong>2018-08-10</strong>: updates from implementation work</li>
+</ul>
+
+{% endblock %}