From da8911b029f06023d5d8f8aad3cc845583e6d708 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 20 Sep 2018 12:53:23 -0700
Subject: copy some notes to guide

---
 guide/src/overview.md | 101 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 100 insertions(+), 1 deletion(-)

diff --git a/guide/src/overview.md b/guide/src/overview.md
index bc08ce1e..8e6279ed 100644
--- a/guide/src/overview.md
+++ b/guide/src/overview.md
@@ -1,3 +1,102 @@
 # Fatcat Overview
 
-For now, see the [RFC](https://fatcat.wiki).
+fatcat is an open bibliographic catalog of written works. The scope of works
+is somewhat flexible, with a focus on published research outputs like journal
+articles, pre-prints, and conference proceedings. Records are collaboratively
+editable, versioned, available in bulk form, and include URL-agnostic
+file-level metadata.
+
+fatcat is currently used internally at the Internet Archive, but interested
+folks are welcome to contribute to design and development.
+
+## Goals and Ecosystem Niche
+
+At the Internet Archive, fatcat has two primary use cases:
+
+- Track the "completeness" of our holdings against all known published works.
+  In particular, allow us to monitor and prioritize further collection work.
+- Be a public-facing catalog and access mechanism for our open access holdings.
+
+In the larger ecosystem, fatcat could also provide:
+
+- A work-level (as opposed to title-level) archival dashboard: what fraction of
+  all published works are preserved in archives? KBART, CLOCKSS, Portico, and
+  other preservation networks don't provide granular metadata
+- A collaborative, independent, non-commercial, fully-open, field-agnostic,
+  "completeness"-oriented catalog of scholarly metadata
+- Unified (centralized) foundation for discovery and access across repositories
+  and archives: discovery projects can focus on user experience instead of
+  building their own catalog from scratch
+- Research corpus for meta-science, with an emphasis on availability and
+  reproducibility (metadata corpus itself is open access, and file-level hashes
+  control for content drift)
+- Foundational infrastructure for distributed digital preservation
+- On-ramp for non-traditional digital works ("grey literature") into the
+  scholarly web
+
+## Scope
+
+The goal is to capture the "scholarly web": the graph of written works that
+cite other works. Any work that is both cited more than once and cites more
+than one other work in the catalog is very likely to be in scope. "Leaf nodes"
+and small islands of intra-cited works may or may not be in scope.
+
+fatcat would not include any fulltext content itself, even for cleanly licensed
+(open access) works, but would have "strong" (verified) links to fulltext
+content, and would include file-level metadata (like hashes and fingerprints)
+to help discovery and identify content from any source. File-level URLs with
+context ("repository", "author-homepage", "web-archive") should make fatcat
+more useful for both humans and machines to quickly access fulltext content of
+a given mimetype than existing redirect or landing page systems. So another
+factor in deciding scope is whether a work has "digital fixity" and can be
+contained in a single immutable file.
+
+## References and Previous Work
+
+The closest overall analog of fatcat is [MusicBrainz][mb], a collaboratively
+edited music database. [Open Library][ol] is a very similar existing service,
+which exclusively contains book metadata.
+
+[Wikidata][wd] seems to be the most successful and actively edited/developed
+open bibliographic database at this time (early 2018), including the
+[wikicite][wikicite] conference and related Wikimedia/Wikipedia projects.
+Wikidata is a general-purpose semantic database of entities, facts, and
+relationships; bibliographic metadata has become a large fraction of all
+content in recent years. The focus there seems to be linking knowledge
+(statements) to specific sources unambiguously. Potential advantages of fatcat
+would be a focus on a specific scope (not a general-purpose database
+of entities) and a goal of completeness (capturing as many works and
+relationships as rapidly as possible). However, it might be better to just
+pitch in to the Wikidata efforts.
+
+The technical design of fatcat is loosely inspired by the git
+branch/tag/commit/tree architecture, and specifically inspired by Oliver
+Charles' "New Edit System" [blog posts][nes-blog] from 2012.
+
+There are a whole bunch of proprietary, for-profit bibliographic databases,
+including Web of Science, Google Scholar, Microsoft Academic Graph, aminer,
+Scopus, and Dimensions. There are excellent field-limited databases like dblp,
+MEDLINE, and Semantic Scholar. There are some large general-purpose databases
+that are not directly user-editable, including the OpenCitations Corpus, CORE,
+BASE, and CrossRef. I don't know of any large (more than 60 million works),
+open (bulk-downloadable with permissive or no license), field-agnostic,
+user-editable corpus of scholarly publication bibliographic metadata.
+
+[nes-blog]: https://ocharles.org.uk/blog/posts/2012-07-10-nes-does-it-better-1.html
+[mb]: https://musicbrainz.org
+[ol]: https://openlibrary.org
+[wd]: https://wikidata.org
+[wikicite]: https://meta.wikimedia.org/wiki/WikiCite_2017
+
+## Further Reading
+
+"From ISIS to CouchDB: Databases and Data Models for Bibliographic Records" by Luciano G. Ramalho. code4lib, 2013.
+
+"Representing bibliographic data in JSON". GitHub README file, 2017.
+
+"Citation Style Language",
+
+"Functional Requirements for Bibliographic Records", Wikipedia article,
+
+OpenCitations and I4OC ,
+