## Project Goals and Ecosystem Niche

The Internet Archive has two primary use cases for Fatcat:

- Tracking the "completeness" of our holdings against all known published
  works. In particular, this lets us monitor progress, identify gaps, and
  prioritize further collection work.
- Serving as a public-facing catalog and access mechanism for our open access
  holdings.

In the larger ecosystem, Fatcat could also provide:

- A work-level (as opposed to title-level) archival dashboard: what fraction of
  all published works are preserved in archives? [KBART][], [CLOCKSS][],
  [Portico][], and other preservation networks don't provide granular metadata
- A collaborative, independent, non-commercial, fully-open, field-agnostic,
  "completeness"-oriented catalog of scholarly metadata
- Unified (centralized) foundation for discovery and access across repositories
  and archives: discovery projects can focus on user experience instead of
  building their own catalog from scratch
- Research corpus for meta-science, with an emphasis on availability and
  reproducibility (metadata corpus itself is open access, and file-level hashes
  control for content drift)
- Foundational infrastructure for distributed digital preservation
- On-ramp for non-traditional digital works (web-native and "grey literature")
  into the scholarly web

[KBART]: https://thekeepers.org/
[CLOCKSS]: https://clockss.org
[Portico]: http://www.portico.org

## Scope

What types of works should be included in the catalog?

The goal is to capture the "scholarly web": the graph of written works that
cite other works. Any work that is both cited more than once and cites more
than one other work in the catalog is likely to be in scope. "Leaf nodes" and
small islands of intra-cited works may or may not be in scope.
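The citation heuristic above can be sketched as a simple graph filter. This is only an illustration under stated assumptions: the function name `likely_in_scope` and the `(citing, cited)` edge-list input are hypothetical and not part of Fatcat, and the thresholds simply restate "cited more than once and cites more than one other work".

```python
from collections import defaultdict


def likely_in_scope(citations):
    """Return works that both cite and are cited by more than one other work.

    `citations` is an iterable of (citing, cited) identifier pairs.
    Hypothetical helper for illustration; not part of Fatcat.
    """
    cites = defaultdict(set)      # work -> set of works it cites
    cited_by = defaultdict(set)   # work -> set of works citing it
    for citing, cited in citations:
        if citing == cited:
            continue  # ignore self-citations
        cites[citing].add(cited)
        cited_by[cited].add(citing)

    works = set(cites) | set(cited_by)
    return {
        work
        for work in works
        if len(cites.get(work, ())) > 1 and len(cited_by.get(work, ())) > 1
    }


# Toy citation graph: only "a" and "b" have two or more links in each direction.
edges = [
    ("a", "b"), ("a", "c"),
    ("b", "c"), ("b", "d"),
    ("c", "a"), ("d", "b"), ("e", "a"),
]
in_scope = likely_in_scope(edges)
```

Works such as `c`, `d`, and `e` in this toy graph are the "leaf nodes" whose inclusion would remain a judgment call.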

Fatcat does not include any fulltext content itself, even for clearly licensed
open access works, but does have verified hyperlinks to fulltext content, and
includes file-level metadata (hashes and fingerprints) to help identify content
from any source. File-level URLs with context ("repository", "publisher",
"webarchive") should let both humans and machines reach the fulltext content of
a given mimetype more quickly than existing redirect or landing page systems
do. So another factor in deciding scope is whether a work has "digital fixity":
whether it can be contained in immutable files or captured by web archives.
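As a rough sketch of what file-level metadata with contextual URLs might look like, the following builds a record from a file's bytes and checks whether a later copy still matches it. The field names (`size`, `mimetype`, `urls`, `rel`), the helper names, and the example URLs are assumptions made for illustration, not Fatcat's actual schema.

```python
import hashlib


def file_metadata(data, mimetype, urls):
    """Build an illustrative file-level record (hypothetical field names)."""
    return {
        "size": len(data),
        "mimetype": mimetype,
        # Several digests, so copies found elsewhere can be matched by hash.
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Each URL carries a context label such as "repository" or "webarchive".
        "urls": [{"url": url, "rel": rel} for url, rel in urls],
    }


def matches(record, data):
    """Return True if `data` is byte-identical to the file the record describes."""
    return (
        len(data) == record["size"]
        and hashlib.sha256(data).hexdigest() == record["sha256"]
    )


original = b"%PDF-1.4 example bytes"
record = file_metadata(
    original,
    "application/pdf",
    [
        ("https://example.org/paper.pdf", "publisher"),
        ("https://archive.example.org/paper.pdf", "webarchive"),
    ],
)
```

Comparing a freshly downloaded copy with `matches(record, data)` is one way to notice content drift: any change to the bytes changes the digest.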

## References and Previous Work

The closest overall analog of Fatcat is [MusicBrainz][mb], a collaboratively
edited music database. [Open Library][ol] is a very similar existing service,
which exclusively contains book metadata.

[Wikidata][wd] seems to be the most successful and actively edited and
developed open bibliographic database at this time (early 2018), supported by
the [WikiCite][wikicite] conference and related Wikimedia/Wikipedia projects.
Wikidata is a general purpose semantic database of entities, facts, and
relationships; bibliographic metadata has become a large fraction of all
content in recent years. The focus there seems to be linking knowledge
(statements) to specific sources unambiguously. Potential advantages Fatcat has
are a focus on a specific scope (not a general-purpose database of entities)
and a goal of completeness (capturing as many works and relationships as
rapidly as possible). With so much overlap, the two efforts might merge in the
future.

The technical design of Fatcat is loosely inspired by the git
branch/tag/commit/tree architecture, and specifically inspired by Oliver
Charles' "New Edit System" [blog posts][nes-blog] from 2012.

There are a number of proprietary, for-profit bibliographic databases,
including Web of Science, Google Scholar, Microsoft Academic Graph, AMiner,
Scopus, and Dimensions. There are excellent field-limited databases like dblp,
MEDLINE, and Semantic Scholar. Large, general-purpose databases also exist that
are not directly user-editable, including the OpenCitations Corpus, CORE, BASE,
and Crossref. We do not know of any large (more than 60 million works), open
(bulk-downloadable with permissive or no license), field-agnostic,
user-editable corpus of scholarly publication bibliographic metadata.

[nes-blog]: https://ocharles.org.uk/blog/posts/2012-07-10-nes-does-it-better-1.html
[mb]: https://musicbrainz.org
[ol]: https://openlibrary.org
[wd]: https://wikidata.org
[wikicite]: https://meta.wikimedia.org/wiki/WikiCite_2017

## Further Reading

"From ISIS to CouchDB: Databases and Data Models for Bibliographic Records" by Luciano G. Ramalho. Code4Lib Journal, 2013. <https://journal.code4lib.org/articles/4893>

"Representing bibliographic data in JSON". GitHub README file, 2017. <https://github.com/rdmpage/bibliographic-metadata-json>

"Citation Style Language", <https://citationstyles.org/>

"Functional Requirements for Bibliographic Records", Wikipedia article, <https://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records>

OpenCitations and I4OC, <http://opencitations.net/>, <https://i4oc.org/>