\documentclass[hidelinks,10pt,twocolumn]{article}
\usepackage{simpleConference}
\usepackage[utf8]{inputenc}
\usepackage{times}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{doi}
\usepackage{amssymb}
\usepackage{url,hyperref}
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{caption}

\usepackage{datetime}
\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
\setlength{\parindent}{0pt}

\begin{document}

\title{Fatcat Reference Dataset}

\author{Martin Czygan \\
	\\
	Internet Archive \\
	San Francisco, California, USA \\
	martin@archive.org  \\
	\and
	Bryan Newbold \\
	\\
	Internet Archive \\
	San Francisco, California, USA \\
	bnewbold@archive.org  \\
	\\
}


\maketitle
\thispagestyle{empty}


\begin{abstract}
	As part of its scholarly data efforts, the Internet Archive releases a first version of a citation
	graph dataset, named \emph{refcat}, derived from scholarly publications and
	additional data sources. It is composed of data gathered by the fatcat
	cataloging project\footnote{\url{https://fatcat.wiki}}, related web-scale
	crawls targeting primary and secondary scholarly outputs, as well as metadata
	from the Open Library\footnote{\url{https://openlibrary.org}} project and
	Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
	graph consists of 1,323,423,672 citations. We release this dataset under a CC0
	Public Domain Dedication, accessible through an archive
	item\footnote{\url{https://archive.org/details/refcat_2021-07-28}}. All
	code used in the derivation process is released under an MIT
	license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
\end{abstract}

\keywords{Citation Graph, Web Archiving}

\section{Introduction}


The Internet Archive releases a first version of a citation graph dataset
derived from a raw corpus of about 2.5B references gathered from metadata and
data obtained by PDF extraction tools such as
GROBID\cite{lopez2009grobid}. Additionally, we consider integration with
metadata from Open Library and Wikipedia.
The goal of this report is to describe briefly the current contents and the
derivation of the dataset. We expect
this dataset to be iterated upon, with changes both in content and processing.

Modern citation indexes can be traced back to the early computing age, when
projects like the Science Citation Index (1955)\citep{garfield2007evolution}
were first devised, living on in existing commercial knowledge bases today.
Open alternatives followed, such as the Open Citations Corpus (OCC), started
in 2010, the first version of which contained 6,325,178 individual
references\citep{shotton2013publishing}. Other notable early projects
include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last
decade has seen the emergence of more openly available, large-scale
citation projects like Microsoft Academic\citep{sinha2015overview} or the
Initiative for Open Citations\citep{i4oc,shotton2018funders}. In 2021,
according to \citep{hutchins2021tipping}, over 1B citations are publicly
available, marking a tipping point for this category of data.

\section{Related Work}

There are a few large-scale citation datasets available today. COCI, the
``OpenCitations Index of Crossref open DOI-to-DOI citations'', was first
released on 2018-07-29. As of its most recent release\footnote{\url{https://opencitations.net/download}}, on
2021-07-29, it contains
1,094,394,688 citations across 65,835,422 bibliographic
resources\citep{peroni2020opencitations}.

The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
``a Wikimedia initiative to develop open citations and linked bibliographic
data to serve free knowledge'', continuously adds citations to its database and
as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
publications\footnote{\url{http://wikicite.org/statistics.html}}.

Microsoft Academic Graph\citep{sinha2015overview} comprises a number of
entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at
	\url{https://archive.org/details/mag-2021-06-07}} the
\emph{PaperReferences} relation contains 1,832,226,781 rows (edges) across 123,923,466
bibliographic entities.

Numerous other projects have been or are concerned with various aspects of
citation discovery and curation as part of their feature set, among them Semantic
Scholar\citep{fricke2018semantic}, CiteSeerX\citep{li2006citeseerx} or AMiner\citep{tang2016aminer}.

As mentioned in \citep{hutchins2021tipping}, the number of openly available
citations is not expected to shrink in the future.


\section{Dataset}

We release the first version of the \emph{refcat} dataset in a format used
internally for storage and to serve queries (which we call \emph{biblioref},
or \emph{bref} for short). The dataset includes metadata from fatcat, the
Open Library project and inbound links from the English Wikipedia. The fatcat
project itself aggregates data from a variety of open data sources, such as
Crossref\citep{crossref}, PubMed\citep{canese2013pubmed},
DataCite\citep{brase2009datacite}, DOAJ\citep{doaj}, dblp\citep{ley2002dblp} and others,
as well as metadata generated from analysis of data preserved at the Internet
Archive and active crawls of publication sites on the web.

The dataset is
integrated into the \href{https://fatcat.wiki}{fatcat website} and allows users
to explore inbound and outbound references\cite{fatcatguidereferencegraph}.

The format records source and target (fatcat release and work) identifiers, a
few attributes from the metadata (such as year or release stage) as well as
information about the match status and provenance.

The dataset currently contains 1,323,423,672 citations across 76,327,662
entities (55,123,635 unique source and 60,244,206 unique target work
identifiers). For 1,303,424,212 citations, or 98.49\% of the total, we have a DOI
for both source and target.
The majority of matches, 1,250,523,321 in total, are established through
identifier-based matching (DOI, PMID, PMCID, arXiv, ISBN). The remaining 72,900,351 citations are
established through fuzzy matching techniques.
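
The counts above are mutually consistent, which can be checked with a few
lines of arithmetic. The following Python sketch uses only the figures quoted
in this section; the variable names are ours and do not correspond to fields
in the dataset.

\begin{verbatim}
```python
# Counts quoted in the text of this section.
total = 1_323_423_672     # all citations in refcat
with_doi = 1_303_424_212  # DOI known for source and target
exact = 1_250_523_321     # identifier-based matches
fuzzy = 72_900_351        # fuzzy matches

# Exact and fuzzy matches together account for every citation.
assert exact + fuzzy == total

# Share of citations with a DOI on both ends, in percent.
doi_share = round(100 * with_doi / total, 2)
print(doi_share)
```
\end{verbatim}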

The majority of citations between \emph{refcat} and COCI overlap, as can be
seen in~Table~\ref{table:cocicmp}.

\begin{table}[]
	\begin{center}
		\begin{tabular}{ll}
			\toprule
			\textbf{Set}          & \textbf{Count} \\

			\midrule
			COCI (C)              & 1,094,394,688 \\
			\emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
			C $\cap$ R            & 1,007,539,966 \\
			C $\setminus$ R       & 86,854,309    \\
			R $\setminus$ C       & 295,884,246   \\ \bottomrule
		\end{tabular}
		\vspace*{2mm}
		\caption{Comparison between COCI and \emph{refcat-doi}, a subset of
			\emph{refcat} where entities have a known DOI. At least 50\% of the
			295,884,246 references only in \emph{refcat-doi} come from links
			recorded within a specific dataset provider (GBIF, DOI prefix:
			10.15468).}
		\label{table:cocicmp}
	\end{center}
\end{table}
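
The rows of Table~\ref{table:cocicmp} correspond to plain set operations over
(citing DOI, cited DOI) pairs. As a minimal sketch, assuming both datasets
have been reduced to such pairs, the comparison could look as follows. The
DOIs are made-up toy values and \texttt{compare} is a hypothetical helper, not
part of the released code.

\begin{verbatim}
```python
# Hypothetical toy edges: (citing DOI, cited DOI). Not real data.
coci = {
    ("10.1/a", "10.1/b"),
    ("10.1/a", "10.1/c"),
    ("10.1/d", "10.1/b"),
}
refcat_doi = {
    ("10.1/a", "10.1/b"),
    ("10.1/d", "10.1/b"),
    ("10.1/e", "10.1/a"),
}

def compare(c, r):
    """Return sizes of the intersection and both differences."""
    return {
        "C and R": len(c & r),
        "C minus R": len(c - r),
        "R minus C": len(r - c),
    }

result = compare(coci, refcat_doi)
```
\end{verbatim}

At the scale of the real data, in-memory sets are unlikely to be practical;
sorting both inputs and comparing them in a streaming fashion is one
alternative.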

% zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
% zstdcat -T0 uniq_34.tsv.zst | pv -l | LC_ALL=C cut -f3,4 | zstd -c -T0 > uniq_34_doi.tsv.zst
% find . -name "*.csv" | parallel -j 16 "LC_ALL=C grep -v ^oci, {} | LC_ALL=C cut -d, -f2,3" | pv -l | zstd -c -T0 > ../6741422v10_doi_only.csv.zst


\section{System Design}

The constraints for the system design are informed by the volume and the
variety of the data. The capability to run the whole graph derivation on a
single machine was a minor goal as well. In total, the raw inputs amount to a
few terabytes of textual content, mostly newline-delimited JSON. More
importantly, while the number of data fields is low, records are often only
partially populated, with hundreds of different combinations of available
field values found in the raw reference data. This is most likely caused by
aggregators passing on reference data coming from hundreds of sources, which
do not necessarily agree on a common granularity for citation data, and by
artifacts of machine learning based structured data extraction tools.

Each combination of fields may require a slightly different processing path.
For example, references with an arXiv identifier can be processed differently
from references with only a title. Over 50\% of the raw reference data comes
from a set of eight field set manifestations, as listed in
Table~\ref{table:fields}.

\begin{table}[]
	\begin{center}
		\begin{tabular}{ll}
			\toprule
			\textbf{Fields}                                                                                 & \textbf{Percentage} \\
			\midrule
			\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y}            & 14\%            \\
			\multicolumn{1}{l}{\textbf{DOI}}                                                                & 14\%            \\
			\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 5\%             \\
			\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ U $\cdot$ V $\cdot$ Y}           & 4\%             \\
			\multicolumn{1}{l}{\textbf{PMID} $\cdot$ U}                                                     & 4\%             \\
			\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ T $\cdot$ V $\cdot$ Y}           & 4\%             \\
			\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ Y}                                                    & 4\%             \\
			\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ V $\cdot$ Y}                     & 4\%             \\ \bottomrule
		\end{tabular}
		\vspace*{2mm}
		\caption{Top 8 combinations of available fields in raw reference data
			accounting for about 53\% of the total data (CN = container name, CRN =
			contrib raw name, P = pages, T = title, U = unstructured, V = volume, IS =
			issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any value. Identifiers emphasized.}
		\label{table:fields}
	\end{center}
\end{table}
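
To illustrate how a tally like Table~\ref{table:fields} can be produced, the
following Python sketch counts the combinations of non-empty fields over
newline-delimited JSON. The sample records and field names are hypothetical
and do not reflect the actual raw reference schema.

\begin{verbatim}
```python
import json
from collections import Counter

# Hypothetical raw references, one JSON document per line.
# Field names are illustrative, not the actual schema.
lines = [
    '{"doi": "10.1/a"}',
    '{"doi": "10.1/b"}',
    '{"container_name": "J", "year": 2001}',
    '{"doi": "10.1/c", "title": ""}',
]

def field_set(doc):
    """Sorted tuple of fields that carry a non-empty value."""
    empty = (None, "", [])
    return tuple(sorted(k for k, v in doc.items()
                        if v not in empty))

counts = Counter(field_set(json.loads(line)) for line in lines)
top = counts.most_common(1)[0]
```
\end{verbatim}

Treating empty strings as missing, as done here, is a design choice; it keeps
placeholder values from inflating the number of distinct combinations.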

Overall, a map-reduce style\citep{dean2010mapreduce} approach is
followed\footnote{While the operations are similar, the processing is not
	distributed but runs on a single machine. For space efficiency, zstd\citep{collet2018zstandard} is used to compress raw data and derivations.}, which allows
for some
uniformity in the overall processing. We extract (key, document) tuples (as
TSV) from the raw JSON data and sort by key. We then group documents with the
same key and apply a function on each group in order to generate
our target schema or perform
additional operations such as deduplication or fusion of matched and unmatched references.
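
The extract, sort, group and apply steps described above can be sketched in a
few lines of Python. This is an illustration under simplifying assumptions,
not the code used to derive the dataset: the functions \texttt{extract},
\texttt{reduce\_group} and \texttt{run} are hypothetical, the key is a
lowercased DOI, and an in-memory sort stands in for the external sort that
inputs of this size would need.

\begin{verbatim}
```python
import json
from itertools import groupby
from operator import itemgetter

def extract(doc):
    """Map step: derive a (key, document) tuple.
    Using a lowercased DOI as key is an assumption."""
    return (doc["doi"].lower(), doc)

def reduce_group(key, docs):
    """Reduce step: here, collapse a group to one record."""
    return {"key": key, "count": len(docs), "first": docs[0]}

def run(lines):
    pairs = [extract(json.loads(line)) for line in lines]
    # Stands in for an external, on-disk sort by key.
    pairs.sort(key=itemgetter(0))
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        docs = [doc for _, doc in group]
        out.append(reduce_group(key, docs))
    return out

lines = [
    '{"doi": "10.1/B", "n": 1}',
    '{"doi": "10.1/a", "n": 2}',
    '{"doi": "10.1/b", "n": 3}',
]
result = run(lines)
```
\end{verbatim}

Because \texttt{groupby} only merges adjacent items, the sort by key has to
come first; this mirrors the sort-then-group order described in the text.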

The key derivation can be exact (via an identifier like DOI, PMID, etc.) or
based on a value normalization, like slugifying a title string. For
identifier-based matches we can generate the target schema directly. For fuzzy matching
candidates, we pass possible match pairs through a verification procedure,
which is implemented for \emph{release entity}\footnote{\url{https://guide.fatcat.wiki/entity_release.html}.} pairs. This procedure is a
domain-dependent, rule-based verification, able to identify different versions
of a publication, preprint-published pairs and documents that
are similar by various metrics calculated over title and author fields. The fuzzy matching
approach is applied to all reference documents without an identifier (a title is
currently required).
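
As an illustration of the two kinds of key derivation, the following Python
sketch prefers an exact identifier and falls back to a slugified title. The
normalization rules shown (removing diacritics, lowercasing, keeping only
ASCII letters and digits) are an assumption, as are the function names
\texttt{slugify\_title} and \texttt{derive\_key} and the field names; the
actual rules may differ.

\begin{verbatim}
```python
import re
import unicodedata

def slugify_title(title):
    """Normalize a title into a matching key
    (one possible normalization, assumed here)."""
    # Decompose accented characters, drop combining marks.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(c for c in text
                   if not unicodedata.combining(c))
    # Lowercase and keep only ASCII letters and digits.
    return re.sub(r"[^a-z0-9]", "", text.lower())

def derive_key(ref):
    """Prefer an exact identifier; fall back to a title slug."""
    if ref.get("doi"):
        return ("doi", ref["doi"].lower())
    if ref.get("title"):
        return ("slug", slugify_title(ref["title"]))
    return None  # neither identifier nor title available
```
\end{verbatim}

A reference with neither an identifier nor a title yields no key in this
sketch, in line with the title requirement stated above.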

With a few schema conversions, fuzzy matching can be applied to Wikipedia
articles and Open Library (edition) records as well. Precision
and recall are addressed by the two stages: we are generous in the match
candidate generation phase in order to improve recall, but we are strict during
verification, in order to control precision. Quality assurance for verification is
implemented through a growing list of test cases of real examples from the catalog and
their expected or desired match status\footnote{The list can be found under:
	\url{https://gitlab.com/internetarchive/cgraph/-/blob/master/skate/testdata/verify.csv}.
	It is helpful to keep this test suite independent of any specific programming language.}.
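
To make the strict verification stage more concrete, the following Python
sketch accepts a candidate pair only if the Jaccard similarity of the author
name tokens reaches a threshold. This is a generic illustration and not the
rule set used for the dataset: the tokenization, the threshold of 0.5, the
function names and the \texttt{unmatched} label are all assumptions, and the
rules behind the match reasons in Table~\ref{table:matches} are not described
in this report.

\begin{verbatim}
```python
def jaccard(a, b):
    """Jaccard similarity of two collections; 0.0 if both empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def author_tokens(names):
    """Lowercased word tokens from a list of author names."""
    return {tok
            for name in names
            for tok in name.lower().replace(".", " ").split()}

def verify(ref, candidate, threshold=0.5):
    """Strict check of one candidate pair.
    The threshold is an assumed value."""
    score = jaccard(author_tokens(ref["authors"]),
                    author_tokens(candidate["authors"]))
    return "strong" if score >= threshold else "unmatched"
```
\end{verbatim}

Expected outcomes for pairs like these can be recorded as rows of a test file,
in the spirit of the language-independent test cases mentioned above.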


\section{Limitations and Future Work}

As with other datasets in this field, we expect this dataset to be iterated upon.

\begin{itemize}
	\item The fatcat catalog updates its metadata
	      continuously\footnote{A changelog can currently be followed here:
		      \url{https://fatcat.wiki/changelog}} and web crawls are conducted
	      regularly. Current processing pipelines cover raw reference snapshot
	      creation and derivation of the graph structure, which allows us to rerun
	      processing based on updated data as it becomes available.

	\item Metadata extraction from PDFs depends on supervised machine learning
	      models, which in turn depend on available training datasets. With additional crawls and
	      metadata available we hope to improve models used for metadata
	      extraction, improving yield and reducing data extraction artifacts in
	      the process.

	\item As of this version, a number of raw reference
	      documents remain unmatched, which means that neither exact nor fuzzy matching
	      has detected a link to a known entity. On the one
	      hand, this can hint at missing metadata. On the other hand, parts of the data
	      will contain a reference to a catalogued entity, but in a specific,
	      dense and harder to recover form.
	      Recovering these references also calls for improvements to the fuzzy matching approach.
	\item The reference dataset contains millions of URLs and their integration
	      into the graph has been implemented as a prototype. A full implementation
	      requires a few data cleanup and normalization steps.
\end{itemize}

\section{Acknowledgements}

This work is partially supported by a grant from the \emph{Andrew W. Mellon
	Foundation}.


\section{Appendix A}


A note on data quality: While we implement various data quality measures,
real-world data, especially when coming from many different sources, will contain
issues. Among other measures, we keep track of match reasons,
especially for fuzzy matching, to be able to zoom in on systematic errors
more easily (see~Table~\ref{table:matches}).

\begin{table}[]
	\footnotesize
	\captionsetup{font=normalsize}
	\begin{center}
		\begin{tabular}{@{}rlll@{}}
			\toprule
			\textbf{Count} & \textbf{Provenance} & \textbf{Status} & \textbf{Reason}      \\ \midrule
			934932865      & crossref            & exact           & doi                  \\
			151366108      & fatcat-datacite     & exact           & doi                  \\
			65345275       & fatcat-pubmed       & exact           & pmid                 \\
			48778607       & fuzzy               & strong          & jaccardauthors       \\
			42465250       & grobid              & exact           & doi                  \\
			29197902       & fatcat-pubmed       & exact           & doi                  \\
			19996327       & fatcat-crossref     & exact           & doi                  \\
			11996694       & fuzzy               & strong          & slugtitleauthormatch \\
			9157498        & fuzzy               & strong          & tokenizedauthors     \\
			3547594        & grobid              & exact           & arxiv                \\
			2310025        & fuzzy               & exact           & titleauthormatch     \\
			1496515        & grobid              & exact           & pmid                 \\
			680722         & crossref            & strong          & jaccardauthors       \\
			476331         & fuzzy               & strong          & versioneddoi         \\
			449271         & grobid              & exact           & isbn                 \\
			230645         & fatcat-crossref     & strong          & jaccardauthors       \\
			190578         & grobid              & strong          & jaccardauthors       \\
			156657         & crossref            & exact           & isbn                 \\
			123681         & fatcat-pubmed       & strong          & jaccardauthors       \\
			79328          & crossref            & exact           & arxiv                \\
			57414          & crossref            & strong          & tokenizedauthors     \\
			53480          & fuzzy               & strong          & pmiddoipair          \\
			52453          & fuzzy               & strong          & dataciterelatedid    \\
			47119          & grobid              & strong          & slugtitleauthormatch \\
			36774          & fuzzy               & strong          & arxivversion         \\ \bottomrule
			% 35311          & fuzzy               & strong          & customieeearxiv      \\
			% 33863          & grobid              & exact           & pmcid                \\
			% 23504          & crossref            & strong          & slugtitleauthormatch \\
			% 22753          & fatcat-crossref     & strong          & tokenizedauthors     \\
			% 17720          & grobid              & exact           & titleauthormatch     \\
			% 14656          & crossref            & exact           & titleauthormatch     \\
			% 14438          & grobid              & strong          & tokenizedauthors     \\
			% 7682           & fatcat-crossref     & exact           & arxiv                \\
			% 5972           & fatcat-crossref     & exact           & isbn                 \\
			% 5525           & fatcat-pubmed       & exact           & arxiv                \\
			% 4290           & fatcat-pubmed       & strong          & tokenizedauthors     \\
			% 2745           & fatcat-pubmed       & exact           & isbn                 \\
			% 2342           & fatcat-pubmed       & strong          & slugtitleauthormatch \\
			% 2273           & fatcat-crossref     & strong          & slugtitleauthormatch \\
			% 1960           & fuzzy               & exact           & workid               \\
			% 1150           & fatcat-crossref     & exact           & titleauthormatch     \\
			% 1041           & fatcat-pubmed       & exact           & titleauthormatch     \\
			% 895            & fuzzy               & strong          & figshareversion      \\
			% 317            & fuzzy               & strong          & titleartifact        \\
			% 82             & grobid              & strong          & titleartifact        \\
			% 33             & crossref            & strong          & titleartifact        \\
			% 5              & fuzzy               & strong          & custombsiundated     \\
			% 1              & fuzzy               & strong          & custombsisubdoc      \\
			% 1              & fatcat              & exact           & doi                  \\ \bottomrule
		\end{tabular}
		\vspace*{2mm}
		\caption{Table of match counts (top 25), reference provenance, match status and
			match reason. Each match reason identifier encodes a specific rule in the
			domain-dependent verification process and is included for completeness; we do not
			include the details of each rule in this report.}
		\label{table:matches}
	\end{center}
\end{table}

\bibliographystyle{abbrv}
% \bibliographystyle{plainnat}
\bibliography{refs}
\end{document}