\documentclass[hidelinks,10pt,twocolumn]{article}
\usepackage{simpleConference}
\usepackage[utf8]{inputenc}
\usepackage{times}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{doi}
\usepackage{amssymb}
\usepackage{url,hyperref}
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{caption}
\usepackage{datetime}

\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
\setlength{\parindent}{0pt}

\begin{document}

\title{Refcat: The Fatcat Citation Graph}

\author{Martin Czygan \\ \\ Internet Archive \\ San Francisco, California, USA \\ martin@archive.org \\
\and Bryan Newbold \\ \\ Internet Archive \\ San Francisco, California, USA \\ bnewbold@archive.org \\ \\ }

\maketitle
\thispagestyle{empty}

\begin{abstract}
As part of its scholarly data efforts, the Internet Archive releases a first version of a citation graph dataset, named \emph{refcat}, derived from scholarly publications and additional data sources. It is composed of data gathered by the fatcat cataloging project\footnote{\url{https://fatcat.wiki}}, related web-scale crawls targeting primary and secondary scholarly outputs, as well as metadata from the Open Library\footnote{\url{https://openlibrary.org}} project and Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the graph consists of over 1.3B citations. We release this dataset under a CC0 Public Domain Dedication, accessible through an archive item\footnote{\url{https://archive.org/details/refcat_2021-07-28}}. The source code used for the derivation process, including exact and fuzzy citation matching, is released under an MIT license\footnote{\url{https://gitlab.com/internetarchive/refcat}}.
\end{abstract}

\keywords{Citation Graph, Web Archiving}

\section{Introduction}

The Internet Archive releases a first version of a citation graph dataset derived from a raw corpus of about 2.5B references gathered from metadata and from data obtained with PDF extraction and annotation tools such as GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with metadata from Open Library and Wikipedia. The goal of this report is to briefly describe the current contents and the derivation of the dataset. We expect this dataset to be iterated upon, with changes both in content and processing.

According to~\citep{jinha_2010}, over 50M scholarly articles have been published between 1726 and 2009, with the rate of publications on the rise~\citep{landhuis_2016}. In 2014, a study based on academic search engines estimated that at least 114M English-language scholarly documents are accessible on the web~\citep{khabsa_giles_2014}.

Modern citation indexes can be traced back to the early computing age, when projects like the Science Citation Index (1955)~\citep{garfield2007evolution} were first devised, living on in commercial knowledge bases today. Open alternatives were started later, such as the Open Citations Corpus (OCC) in 2010, the first version of which contained 6,325,178 individual references~\citep{shotton2013publishing}. Other notable projects include CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last decade has seen the emergence of more openly available, large-scale citation projects like Microsoft Academic~\citep{sinha2015overview} or the Initiative for Open Citations\footnote{\url{https://i4oc.org}}~\citep{shotton2018funders}.
In 2021, over one billion citations are publicly available, marking a ``tipping point'' for this category of data~\citep{hutchins2021tipping}. While a paper will most often cite other papers, other citable entities exist, such as books or web links; links in turn point to a variety of targets, such as web pages, reference entries, protocols or datasets. References can be extracted manually or through more automated methods, such as metadata access and structured data extraction from full text documents, the latter offering the benefit of scalability. The completeness of bibliographic metadata ranges from documents with one or more persistent identifiers to raw, potentially unclean strings that only partially describe a scholarly artifact.

\section{Related Work}

Two typical problems which arise in the process of compiling a citation graph dataset are related to data acquisition and citation matching. Data acquisition itself can take different forms: bibliographic metadata can contain explicit reference data as provided by publishers and aggregators; this data can be relatively consistent when looked at per source, but may vary in style and comprehensiveness when looked at as a whole. Another way of acquiring bibliographic metadata is to analyze a source document, such as a PDF (or its text), directly. Tools in this category are often based on conditional random fields~\citep{lafferty2001conditional} and have been implemented in projects such as ParsCit~\citep{councill2008parscit}, Cermine~\citep{tkaczyk2014cermine}, EXCITE~\citep{hosseini2019excite} or GROBID~\citep{lopez2009grobid}.

The problem of citation matching is relatively simple when common, persistent identifiers are present in the data. Complications mount when there is \emph{Identity Uncertainty}, that is, when ``objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly''~\citep{pasula2003identity}. CiteSeer was an early project concerned with citation matching~\citep{giles1998citeseer}. A taxonomy of potential issues common in the matching process has been compiled by~\citep{olensky2016evaluation}. Additional care is required when the citation matching process is done at scale~\citep{fedoryszak2013large}. The problem of heterogeneity has been discussed in the context of datasets by~\citep{mathiak2015challenges}.

Projects centered around citations, or containing citation data as a core component, include COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI citations'', which was first released on 2018-07-29\footnote{\url{https://opencitations.net/download}} and has been updated regularly since; as of its release of 2021-07-29 it contains 1,094,394,688 citations across 65,835,422 bibliographic resources~\citep{peroni2020opencitations}. The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project, ``a Wikimedia initiative to develop open citations and linked bibliographic data to serve free knowledge'', continuously adds citations to its database; as of 2021-06-28 it tracks 253,719,394 citations across 39,994,937 publications\footnote{\url{http://wikicite.org/statistics.html}}. Microsoft Academic Graph~\citep{sinha2015overview} comprises a number of entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}, with \emph{PaperReferences} being one relation among many; as of 2021-06-07\footnote{A recent copy has been preserved at \url{https://archive.org/details/mag-2021-06-07}.} this relation contains 1,832,226,781 rows (edges) across 123,923,466 bibliographic entities.
\section{Dataset}

We release the first version of the \emph{refcat} dataset in a format used internally for storage and for serving queries (which we call \emph{biblioref}, or \emph{bref} for short). The dataset includes metadata from fatcat, the Open Library project and inbound links from the English Wikipedia. The dataset is integrated into the \href{https://fatcat.wiki}{fatcat website} and allows users to explore inbound and outbound references\footnote{\url{https://guide.fatcat.wiki/reference_graph.html}}. The format records source and target (fatcat release and work) identifiers, a few attributes from the metadata (such as year or release stage), as well as information about the match status and provenance.

The dataset currently contains 1,323,423,672 citations across 76,327,662 entities (55,123,635 unique source and 60,244,206 unique target work identifiers); for 1,303,424,212 citations, or 98.49\% of the total, we have a DOI for both source and target. The majority of matches, 1,250,523,321, are established through identifier-based matching (DOI, PMID, PMCID, ARXIV, ISBN); a further 72,900,351 citations are established through fuzzy matching techniques. The citations recorded in COCI and \emph{refcat} overlap to a large degree, as can be seen in Table~\ref{table:cocicmp}.

\begin{table}[]
\begin{center}
\begin{tabular}{lr}
\toprule
\textbf{Set} & \textbf{Count} \\
\midrule
COCI (C) & 1,094,394,688 \\
\emph{refcat-doi} (R) & 1,303,424,212 \\
C $\cap$ R & 1,007,539,966 \\
C $\setminus$ R & 86,854,309 \\
R $\setminus$ C & 295,884,246 \\
\bottomrule
\end{tabular}
\vspace*{2mm}
\caption{Comparison between COCI and \emph{refcat-doi}, a subset of \emph{refcat} where entities have a known DOI. At least 50\% of the 295,884,246 references found only in \emph{refcat-doi} come from links recorded within a specific dataset provider (GBIF, DOI prefix: 10.15468).}
\label{table:cocicmp}
\end{center}
\end{table}
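To make the \emph{bref} record layout described above concrete, the following is a minimal sketch of a single citation edge as a Go struct; the field names are illustrative assumptions and do not necessarily reflect the exact production schema.

\begin{verbatim}
// BiblioRef sketches one citation edge in
// the bref format; field names are
// illustrative, not the exact production
// schema.
type BiblioRef struct {
    SourceReleaseIdent string
    SourceWorkIdent    string
    SourceReleaseStage string
    SourceYear         string
    TargetReleaseIdent string
    TargetWorkIdent    string
    // e.g. "exact" or "strong"
    MatchStatus string
    // e.g. "doi", "jaccardauthors"
    MatchReason string
    // e.g. "crossref", "grobid", "fuzzy"
    MatchProvenance string
}
\end{verbatim}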
\section{System Design}

\subsection{Constraints}

The constraints for the system design are informed by the volume and the variety of the data. The capability to run the whole graph derivation on a single machine was an additional, minor design goal. In total, the raw inputs amount to a few terabytes of textual content, mostly newline-delimited JSON. More importantly, while the number of data fields is low, many documents are only partially populated, with hundreds of different combinations of available field values found in the raw reference data. This is most likely caused by aggregators passing on reference data coming from hundreds of sources, which do not necessarily agree on a common granularity for citation data, and by artifacts of machine-learning-based structured data extraction tools. Each combination of fields may require a slightly different processing path. For example, references with an Arxiv identifier can be processed differently from references with only a title. Over 50\% of the raw reference data is covered by a set of eight field combinations, as listed in Table~\ref{table:fields}.

\begin{table}[]
\begin{center}
\begin{tabular}{lr}
\toprule
\textbf{Fields} & \textbf{Percentage} \\
\midrule
CN $\cdot$ CRN $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y & 14\% \\
\textbf{DOI} & 14\% \\
CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y & 5\% \\
CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ U $\cdot$ V $\cdot$ Y & 4\% \\
\textbf{PMID} $\cdot$ U & 4\% \\
CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ T $\cdot$ V $\cdot$ Y & 4\% \\
CN $\cdot$ CRN $\cdot$ Y & 4\% \\
CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ V $\cdot$ Y & 4\% \\
\bottomrule
\end{tabular}
\vspace*{2mm}
\caption{Top 8 combinations of available fields in the raw reference data, accounting for about 53\% of the total (CN = container name, CRN = contrib raw name, P = pages, T = title, U = unstructured, V = volume, IS = issue, Y = year, DOI = doi, PMID = pmid). The unstructured field may contain any value. Identifiers are emphasized.}
\label{table:fields}
\end{center}
\end{table}

\subsection{Data Sources}

Reference data comes from two main sources: explicit bibliographic metadata and PDF extraction. The bibliographic metadata is taken from fatcat, which itself harvests and imports web-accessible sources such as Crossref, Pubmed, Arxiv, Datacite, DOAJ, dblp and others into its catalog (as the source permits, data is processed continuously or in batches). Reference data from PDF documents has been extracted with GROBID\footnote{GROBID \href{https://github.com/kermitt2/grobid/releases/tag/0.5.5}{v0.5.5}}, with the TEI-XML results being cached locally in a key-value store accessible with an S3 API.
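As a brief illustration of this cache access pattern, the following sketch fetches one cached TEI-XML document with the minio-go client against an S3-compatible store; the endpoint, credentials, bucket and key naming are hypothetical placeholders rather than the actual cache layout.

\begin{verbatim}
package main

import (
    "context"
    "fmt"
    "io"
    "log"

    "github.com/minio/minio-go/v7"
    "github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
    // Hypothetical endpoint and credentials
    // for the local S3-compatible cache.
    c, err := minio.New("localhost:9000",
        &minio.Options{
            Creds: credentials.NewStaticV4(
                "ACCESS", "SECRET", ""),
            Secure: false,
        })
    if err != nil {
        log.Fatal(err)
    }
    // Hypothetical bucket/key layout:
    // TEI-XML keyed by a PDF digest.
    obj, err := c.GetObject(
        context.Background(), "grobid",
        "sha1-of-pdf.tei.xml",
        minio.GetObjectOptions{})
    if err != nil {
        log.Fatal(err)
    }
    defer obj.Close()
    b, err := io.ReadAll(obj)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(len(b), "bytes of TEI-XML")
}
\end{verbatim}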
Archived PDF documents result from dedicated web-scale crawls of scholarly domains conducted with Heritrix\footnote{\url{https://github.com/internetarchive/heritrix3}} (and other crawl technologies), using a variety of seed lists targeting journal homepages, repositories, dataset providers, aggregators, web archives and other venues. A processing pipeline merges catalog data from the primary database and cached data from the key-value store and generates the set of about 2.5B reference documents, which currently serves as the input for the citation graph derivation.

\subsection{Methodology}

Overall, a map-reduce style~\citep{dean2010mapreduce} approach is followed\footnote{While the operations are similar, the processing is not distributed but runs on a single machine. For space efficiency, zstd~\citep{collet2018zstandard} is used to compress raw data and derivations.}, which allows for some uniformity in the overall processing. We extract (key, document) tuples (as TSV) from the raw JSON data and sort by key. We then group documents sharing a key and apply a function to each group in order to generate our target schema, or to perform additional operations such as deduplication or the fusion of matched and unmatched references for indexing.

The key derivation can be exact (via an identifier like DOI, PMID, etc.) or based on a value normalization, like ``slugifying'' a title string. For identifier-based matches we can generate the target schema directly. For fuzzy matching candidates, we pass possible match pairs through a verification procedure, which is implemented for \emph{release entity}\footnote{\url{https://guide.fatcat.wiki/entity_release.html}.} pairs. This procedure is a domain-dependent, rule-based verification, able to identify different versions of a publication, preprint-published pairs, and documents that are similar according to various metrics calculated over title and author fields. The fuzzy matching approach is applied to all reference documents without an identifier (a title is currently required).

We currently implement performance-sensitive parts in Go\footnote{\url{https://golang.org/}}, with various processing stages (e.g. conversion, map, reduce, ...) represented by separate command-line tools. A thin task orchestration layer using the luigi framework\footnote{\url{https://github.com/spotify/luigi}~\citep{bernhardsson2018rouhani}, which has been used in various scientific pipeline applications, for example~\citep{schulz2016use, erdmann2017design, lampa2019scipipe, czygan2014design} and others.} allows for experimentation in the pipeline and for single-command derivations, as data dependencies are encoded with the help of the orchestrator. Within the tasks, we also utilize classic platform tools such as \emph{sort}~\citep{mcilroy1971research}. With a few schema conversions, fuzzy matching can be applied to Wikipedia articles and Open Library (edition) records as well.

Precision and recall are balanced across the two stages: we are generous during match candidate generation in order to improve recall, but strict during verification in order to control precision. Quality assurance for verification is implemented through a growing list of test cases, consisting of real examples from the catalog and their expected or desired match status\footnote{The list can be found at \url{https://gitlab.com/internetarchive/refcat/-/blob/master/skate/testdata/verify.csv}. It is helpful to keep this test suite independent of any specific programming language.}.
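To illustrate the map-reduce style processing described above, the following is a minimal sketch of a ``slug'' key derivation over a title and a group-by-key pass over key-sorted TSV lines; the function names and the exact normalization rules are illustrative assumptions, not the production implementation.

\begin{verbatim}
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
    "unicode"
)

// slugTitle derives a simplified key from a
// title; the normalization rules here are an
// illustrative assumption.
func slugTitle(title string) string {
    var b strings.Builder
    for _, r := range strings.ToLower(title) {
        if unicode.IsLetter(r) ||
            unicode.IsDigit(r) {
            b.WriteRune(r)
        }
    }
    return b.String()
}

// groupFunc is applied to each group of
// documents sharing the same key.
type groupFunc func(key string, docs []string)

// groupByKey reads key-sorted TSV lines
// ("key<TAB>doc") from stdin and calls f
// once per key.
func groupByKey(f groupFunc) error {
    scanner := bufio.NewScanner(os.Stdin)
    var key string
    var docs []string
    for scanner.Scan() {
        line := scanner.Text()
        parts := strings.SplitN(line, "\t", 2)
        if len(parts) != 2 {
            continue
        }
        if parts[0] != key && len(docs) > 0 {
            f(key, docs)
            docs = nil
        }
        key = parts[0]
        docs = append(docs, parts[1])
    }
    if len(docs) > 0 {
        f(key, docs)
    }
    return scanner.Err()
}

func main() {
    // Key extraction would emit lines like
    // slugTitle(title) + "\t" + doc; here we
    // only report the size of each group.
    fmt.Println(slugTitle("Refcat: The Fatcat Citation Graph"))
    _ = groupByKey(func(k string, docs []string) {
        fmt.Printf("%s\t%d\n", k, len(docs))
    })
}
\end{verbatim}

In this setting, the per-group function would either emit the target schema directly (for identifier keys) or produce candidate pairs that are handed to the verification step (for normalized keys).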
\section{Limitations and Future Work}

As with other datasets in this field, we expect this dataset to be iterated upon.

\begin{itemize}
    \item The fatcat catalog updates its metadata continuously\footnote{A changelog can currently be followed here: \url{https://fatcat.wiki/changelog}.} and web crawls are conducted regularly. Current processing pipelines cover raw reference snapshot creation and derivation of the graph structure, which allows us to rerun processing as updated data becomes available.
    \item Metadata extraction from PDFs depends on supervised machine learning models, which in turn depend on available training datasets. With additional crawls and metadata becoming available, we hope to improve the models used for metadata extraction, improving yield and reducing extraction artifacts in the process.
    \item As of this version, a number of raw reference documents remain unmatched, which means that neither exact nor fuzzy matching has detected a link to a known entity. On the one hand, this can hint at missing metadata; on the other hand, parts of the data will contain a reference to a catalogued entity, but in a specific, dense and harder-to-recover form. Future work will therefore also include improvements to the fuzzy matching approach.
    \item The reference dataset contains millions of URLs, and their integration into the graph has been implemented as a prototype. A full implementation requires a few additional data cleanup and normalization steps.
\end{itemize}

\section{Acknowledgements}

This work is partially supported by a grant from the \emph{Andrew W. Mellon Foundation}.

\section{Appendix A}

A note on data quality: while we implement various data quality measures, real-world data, especially when coming from many different sources, will contain issues. Among other measures, we keep track of match reasons, especially for fuzzy matching, to be able to zoom in on systematic errors more easily (see Table~\ref{table:matches}).
\begin{table}[]
\footnotesize
\captionsetup{font=normalsize}
\begin{center}
\begin{tabular}{@{}rlll@{}}
\toprule
\textbf{Count} & \textbf{Provenance} & \textbf{Status} & \textbf{Reason} \\
\midrule
934932865 & crossref & exact & doi \\
151366108 & fatcat-datacite & exact & doi \\
65345275 & fatcat-pubmed & exact & pmid \\
48778607 & fuzzy & strong & jaccardauthors \\
42465250 & grobid & exact & doi \\
29197902 & fatcat-pubmed & exact & doi \\
19996327 & fatcat-crossref & exact & doi \\
11996694 & fuzzy & strong & slugtitleauthormatch \\
9157498 & fuzzy & strong & tokenizedauthors \\
3547594 & grobid & exact & arxiv \\
2310025 & fuzzy & exact & titleauthormatch \\
1496515 & grobid & exact & pmid \\
680722 & crossref & strong & jaccardauthors \\
476331 & fuzzy & strong & versioneddoi \\
449271 & grobid & exact & isbn \\
230645 & fatcat-crossref & strong & jaccardauthors \\
190578 & grobid & strong & jaccardauthors \\
156657 & crossref & exact & isbn \\
123681 & fatcat-pubmed & strong & jaccardauthors \\
79328 & crossref & exact & arxiv \\
57414 & crossref & strong & tokenizedauthors \\
53480 & fuzzy & strong & pmiddoipair \\
52453 & fuzzy & strong & dataciterelatedid \\
47119 & grobid & strong & slugtitleauthormatch \\
36774 & fuzzy & strong & arxivversion \\
% 35311 & fuzzy & strong & customieeearxiv \\
% 33863 & grobid & exact & pmcid \\
% 23504 & crossref & strong & slugtitleauthormatch \\
% 22753 & fatcat-crossref & strong & tokenizedauthors \\
% 17720 & grobid & exact & titleauthormatch \\
% 14656 & crossref & exact & titleauthormatch \\
% 14438 & grobid & strong & tokenizedauthors \\
% 7682 & fatcat-crossref & exact & arxiv \\
% 5972 & fatcat-crossref & exact & isbn \\
% 5525 & fatcat-pubmed & exact & arxiv \\
% 4290 & fatcat-pubmed & strong & tokenizedauthors \\
% 2745 & fatcat-pubmed & exact & isbn \\
% 2342 & fatcat-pubmed & strong & slugtitleauthormatch \\
% 2273 & fatcat-crossref & strong & slugtitleauthormatch \\
% 1960 & fuzzy & exact & workid \\
% 1150 & fatcat-crossref & exact & titleauthormatch \\
% 1041 & fatcat-pubmed & exact & titleauthormatch \\
% 895 & fuzzy & strong & figshareversion \\
% 317 & fuzzy & strong & titleartifact \\
% 82 & grobid & strong & titleartifact \\
% 33 & crossref & strong & titleartifact \\
% 5 & fuzzy & strong & custombsiundated \\
% 1 & fuzzy & strong & custombsisubdoc \\
% 1 & fatcat & exact & doi \\
\bottomrule
\end{tabular}
\vspace*{2mm}
\caption{Match counts (top 25), by reference provenance, match status and match reason. Provenance can currently name either the raw origin (e.g. \emph{crossref}) or the method (e.g. \emph{fuzzy}). The match reason identifiers encode specific rules in the domain-dependent verification process and are included for completeness; we do not describe the details of each rule in this report.}
\label{table:matches}
\end{center}
\end{table}

\bibliographystyle{abbrv}
% \bibliographystyle{plainnat}
\bibliography{refs}

\end{document}