\documentclass[hidelinks,10pt,twocolumn]{article}
\usepackage{simpleConference}
\usepackage[utf8]{inputenc}
\usepackage{times}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{doi}
\usepackage{amssymb}
\usepackage{url,hyperref}
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{caption}

\usepackage{datetime}
\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
\setlength{\parindent}{0pt}

\begin{document}

\title{REFCAT: The Fatcat Citation Graph}

\author{Martin Czygan \\
	\\
	Internet Archive \\
	San Francisco, California, USA \\
	martin@archive.org  \\
	\and
	Bryan Newbold \\
	\\
	Internet Archive \\
	San Francisco, California, USA \\
	bnewbold@archive.org  \\
	\\
}


\maketitle
\thispagestyle{empty}


\begin{abstract}
	As part of its scholarly data efforts, the Internet Archive releases a first version of a citation
	graph dataset, named \emph{refcat}, derived from scholarly publications and
	additional data sources. It is composed of data gathered by the fatcat
	cataloging project\footnote{\url{https://fatcat.wiki}}, related web-scale
	crawls targeting primary and secondary scholarly outputs, as well as metadata
	from the Open Library\footnote{\url{https://openlibrary.org}} project and
	Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
	graph consists of over 1.3B citations. We release this dataset under a CC0
	Public Domain Dedication, accessible through an archive
	item\footnote{\url{https://archive.org/details/refcat_2021-07-28}}.
	The source code used for the derivation process, including exact and fuzzy
	citation matching, is released under an MIT
	license\footnote{\url{https://gitlab.com/internetarchive/refcat}}.
\end{abstract}

\keywords{Citation Graph, Web Archiving}

\section{Introduction}

The Internet Archive releases a first version of a citation graph dataset
derived from a raw corpus of about 2.5B references gathered from metadata and
data obtained by PDF extraction and annotation tools such as
GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with
metadata from Open Library and Wikipedia.
The goal of this report is to describe briefly the current contents and the
derivation of the dataset. We expect
this dataset to be iterated upon, with changes both in content and processing.

According to~\citep{jinha_2010} over 50M scholarly articles have been published
(from 1726) up to 2009, with the rate of publications on the
rise~\citep{landhuis_2016}. In 2014, a study based on academic search engines
estimated that at least 114M English-language scholarly documents are
accessible on the web~\citep{khabsa_giles_2014}.

Modern citation indexes can be traced back to the early computing age, when
projects like the Science Citation Index (1955)~\citep{garfield2007evolution}
were first devised, living on in existing commercial knowledge bases today.
Open alternatives followed, such as the Open Citations Corpus (OCC), started in
2010, whose first version contained 6,325,178 individual
references~\citep{shotton2013publishing}. Other notable early projects
include CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
decade has seen the emergence of more openly available, large-scale
citation projects like Microsoft Academic~\citep{sinha2015overview} or the
Initiative for Open Citations\footnote{\url{https://i4oc.org}}~\citep{shotton2018funders}.
In 2021, over one billion citations are publicly available, marking a ``tipping point''
for this category of data~\citep{hutchins2021tipping}.

While a paper will often cite other papers, other citable entities exist, such
as books or web links, and web links in turn point to a variety of targets, such as web
pages, reference entries, protocols or datasets. References can be extracted
manually or through more automated methods, such as metadata access and
structured data extraction from full text documents, the latter offering the
benefit of scalability. The completeness of bibliographic metadata ranges from
documents with one or more persistent identifiers to raw, potentially unclean
strings partially describing a publication.

\section{Related Work}

Typical problems arising in the process of compiling a citation graph dataset
are data acquisition and citation matching. Data acquisition itself can take
different forms: bibliographic metadata can contain explicit reference data as
provided by publishers and aggregators; this data can be relatively consistent
when looked at per source, but may vary in style and comprehensiveness when
looked at as a whole. Another way of acquiring bibliographic metadata is to
analyze a source document, such as a PDF (or its text), directly. Tools in this
category are often based on conditional random
fields~\citep{lafferty2001conditional} and have been implemented in projects
such as ParsCit~\citep{councill2008parscit},
Cermine~\citep{tkaczyk2014cermine}, EXCITE~\citep{hosseini2019excite}
or GROBID~\citep{lopez2009grobid}.

The problem of citation matching is relatively simple when common, persistent
identifiers are present in the data. Complications mount when there is
\emph{identity uncertainty}, that is, when ``objects are not labeled with unique
identifiers or when those identifiers may not be perceived
perfectly''~\citep{pasula2003identity}. CiteSeer was an early project
concerned with citation matching~\citep{giles1998citeseer}. A taxonomy of
potential issues common in the matching process has been compiled
by~\citep{olensky2016evaluation}. Additional care is required when the
citation matching process is done at scale~\citep{fedoryszak2013large}. The
problem of heterogeneity has been discussed in the context of datasets
by~\citep{mathiak2015challenges}.




% There are a few large scale citation dataset available today. COCI, the
% ``OpenCitations Index of Crossref open DOI-to-DOI citations'' was first
% released 2018-07-29. As of its most recent release\footnote{\url{https://opencitations.net/download}}, on
% 2021-07-29, it contains
% 1,094,394,688 citations across 65,835,422 bibliographic
% resources~\citep{peroni2020opencitations}.
%
% The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
% ``a Wikimedia initiative to develop open citations and linked bibliographic
% data to serve free knowledge'' continously adds citations to its database and
% as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
% publications\footnote{\url{http://wikicite.org/statistics.html}}.
%
% Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
% entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
% with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at
% 	\url{https://archive.org/details/mag-2021-06-07}}  the
% \emph{PaperReferences} relation contains 1,832,226,781 rows (edges) across 123,923,466
% bibliographic entities.
%
% Numerous other projects have been or are concerned with various aspects of
% citation discovery and curation as part their feature set, among them Semantic
% Scholar~\citep{fricke2018semantic}, CiteSeerX~\citep{li2006citeseerx} or Aminer~\citep{tang2016aminer}.
%
% As mentioned in~\citep{hutchins2021tipping}, the number of openly available
% citations is not expected to shrink in the future.


\section{Dataset}

We release the first version of the \emph{refcat} dataset in a format used
internally for storage and to serve queries (which we call \emph{biblioref},
or \emph{bref} for short). The dataset includes metadata from fatcat, the
Open Library project and inbound links from the English Wikipedia. The fatcat
project itself aggregates data from a variety of open data sources, such as
Crossref\footnote{\url{https://crossref.org}}, PubMed\footnote{\url{https://pubmed.ncbi.nlm.nih.gov/}},
DataCite\footnote{\url{https://datacite.org}}, Directory of Open Access Journals (DOAJ)\footnote{\url{https://doaj.org}}, dblp~\citep{ley2002dblp} and others,
as well as metadata generated from analysis of data preserved at the Internet
Archive and active crawls of publication sites on the web.

The dataset is
integrated into the \href{https://fatcat.wiki}{fatcat website} and allows users
to explore inbound and outbound references\footnote{\url{https://guide.fatcat.wiki/reference_graph.html}}.

The format records source and target (fatcat release and work) identifiers, a
few attributes from the metadata (such as year or release stage) as well as
information about the match status and provenance.

The dataset currently contains 1,323,423,672 citations across 76,327,662
entities (55,123,635 unique source and 60,244,206 unique target work
identifiers). For 1,303,424,212 citations, or 98.49\% of the total, we have a DOI
for both source and target.
The majority of matches, 1,250,523,321, are established through identifier-based
matching (DOI, PMID, PMCID, arXiv, ISBN); the remaining 72,900,351 citations are
established through fuzzy matching techniques.
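
As a quick consistency check on these figures, the identifier-based and fuzzy
matches add up to the full dataset:
\[
	1{,}250{,}523{,}321 + 72{,}900{,}351 = 1{,}323{,}423{,}672 ,
\]
so that fuzzy matching accounts for
\[
	\frac{72{,}900{,}351}{1{,}323{,}423{,}672} \approx 5.51\%
\]
of all citations and identifier-based matching for the remaining
$\approx 94.49\%$.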

The majority of citations between \emph{refcat} and COCI overlap, as can be
seen in~Table~\ref{table:cocicmp}.

\begin{table}[]
	\begin{center}
		\begin{tabular}{ll}
			\toprule
			\textbf{Set}              & \textbf{Count}    \\

			\midrule
			COCI (C)              & 1,094,394,688 \\
			\emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
			C $\cap$ R            & 1,007,539,966 \\
			C $\setminus$ R       & 86,854,309    \\
			R $\setminus$ C       & 295,884,246
		\end{tabular}
		\vspace*{2mm}
		\caption{Comparison between COCI and \emph{refcat-doi}, a subset of
			\emph{refcat} where entities have a known DOI. At least 50\% of the
			295,884,246 references only in \emph{refcat-doi} come from links
			recorded within a specific dataset provider (GBIF, DOI prefix:
			10.15468).}
		\label{table:cocicmp}
	\end{center}
\end{table}
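
The counts in Table~\ref{table:cocicmp} can be related by simple set
arithmetic. Writing $|\cdot|$ for the number of citations in a set, the
citations found only in \emph{refcat-doi} follow from
\begin{eqnarray*}
	|R \setminus C| & = & |R| - |C \cap R| \\
	                & = & 1{,}303{,}424{,}212 - 1{,}007{,}539{,}966 \\
	                & = & 295{,}884{,}246 ,
\end{eqnarray*}
which agrees with the table. The overlap corresponds to
\[
	\frac{|C \cap R|}{|C|} \approx 92.06\%
	\quad \mbox{and} \quad
	\frac{|C \cap R|}{|R|} \approx 77.30\%
\]
of COCI and \emph{refcat-doi}, respectively. The same subtraction for COCI
gives $|C| - |C \cap R| = 86{,}854{,}722$, which is 413 more than the tabulated
value for $C \setminus R$; this small gap is not accounted for here.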

% zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
% zstdcat -T0 uniq_34.tsv.zst | pv -l | LC_ALL=C cut -f3,4 | zstd -c -T0 > uniq_34_doi.tsv.zst
% find . -name "*.csv" | parallel -j 16 "LC_ALL=C grep -v ^oci, {} | LC_ALL=C cut -d, -f2,3" | pv -l | zstd -c -T0 > ../6741422v10_doi_only.csv.zst


\section{System Design}

The constraints for the system design are informed by the volume and the
variety of the data. The capability to run the whole graph derivation on a
single machine was a minor goal as well. In total, the raw inputs amount to a
few terabytes of textual content, mostly newline-delimited JSON. More
importantly, while the number of data fields is low, the schemas are often only
partially populated, with hundreds of different combinations of available field
values found in the raw reference data. This is most likely caused by
aggregators passing on reference data from hundreds of sources, which do not
necessarily agree on a common granularity for citation data, and by artifacts
of machine-learning-based structured data extraction tools.

Each combination of fields may require a slightly different processing path.
For example, references with an arXiv identifier can be processed differently
from references with only a title. Over 50\% of the raw reference data is
covered by just eight field combinations, as listed in
Table~\ref{table:fields}.

\begin{table}[]
	\begin{center}
		\begin{tabular}{ll}
			\toprule
			\textbf{Fields}                                                                                 & \textbf{Percentage} \\
			\midrule
			\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ P $\cdot$ T $\cdot$  U $\cdot$  V $\cdot$ Y}          & 14\%            \\
			\multicolumn{1}{l}{\textbf{DOI}}                                                                & 14\%            \\
			\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 5\%             \\
			\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ U $\cdot$ V $\cdot$ Y}           & 4\%             \\
			\multicolumn{1}{l}{\textbf{PMID} $\cdot$ U}                                                     & 4\%             \\
			\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ T $\cdot$ V $\cdot$ Y}           & 4\%             \\
			\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ Y}                                                    & 4\%             \\
			\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ V $\cdot$ Y}                     & 4\%             \\
		\end{tabular}
		\vspace*{2mm}
		\caption{Top 8 combinations of available fields in raw reference data
			accounting for about 53\% of the total data (CN = container name, CRN =
			contrib raw name, P = pages, T = title, U = unstructured, V = volume, IS =
			issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any value. Identifiers emphasized.}
		\label{table:fields}
	\end{center}
\end{table}
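
A breakdown like the one in Table~\ref{table:fields} could be computed along
the following lines. This is a minimal sketch and not the code used for this
report: the JSON key names in \texttt{ABBREV} and the sample records are
assumptions made for illustration, and only the abbreviations follow the table
caption.

{\footnotesize
\begin{verbatim}
```python
import json
from collections import Counter

# Hypothetical JSON keys mapped to the abbreviations
# used in the table caption.
ABBREV = {
    "container_name": "CN",
    "contrib_raw_name": "CRN",
    "doi": "DOI",
    "pages": "P",
    "title": "T",
    "unstructured": "U",
    "volume": "V",
    "year": "Y",
}


def field_signature(doc):
    """Sorted abbreviations of non-empty known fields."""
    present = [ABBREV[k] for k, v in doc.items()
               if k in ABBREV and v not in (None, "", [])]
    return " ".join(sorted(present))


# Made-up records standing in for raw reference data.
LINES = [
    '{"doi": "10.1/abc"}',
    '{"doi": "10.1/xyz", "title": ""}',
    '{"container_name": "Nature",'
    ' "contrib_raw_name": "Doe", "year": 1999}',
]

counts = Counter(
    field_signature(json.loads(line)) for line in LINES
)
for signature, n in counts.most_common():
    print(f"{signature}: {n / len(LINES):.0%}")
```
\end{verbatim}
}

Treating empty strings and lists as absent is a design choice of this sketch;
whether a field counts as available in the real data depends on the source.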

Overall, a map-reduce style~\citep{dean2010mapreduce} approach is
followed\footnote{While the operations are similar, the processing is not
	distributed but runs on a single machine. For space efficiency, zstd~\citep{collet2018zstandard} is used to compress raw data and derivations.}, which allows
for some
uniformity in the overall processing. We extract (key, document) tuples (as
TSV) from the raw JSON data and sort by key. We then group documents with the
same key and apply a function on each group in order to generate
our target schema or perform
additional operations such as deduplication or fusion of matched and unmatched references.
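
The extract, sort and group steps can be illustrated with a small in-memory
sketch. It is not the refcat implementation: the function names, the
\texttt{kind} and \texttt{id} fields and the sample records are hypothetical,
and the sort happens in memory, whereas data at the scale described here would
need an external sort.

{\footnotesize
\begin{verbatim}
```python
import io
import json
from itertools import groupby
from operator import itemgetter


def extract(lines, key_func):
    """Map step: emit (key, document) tuples."""
    for line in lines:
        doc = json.loads(line)
        key = key_func(doc)
        if key:  # skip documents without a usable key
            yield key, doc


def group_by_key(pairs):
    """Sort by key, then yield (key, documents)."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, grp in groupby(ordered, key=itemgetter(0)):
        yield key, [doc for _, doc in grp]


def link_refs(key, docs):
    """Reduce step: link references to releases."""
    refs = [d for d in docs if d["kind"] == "ref"]
    rels = [d for d in docs if d["kind"] == "release"]
    return [(r["id"], t["id"], key)
            for r in refs for t in rels]


def doi_key(doc):
    """Exact key: lowercased DOI, or None if absent."""
    doi = doc.get("doi")
    return doi.lower() if doi else None


# Toy input standing in for newline-delimited JSON.
RAW = io.StringIO(
    '{"kind": "release", "id": "rel-1",'
    ' "doi": "10.1/ABC"}\n'
    '{"kind": "ref", "id": "ref-7",'
    ' "doi": "10.1/abc"}\n'
    '{"kind": "ref", "id": "ref-8",'
    ' "title": "No DOI here"}\n'
)

edges = []
pairs = extract(RAW, doi_key)
for key, docs in group_by_key(pairs):
    edges.extend(link_refs(key, docs))
print(edges)
```
\end{verbatim}
}

With the toy input, the only edge produced links \texttt{ref-7} to
\texttt{rel-1}; the reference without a DOI is skipped by this key function and
would be handled by a different key derivation.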

The key derivation can be exact (via an identifier like DOI, PMID, etc.) or
based on a value normalization, like ``slugifying'' a title string. For
identifier-based matches we can generate the target schema directly. For fuzzy matching
candidates, we pass possible match pairs through a verification procedure,
which is implemented for \emph{release entity}\footnote{\url{https://guide.fatcat.wiki/entity_release.html}.} pairs. This procedure is a
domain-dependent, rule-based verification, able to identify different versions
of a publication, preprint-published pairs and documents that
are similar by various metrics calculated over title and author fields. The fuzzy matching
approach is applied to all reference documents without an identifier (a title is
currently required).
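
To make the idea of a normalized key and a subsequent verification step
concrete, the following sketch slugifies titles and compares author name tokens
by Jaccard similarity. It is an illustration under stated assumptions only: the
normalization rules, the token definition and the threshold of 0.5 are
hypothetical and do not reproduce the rules implemented in refcat.

{\footnotesize
\begin{verbatim}
```python
import re
import unicodedata


def fold(text):
    """Strip accents and lowercase (ASCII only)."""
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore")
    return text.decode("ascii").lower()


def slugify(title):
    """Title key: keep only letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", fold(title))


def author_tokens(names):
    """Name tokens of two or more letters."""
    tokens = set()
    for name in names:
        found = re.findall(r"[a-z]{2,}", fold(name))
        tokens.update(found)
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def verify(ref, rel, threshold=0.5):
    """Strict check on a candidate pair (assumed rule)."""
    if slugify(ref["title"]) != slugify(rel["title"]):
        return False
    score = jaccard(author_tokens(ref["authors"]),
                    author_tokens(rel["authors"]))
    return score >= threshold


ref = {"title": "The Fatcat Citation-Graph!",
       "authors": ["M. Czygan", "B. Newbold"]}
release = {"title": "the fatcat citation graph",
           "authors": ["Martin Czygan", "Bryan Newbold"]}
print(slugify(ref["title"]), verify(ref, release))
```
\end{verbatim}
}

Dropping single-letter tokens keeps initials from lowering the similarity
score; this is a choice made for the sketch, not a documented refcat rule.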

With a few schema conversions, fuzzy matching can be applied to Wikipedia
articles and Open Library (edition) records as well. Precision
and recall are addressed by the two stages: we are generous in the match
candidate generation phase in order to improve recall, but strict during
verification in order to control precision. Quality assurance for verification is
implemented through a growing list of test cases of real examples from the catalog and
their expected or desired match status\footnote{The list can be found under:
	\url{https://gitlab.com/internetarchive/refcat/-/blob/master/skate/testdata/verify.csv}.
	It is helpful to keep this test suite independent of any specific programming language.}.
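
A language-independent list of test cases can be consumed by a small
table-driven check, sketched below. The column layout, the status labels and
the toy verifier are assumptions made for this example; they do not describe
the actual layout of the test file or the verification rules.

{\footnotesize
\begin{verbatim}
```python
import csv
import io

# Hypothetical layout: two titles, expected status.
CASES = io.StringIO(
    "title_a,title_b,expected\n"
    "Deep Learning,deep  learning,exact\n"
    "Deep Learning,Shallow Parsing,different\n"
)


def normalize(title):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def verify(title_a, title_b):
    """Toy verifier based on normalized titles only."""
    if normalize(title_a) == normalize(title_b):
        return "exact"
    return "different"


def run_cases(handle):
    """Return rows whose outcome differs from expectation."""
    failures = []
    for row in csv.DictReader(handle):
        got = verify(row["title_a"], row["title_b"])
        if got != row["expected"]:
            failures.append((row, got))
    return failures


failures = run_cases(CASES)
print(len(failures), "failing case(s)")
```
\end{verbatim}
}

Because the cases live in a plain CSV file, the same list can be replayed
against a verifier written in any language.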


\section{Limitations and Future Work}

As with other datasets in this field, we expect this dataset to be iterated upon.

\begin{itemize}
	\item The fatcat catalog updates its metadata
	      continuously\footnote{A changelog can currently be followed here:
		      \url{https://fatcat.wiki/changelog}} and web crawls are conducted
	      regularly. Current processing pipelines cover raw reference snapshot
	      creation and derivation of the graph structure, which allows us to rerun
	      processing based on updated data as it becomes available.

	\item Metadata extraction from PDFs depends on supervised machine learning
	      models, which in turn depend on available training datasets. With additional crawls and
	      metadata available we hope to improve models used for metadata
	      extraction, improving yield and reducing data extraction artifacts in
	      the process.

	\item As of this version, a number of raw reference
	      documents remain unmatched, which means that neither exact nor fuzzy matching
	      has detected a link to a known entity. On the one
	      hand, this can hint at missing metadata. On the other hand, parts of the data
	      will contain a reference to a catalogued entity, but in a specific,
	      dense and harder to recover form.
	      Addressing this also includes improvements to the fuzzy matching approach.
	\item The reference dataset contains millions of URLs and their integration
	      into the graph has been implemented as a prototype. A full implementation
	      requires a few data cleanup and normalization steps.
\end{itemize}

\section{Acknowledgements}

This work is partially supported by a grant from the \emph{Andrew W. Mellon
	Foundation}.


\section{Appendix A}


A note on data quality: while we implement various data quality measures,
real-world data, especially when it comes from many different sources, will contain
issues. Among other measures, we keep track of match reasons,
especially for fuzzy matching, so that we can zoom in on systematic errors
more easily (see~Table~\ref{table:matches}).

\begin{table}[]
	\footnotesize
	\captionsetup{font=normalsize}
	\begin{center}
		\begin{tabular}{@{}rlll@{}}
			\toprule
			\textbf{Count} & \textbf{Provenance} & \textbf{Status} & \textbf{Reason}      \\ \midrule
			934932865      & crossref            & exact           & doi                  \\
			151366108      & fatcat-datacite     & exact           & doi                  \\
			65345275       & fatcat-pubmed       & exact           & pmid                 \\
			48778607       & fuzzy               & strong          & jaccardauthors       \\
			42465250       & grobid              & exact           & doi                  \\
			29197902       & fatcat-pubmed       & exact           & doi                  \\
			19996327       & fatcat-crossref     & exact           & doi                  \\
			11996694       & fuzzy               & strong          & slugtitleauthormatch \\
			9157498        & fuzzy               & strong          & tokenizedauthors     \\
			3547594        & grobid              & exact           & arxiv                \\
			2310025        & fuzzy               & exact           & titleauthormatch     \\
			1496515        & grobid              & exact           & pmid                 \\
			680722         & crossref            & strong          & jaccardauthors       \\
			476331         & fuzzy               & strong          & versioneddoi         \\
			449271         & grobid              & exact           & isbn                 \\
			230645         & fatcat-crossref     & strong          & jaccardauthors       \\
			190578         & grobid              & strong          & jaccardauthors       \\
			156657         & crossref            & exact           & isbn                 \\
			123681         & fatcat-pubmed       & strong          & jaccardauthors       \\
			79328          & crossref            & exact           & arxiv                \\
			57414          & crossref            & strong          & tokenizedauthors     \\
			53480          & fuzzy               & strong          & pmiddoipair          \\
			52453          & fuzzy               & strong          & dataciterelatedid    \\
			47119          & grobid              & strong          & slugtitleauthormatch \\
			36774          & fuzzy               & strong          & arxivversion         \\
			% 35311          & fuzzy               & strong          & customieeearxiv      \\
			% 33863          & grobid              & exact           & pmcid                \\
			% 23504          & crossref            & strong          & slugtitleauthormatch \\
			% 22753          & fatcat-crossref     & strong          & tokenizedauthors     \\
			% 17720          & grobid              & exact           & titleauthormatch     \\
			% 14656          & crossref            & exact           & titleauthormatch     \\
			% 14438          & grobid              & strong          & tokenizedauthors     \\
			% 7682           & fatcat-crossref     & exact           & arxiv                \\
			% 5972           & fatcat-crossref     & exact           & isbn                 \\
			% 5525           & fatcat-pubmed       & exact           & arxiv                \\
			% 4290           & fatcat-pubmed       & strong          & tokenizedauthors     \\
			% 2745           & fatcat-pubmed       & exact           & isbn                 \\
			% 2342           & fatcat-pubmed       & strong          & slugtitleauthormatch \\
			% 2273           & fatcat-crossref     & strong          & slugtitleauthormatch \\
			% 1960           & fuzzy               & exact           & workid               \\
			% 1150           & fatcat-crossref     & exact           & titleauthormatch     \\
			% 1041           & fatcat-pubmed       & exact           & titleauthormatch     \\
			% 895            & fuzzy               & strong          & figshareversion      \\
			% 317            & fuzzy               & strong          & titleartifact        \\
			% 82             & grobid              & strong          & titleartifact        \\
			% 33             & crossref            & strong          & titleartifact        \\
			% 5              & fuzzy               & strong          & custombsiundated     \\
			% 1              & fuzzy               & strong          & custombsisubdoc      \\
			% 1              & fatcat              & exact           & doi                  \\ \bottomrule
		\end{tabular}
		\vspace*{2mm}
		\caption{Table of match counts (top 25), reference provenance, match
			status and match reason. Provenance currently can name the raw
			origin (e.g. \emph{crossref}) or the method (e.g. \emph{fuzzy}). The match reason
			identifiers encode specific rules in the domain-dependent
			verification process and are included for completeness; we do not
			include the details of each rule in this report.}
		\label{table:matches}
	\end{center}
\end{table}

\bibliographystyle{abbrv}
% \bibliographystyle{plainnat}
\bibliography{refs}
\end{document}