\documentclass[hidelinks,10pt,twocolumn]{article}
\usepackage{simpleConference}
\usepackage[utf8]{inputenc}
\usepackage{times}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{doi}
\usepackage{amssymb}
\usepackage{url,hyperref}
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{caption}

\usepackage{datetime}
\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
\setlength{\parindent}{0pt}

\begin{document}

\title{Fatcat Reference Dataset}

\author{Martin Czygan \\
\\
Internet Archive \\
San Francisco, California, USA \\
martin@archive.org  \\
\and
Bryan Newbold \\
\\
Internet Archive \\
San Francisco, California, USA \\
bnewbold@archive.org  \\
\\
}


\maketitle
\thispagestyle{empty}


\begin{abstract}
As part of its scholarly data efforts, the Internet Archive releases a first version of a citation
graph dataset, named \emph{refcat}, derived from scholarly publications and
additional data sources. It is composed of data gathered by the fatcat
cataloging project\footnote{\url{https://fatcat.wiki}}, related web-scale
crawls targeting primary and secondary scholarly outputs, as well as metadata
from the Open Library\footnote{\url{https://openlibrary.org}} project and
Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
graph consists of 1,323,423,672 citations. We release this dataset under a CC0
Public Domain Dedication, accessible through an archive
collection\footnote{\url{https://archive.org/details/refcat_2021-07-28}}. All
code used in the derivation process is released under an MIT
license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
\end{abstract}

\keywords{Citation Graph, Web Archiving}

\section{Introduction}


The Internet Archive releases a first version of a citation graph dataset
derived from a raw corpus of about 2.5B references gathered from metadata and
data obtained by PDF extraction tools such as
GROBID\cite{lopez2009grobid}. Additionally, we consider integration with
metadata from Open Library and Wikipedia.
The goal of this report is to describe briefly the current contents and the
derivation of the dataset. We expect
this dataset to be iterated upon, with changes both in content and processing.

Modern citation indexes can be traced back to the early computing age, when
projects like the Science Citation Index (1955)\citep{garfield2007evolution}
were first devised, living on in existing commercial knowledge bases today.
Open alternatives followed, such as the Open Citations Corpus (OCC) in 2010,
the first version of which contained 6,325,178 individual
references\citep{shotton2013publishing}. Other notable early projects
include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last
decade has seen the emergence of more openly available, large scale
citation projects like Microsoft Academic\citep{sinha2015overview} or the
Initiative for Open Citations\citep{i4oc}\citep{shotton2018funders}. In 2021,
according to \citep{hutchins2021tipping}, over 1B citations are publicly
available, marking a tipping point for this category of data.

\section{Related Work}

There are a few large-scale citation datasets available today. COCI, the
``OpenCitations Index of Crossref open DOI-to-DOI citations'', was first
released on 2018-07-29. As of its most recent release\footnote{\url{https://opencitations.net/download}}, on
2021-07-29, it contains
1,094,394,688 citations across 65,835,422 bibliographic
resources\citep{peroni2020opencitations}.

The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
``a Wikimedia initiative to develop open citations and linked bibliographic
data to serve free knowledge'', continuously adds citations to its database and
as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
publications\footnote{\url{http://wikicite.org/statistics.html}}.

Microsoft Academic Graph\citep{sinha2015overview} comprises a number of
entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at
\url{https://archive.org/details/mag-2021-06-07}}  the
\emph{PaperReferences} relation contains 1,832,226,781 rows (edges) across 123,923,466
bibliographic entities.

Numerous other projects have been or are concerned with various aspects of
citation discovery and curation as part of their feature set, among them Semantic
Scholar\citep{fricke2018semantic}, CiteSeerX\citep{li2006citeseerx} or Aminer\citep{tang2016aminer}.

As mentioned in \citep{hutchins2021tipping}, the number of openly available
citations is not expected to shrink in the future.


\section{Dataset}

We release the first version of the \emph{refcat} dataset
in a format used internally for storage and to serve queries (which we
call \emph{biblioref} or \emph{bref} for short). The dataset includes metadata
from fatcat and the Open Library Project and inbound links from the English Wikipedia.

The format contains source and target (fatcat release and work) identifiers, a
few attributes from the metadata (such as year or release stage) as well as
information about the match status and provenance.

The dataset currently contains 1,323,423,672 citations across 76,327,662
entities (55,123,635 unique source and 60,244,206 unique target work identifiers).
The majority of matches (1,250,523,321) are established through identifier
based matching (DOI, PMID, PMCID, arXiv, ISBN). 72,900,351 citations are
established through fuzzy matching.
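
As a consistency check computed from the figures above, the two match
categories add up to the total number of citations:
\[
\begin{array}{rr}
  & 1{,}250{,}523{,}321 \\
+ & 72{,}900{,}351 \\ \hline
= & 1{,}323{,}423{,}672
\end{array}
\]
Identifier based matches therefore account for about 94.5\% of the graph and
fuzzy matches for about 5.5\%.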

The majority of DOI based matches between \emph{refcat} and COCI overlap, as can be
seen in~Table~\ref{table:cocicmp}.

\begin{table}[]
    \begin{center}
    \begin{tabular}{ll}
\toprule
\textbf{Set}          & \textbf{Count} \\

\midrule
        COCI (C)        &   1,094,394,688    \\
        \emph{refcat-doi} (R)   &   1,303,424,212    \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
        C $\cap$ R      &   1,007,539,966    \\
        C $\setminus$ R &      86,854,309  \\
        R $\setminus$ C &     295,884,246    \\
\bottomrule
    \end{tabular}
    \vspace*{2mm}
	\caption{Comparison between COCI and \emph{refcat-doi}, a subset of
\emph{refcat} where entities have a known DOI. At least 50\% of the 295,884,246
references only in \emph{refcat-doi} come from links between datasets (GBIF,
DOI prefix: 10.15468).}
     \label{table:cocicmp}
    \end{center}
\end{table}
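
The rows of Table~\ref{table:cocicmp} can be related by simple set identities.
Writing $|X|$ for the number of citations in a set $X$ (notation introduced
here only for this check), the \emph{refcat-doi} side is consistent:
\[
\begin{array}{rcl}
|R| &=& |C \cap R| + |R \setminus C| \\
    &=& 1{,}007{,}539{,}966 + 295{,}884{,}246 \\
    &=& 1{,}303{,}424{,}212
\end{array}
\]
The corresponding sum for COCI, $1{,}007{,}539{,}966 + 86{,}854{,}309 =
1{,}094{,}394{,}275$, falls 413 short of the listed COCI total; the table does
not account for this small difference. In relative terms, based on the listed
totals,
\[
\frac{|C \cap R|}{|C|} \approx 0.921,
\qquad
\frac{|C \cap R|}{|R|} \approx 0.773,
\]
that is, about 92\% of COCI citations are also found in \emph{refcat-doi},
while about 77\% of \emph{refcat-doi} citations are also found in COCI.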

% zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
% zstdcat -T0 uniq_34.tsv.zst | pv -l | LC_ALL=C cut -f3,4 | zstd -c -T0 > uniq_34_doi.tsv.zst
% find . -name "*.csv" | parallel -j 16 "LC_ALL=C grep -v ^oci, {} | LC_ALL=C cut -d, -f2,3" | pv -l | zstd -c -T0 > ../6741422v10_doi_only.csv.zst


\section{System Design}

The constraints for the system design are informed by the volume and the
variety of the data. The capability to run the whole graph derivation on a
single machine was a minor goal as well. In total, the raw inputs amount to a few
TB of textual content, mostly newline delimited JSON. More importantly, while
the number of data fields is low, the schema is often only partially populated, with
hundreds of different combinations of available field values found in the raw
reference data. This is most likely caused by aggregators passing on reference
data coming from hundreds of sources, which do not necessarily agree on
a common granularity for citation data, and by artifacts of machine learning
based structured data extraction tools.

Each combination of fields may require a slightly different processing path.
For example, references with an arXiv identifier can be processed differently
from references with only a title. Over 50\% of the raw reference data comes
from a set of eight field set manifestations, as listed in
Table~\ref{table:fields}.

\begin{table}[]
    \begin{center}
    \begin{tabular}{ll}
\toprule
        \textbf{Fields}                                    & \textbf{Percentage} \\
\midrule
    \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ P $\cdot$ T $\cdot$  U $\cdot$  V $\cdot$ Y}    & 14\%                              \\
    \multicolumn{1}{l}{\textbf{DOI}}                 & 14\%                              \\
        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 5\%                               \\
        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ U $\cdot$ V $\cdot$ Y}    & 4\%                               \\
        \multicolumn{1}{l}{\textbf{PMID} $\cdot$ U}              & 4\%                               \\
        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ T $\cdot$ V $\cdot$ Y}    & 4\%                               \\
        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ Y}            & 4\%                               \\
        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ V $\cdot$ Y}      & 4\%                               \\
    \end{tabular}
    \vspace*{2mm}
    \caption{Top 8 combinations of available fields in raw reference data
        accounting for about 53\% of the total data (CN = container name, CRN =
contrib raw name, P = pages, T = title, U = unstructured, V = volume, IS =
issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any value. Identifiers emphasized.}
    \label{table:fields}
\end{center}
\end{table}
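
Summing the rounded percentages in Table~\ref{table:fields} reproduces the
coverage stated in its caption:
\[
14 + 14 + 5 + 4 + 4 + 4 + 4 + 4 = 53 .
\]
Of these, the five combinations that carry an emphasized identifier (DOI or
PMID) add up to
\[
14 + 4 + 4 + 4 + 4 = 30 ,
\]
so, as far as the rounded values allow such a reading, roughly 30\% of the raw
reference data fall into top-eight combinations that contain an identifier.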

Overall, a map-reduce style approach is followed, which allows for some
uniformity in the overall processing. We extract (key, document) tuples (as
TSV) from the raw JSON data and sort by key. We then group documents with the
same key and apply a function on each group in order to generate
our target schema or perform
additional operations such as deduplication or fusion of matched and unmatched references.
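
The grouping step can be written down compactly. The following notation is
only a sketch and not part of the implementation: let $D$ be the set of raw
documents, $k$ the key function and $f$ the function applied to each group
(all three symbols are introduced here for illustration).
\[
\begin{array}{rcl}
\mathrm{map} &:& d \mapsto (k(d), d), \quad d \in D \\
G_\kappa &=& \{\, d \in D : k(d) = \kappa \,\} \\
\mathrm{out} &=& \bigcup_{\kappa} f(G_\kappa)
\end{array}
\]
Sorting by key places all members of a group $G_\kappa$ on adjacent lines, so
$f$ can be applied in one sequential pass over the sorted file.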

The key derivation can be exact (an identifier such as DOI or PMID) or
based on a value normalization, like slugifying a title string. For identifier
based matches we can generate the target schema directly. For fuzzy matching
candidates, we pass possible match pairs through a verification procedure,
which is implemented for \emph{release entity} pairs. This procedure is a
domain dependent, rule based verification, able to identify different versions
of a publication, preprint-published pairs and documents which
are similar by various metrics calculated over title and authors. The fuzzy matching
approach is applied to all reference documents without an identifier (a title is
currently required).
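
One of the match reasons recorded for fuzzy matches is \emph{jaccardauthors}
(Table~\ref{table:matches}). As an illustration, and assuming that this name
refers to the standard Jaccard index over sets of author names (the exact
tokenization and thresholds are not part of this report), the similarity of two
author sets $A$ and $B$ is
\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|} .
\]
For the hypothetical sets $A = \{\mathrm{smith}, \mathrm{jones},
\mathrm{lee}\}$ and $B = \{\mathrm{smith}, \mathrm{lee}, \mathrm{kim}\}$ the
intersection has two elements and the union four, so $J(A, B) = 2/4 = 0.5$.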

With a few schema conversions, fuzzy matching can be applied to Wikipedia
articles and Open Library (edition) records as well. The two stages address
recall and precision, respectively: we are generous in the match
candidate generation phase in order to improve recall, but we are strict during
verification, in order to control precision. Quality assurance for verification is
implemented through a growing list of test cases of real examples from the catalog and
their expected or desired match status\footnote{The list can be found under:
\url{https://gitlab.com/internetarchive/cgraph/-/blob/master/skate/testdata/verify.csv}.
It is helpful to keep this test suite independent of any specific programming language.}.
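
For reference, the two notions can be stated with the standard definitions,
where $TP$, $FP$ and $FN$ denote the number of true positive, false positive
and false negative matches (generic symbols, not measurements from this
dataset):
\[
\mathrm{precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{recall} = \frac{TP}{TP + FN} .
\]
A generous candidate generation phase aims to keep $FN$ small (higher recall),
while a strict verification phase aims to keep $FP$ small (higher precision).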


\section{Limitations and Future Work}

As with other datasets in this field, we expect this dataset to be iterated upon.

\begin{itemize}
    \item The fatcat catalog updates its metadata
        continuously\footnote{A changelog can currently be followed here:
        \url{https://fatcat.wiki/changelog}} and web crawls are conducted
        regularly.  Current processing pipelines cover raw reference snapshot
        creation and derivation of the graph structure, which allows us to rerun
        processing based on updated data as it becomes available.

    \item Metadata extraction from PDFs depends on supervised machine learning
        models, which in turn depend on training sets. With additional crawls and
        metadata available we hope to improve models used for metadata
        extraction, improving yield and reducing data extraction artifacts in
        the process.

    \item As of this version, a number of raw reference
        docs remain unmatched, which means that neither exact nor fuzzy matching
        can detect a link to a known entity. On the one
        hand, this can hint at missing metadata. On the other hand, parts of the data
        will contain a reference to a catalogued entity, but in a specific,
        dense and harder to recover form.
        Addressing this also includes improvements to the fuzzy matching approach.
    \end{itemize}

\section{Acknowledgements}

This work is partially supported by a grant from the \emph{Andrew W. Mellon
Foundation}. We would like to thank various teams at the Internet Archive for
providing necessary infrastructure and data processing expertise. We are
also indebted to various open source software tools and their maintainers as
well as open scholarly data projects; without those this work would be much
harder, if possible at all.


\section{Appendix A}


A note on data quality: While we implement various data quality measures,
real-world data, especially when coming from many different sources, will contain
errors. Among other measures, we keep track of match reasons,
especially for fuzzy matching, to be able to zoom in on systematic errors
more easily (see~Table~\ref{table:matches}).

\begin{table}[]
    \footnotesize
    \captionsetup{font=normalsize}
    \begin{center}
\begin{tabular}{@{}rlll@{}}
\toprule
\textbf{Count} & \textbf{Provenance} & \textbf{Status} & \textbf{Reason} \\ \midrule
934932865                  & crossref                  & exact                 & doi                   \\
151366108                  & fatcat-datacite           & exact                 & doi                   \\
65345275                   & fatcat-pubmed             & exact                 & pmid                  \\
48778607                   & fuzzy                     & strong                & jaccardauthors        \\
42465250                   & grobid                    & exact                 & doi                   \\
29197902                   & fatcat-pubmed             & exact                 & doi                   \\
19996327                   & fatcat-crossref           & exact                 & doi                   \\
11996694                   & fuzzy                     & strong                & slugtitleauthormatch  \\
9157498                    & fuzzy                     & strong                & tokenizedauthors      \\
3547594                    & grobid                    & exact                 & arxiv                 \\
2310025                    & fuzzy                     & exact                 & titleauthormatch      \\
1496515                    & grobid                    & exact                 & pmid                  \\
680722                     & crossref                  & strong                & jaccardauthors        \\
476331                     & fuzzy                     & strong                & versioneddoi          \\
449271                     & grobid                    & exact                 & isbn                  \\
230645                     & fatcat-crossref           & strong                & jaccardauthors        \\
190578                     & grobid                    & strong                & jaccardauthors        \\
156657                     & crossref                  & exact                 & isbn                  \\
123681                     & fatcat-pubmed             & strong                & jaccardauthors        \\
79328                      & crossref                  & exact                 & arxiv                 \\
57414                      & crossref                  & strong                & tokenizedauthors      \\
53480                      & fuzzy                     & strong                & pmiddoipair           \\
52453                      & fuzzy                     & strong                & dataciterelatedid     \\
47119                      & grobid                    & strong                & slugtitleauthormatch  \\
36774                      & fuzzy                     & strong                & arxivversion          \\
35311                      & fuzzy                     & strong                & customieeearxiv       \\
33863                      & grobid                    & exact                 & pmcid                 \\
23504                      & crossref                  & strong                & slugtitleauthormatch  \\
22753                      & fatcat-crossref           & strong                & tokenizedauthors      \\
17720                      & grobid                    & exact                 & titleauthormatch      \\
14656                      & crossref                  & exact                 & titleauthormatch      \\
14438                      & grobid                    & strong                & tokenizedauthors      \\
7682                       & fatcat-crossref           & exact                 & arxiv                 \\
5972                       & fatcat-crossref           & exact                 & isbn                  \\
5525                       & fatcat-pubmed             & exact                 & arxiv                 \\
4290                       & fatcat-pubmed             & strong                & tokenizedauthors      \\
2745                       & fatcat-pubmed             & exact                 & isbn                  \\
2342                       & fatcat-pubmed             & strong                & slugtitleauthormatch  \\
2273                       & fatcat-crossref           & strong                & slugtitleauthormatch  \\
1960                       & fuzzy                     & exact                 & workid                \\
1150                       & fatcat-crossref           & exact                 & titleauthormatch      \\
1041                       & fatcat-pubmed             & exact                 & titleauthormatch      \\
895                        & fuzzy                     & strong                & figshareversion       \\
317                        & fuzzy                     & strong                & titleartifact         \\
82                         & grobid                    & strong                & titleartifact         \\
33                         & crossref                  & strong                & titleartifact         \\
5                          & fuzzy                     & strong                & custombsiundated      \\
1                          & fuzzy                     & strong                & custombsisubdoc       \\
1                          & fatcat                    & exact                 & doi                   \\ \bottomrule
\end{tabular}
    \vspace*{2mm}
    \caption{Table of match counts, reference provenance, match status and
match reason. The match reason identifiers encode specific rules in the domain
dependent verification process and are included for completeness; we do not
include the details of each rule in this report.}
    \label{table:matches}
\end{center}
\end{table}

\bibliographystyle{abbrv}
% \bibliographystyle{plainnat}
\bibliography{refs}
\end{document}