\documentclass[10pt,twocolumn]{article}
\usepackage{simpleConference}
\usepackage[utf8]{inputenc}
\usepackage{times}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{doi}
\usepackage{amssymb}
\usepackage{url,hyperref}
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.

\usepackage{datetime}
\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
\setlength{\parindent}{0pt}

\begin{document}

\title{Archive Scholar Reference Dataset}

\author{Martin Czygan \\
\\
Internet Archive \\
San Francisco, California, USA \\
martin@archive.org  \\
\and
Bryan Newbold \\
\\
Internet Archive \\
San Francisco, California, USA \\
bnewbold@archive.org  \\
\\
}


\maketitle
\thispagestyle{empty}


\begin{abstract}
As part of its scholarly data efforts, the Internet Archive releases a citation
graph dataset (ASREF) derived from scholarly publications and additional data
sources. It is composed of data gathered by the fatcat cataloging
project\footnote{\url{https://fatcat.wiki}} and related web-scale crawls
targeting primary and secondary scholarly outputs. In addition, relations are
worked out between scholarly publications, web pages and their archived copies,
books from the Open Library project as well as Wikipedia articles. This first
version of the graph consists of over X nodes and over Y edges. We release this
dataset under a Z open license under the collection as an archive
item\footnote{\url{https://archive.org/details/fatcat-asref-todo}}. All code
used in the derivation process is released under an MIT
license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
\end{abstract}

\keywords{Citation Graph, Web Archiving}

\section{Introduction}


The Internet Archive releases a first version of a citation graph dataset
derived from a raw corpus of about 2.5B references gathered from metadata and
from data obtained by PDF extraction tools such as GROBID\cite{lopez2009grobid}.
This report briefly describes the current contents and the derivation of the
Archive Scholar Reference Dataset (ASREF). We expect this dataset to be
iterated upon, with changes both in content and processing.

Modern citation indexes can be traced back to the early computing age, when
projects like the Science Citation Index (1955)\citep{garfield2007evolution}
were first devised, living on in existing commercial knowledge bases today.
Open alternatives followed, such as the Open Citations Corpus (OCC), started
in 2010, the first version of which contained 6,325,178 individual
references\citep{shotton2013publishing}. Other notable sources from that time
include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last
decade has seen an increase in openly available reference datasets and
citation projects, such as Microsoft Academic\citep{sinha2015overview} and the
Initiative for Open Citations\citep{i4oc}\citep{shotton2018funders}. In 2021,
according to \citep{hutchins2021tipping}, over 1B citations are publicly
available, marking a tipping point for open citations.

\section{Related Work}

A few large-scale citation datasets are available today. COCI, the
``OpenCitations Index of Crossref open DOI-to-DOI citations'', was first
released on 2018-07-29. As of its most recent release on 2021-07-29, it contains
1,094,394,688 citations across 65,835,422 bibliographic resources.

The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
``a Wikimedia initiative to develop open citations and linked bibliographic
data to serve free knowledge'', continuously adds citations to its database and
as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
publications\footnote{\url{http://wikicite.org/statistics.html}}.

Microsoft Academic Graph\footnote{A recent copy has been preserved at
\url{https://archive.org/details/mag-2021-06-07}} comprises a number of
entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
with PaperReferences being one relation among many others. As of 2021-06-07 the
PaperReferences relation contains 1,832,226,781 edges across YYY bibliographic
entities.

Numerous other projects have been or are concerned with various aspects of
citation discovery and curation, among them Semantic Scholar, CiteSeerX, and
AMiner.

As mentioned in \citep{hutchins2021tipping}, the number of openly available
citations is not expected to shrink in the future.


\section{Citation Dataset}

We release the first version of the ASREF dataset in a format used internally
for storage and display, which we call \emph{biblioref}. The format contains
source and target fatcat release and work identifiers, a few attributes from
the metadata (such as year or release stage), and information about the match
provenance (such as match status or reason). For ease of use, we also include
DOIs where available.
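
As an illustration, a single \emph{biblioref} document could be modeled as
shown below. This is a minimal sketch in Python and not the actual schema: all
field names and the example identifiers are assumptions, chosen only to mirror
the attributes described above.

{\footnotesize
\begin{verbatim}
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Biblioref:
    # All field names are assumptions.
    source_release_ident: str
    source_work_ident: str
    target_release_ident: str
    target_work_ident: str
    source_year: Optional[int] = None
    source_release_stage: Optional[str] = None
    match_provenance: Optional[str] = None
    match_status: Optional[str] = None
    match_reason: Optional[str] = None
    source_doi: Optional[str] = None
    target_doi: Optional[str] = None

    def to_json(self) -> str:
        # Drop unset attributes before writing.
        doc = {k: v
               for k, v in asdict(self).items()
               if v is not None}
        return json.dumps(doc, sort_keys=True)


# Placeholder identifiers, not real entities.
edge = Biblioref(
    "rel-a", "work-a", "rel-b", "work-b",
    match_provenance="crossref",
    match_status="exact",
    match_reason="doi",
)
line = edge.to_json()
\end{verbatim}
}

Unset attributes are dropped on serialization, so each edge becomes one compact
line of newline-delimited JSON.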

The dataset currently contains X unique bibliographic entities and Y citations.


TODO: how matches are established and a short note on overlap with COCI DOI.


\section{System Design}

The constraints for the system design are informed by the volume and the
variety of the data. In total, the raw inputs amount to a few TB of textual
content, mostly newline-delimited JSON. More importantly, while the number of
data fields is low, the schema is often only partially populated, with hundreds
of different combinations of available field values found in the raw reference
data. This is most likely caused by aggregators passing on reference data from
hundreds of sources, which do not necessarily agree on a common granularity
for citation data, and by artifacts of machine-learning-based structured data
extraction tools.

Each combination of fields may require a slightly different processing path.
For example, references with an arXiv identifier can be processed differently
from references with only a title. Over 50\% of the raw reference data comes
from a set of eight field manifestations, as listed in
Table~\ref{table:fields}.

\begin{table}[]
    \begin{center}
    \begin{tabular}{ll}
\toprule
        \textbf{Fields}                                & \textbf{Share} \\
\midrule
    \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ P $\cdot$ T $\cdot$  U $\cdot$  V $\cdot$ Y}    & 14\%                              \\
        \multicolumn{1}{l}{DOI}                 & 14\%                              \\
        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 5\%                               \\
        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ U $\cdot$ V $\cdot$ Y}    & 4\%                               \\
        \multicolumn{1}{l}{PMID $\cdot$ U}              & 4\%                               \\
        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ T $\cdot$ V $\cdot$ Y}    & 4\%                               \\
        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ Y}            & 4\%                               \\
        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ V $\cdot$ Y}      & 4\%                               \\
\bottomrule
    \end{tabular}
    \vspace*{2mm}
    \caption{Top 8 combinations of available fields in raw reference data
        accounting for about 53\% of the total data (CN = container name, CRN =
contrib raw name, P = pages, T = title, U = unstructured, V = volume, IS =
issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any value.}
    \label{table:fields}
\end{center}
\end{table}
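
Shares of field combinations such as those in the table can be computed with a
single pass over the raw references. The following Python sketch assumes one
flat JSON object per line and treats empty values as missing; the function
names are hypothetical and not taken from the released code.

{\footnotesize
\begin{verbatim}
import json
from collections import Counter

# Values treated as "field not available".
EMPTY = (None, "", [], {})


def field_signature(ref):
    """Sorted names of the non-empty fields."""
    return tuple(sorted(
        k for k, v in ref.items()
        if v not in EMPTY))


def signature_shares(lines):
    """Share per combination, largest first."""
    counts = Counter(
        field_signature(json.loads(line))
        for line in lines if line.strip())
    total = sum(counts.values())
    return [(sig, n / total)
            for sig, n in counts.most_common()]
\end{verbatim}
}

Sorting the field names makes the signature independent of the key order in
the input, so two references with the same available fields always fall into
the same bucket.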

Overall, a map-reduce style approach is followed, which allows for some
uniformity in the overall processing. We extract (key, document) tuples (as
TSV) from the raw JSON data and sort by key. We then group documents sharing
the same key and apply a function to each group in order to generate our
target schema (currently named biblioref, or bref for short) or to perform
additional operations (such as deduplication).
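
The extract, sort, group and apply steps can be sketched in a few lines of
Python. This is an illustration under simplifying assumptions, not the
released implementation: it sorts in memory instead of sorting TSV files on
disk, and the document fields (\texttt{kind}, \texttt{id}, \texttt{source},
\texttt{doi}) as well as all function names are hypothetical.

{\footnotesize
\begin{verbatim}
import json
from itertools import groupby
from operator import itemgetter

first = itemgetter(0)


def extract(lines, key_func):
    """Map step: yield (key, doc) pairs."""
    for line in lines:
        doc = json.loads(line)
        key = key_func(doc)
        if key:  # skip docs without a key
            yield key, doc


def reduce_groups(pairs, group_func):
    """Sort by key, then reduce each group."""
    ordered = sorted(pairs, key=first)
    for key, grp in groupby(ordered, key=first):
        docs = [doc for _, doc in grp]
        yield from group_func(key, docs)


def doi_key(doc):
    """Exact key: the lowercased DOI, if any."""
    return (doc.get("doi") or "").strip().lower()


def link_by_doi(key, docs):
    """Link each reference to each release."""
    rels = [d for d in docs
            if d["kind"] == "release"]
    refs = [d for d in docs
            if d["kind"] == "ref"]
    for ref in refs:
        for rel in rels:
            yield {
                "source": ref["source"],
                "target": rel["id"],
                "status": "exact",
                "reason": "doi",
            }
\end{verbatim}
}

Calling \texttt{reduce\_groups(extract(lines, doi\_key), link\_by\_doi)} yields
one edge per reference and release sharing a DOI. Swapping the key function or
the group function changes the processing path without touching the overall
structure.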

The key derivation can be exact (using an identifier such as DOI or PMID) or
based on a normalization procedure, such as a slugified title string. For
identifier-based matches we can generate the target biblioref schema directly.
For fuzzy matching candidates, we pass possible match pairs through a
verification procedure, which is implemented for release entity schema pairs.
The current verification procedure is a domain-dependent, rule-based
verification, able to identify different versions of a publication,
preprint-published pairs, or other kinds of similar documents by calculating
similarity metrics across title and authors. The fuzzy matching approach is
applied to all reference documents that have only a title but no identifier.
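
A much simplified version of this two-step idea is sketched below in Python: a
slugified title serves as the candidate key, and a Jaccard similarity over
author name tokens decides the verification. The threshold of 0.5, the
document layout and the \texttt{different} and \texttt{ambiguous} labels are
assumptions made for this sketch; the actual rule set is considerably larger.

{\footnotesize
\begin{verbatim}
import re
import unicodedata


def slugify(title):
    """Lowercase, drop accents and non-alnum."""
    text = unicodedata.normalize("NFKD", title)
    text = text.encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "", text.lower())


def author_tokens(names):
    """Set of lowercase name tokens."""
    tokens = set()
    for name in names:
        tokens.update(
            re.findall(r"[a-z]+", name.lower()))
    return tokens


def jaccard(a, b):
    """Size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def verify(a, b, threshold=0.5):
    """Return (status, reason) for a pair."""
    if slugify(a["title"]) != slugify(b["title"]):
        return ("different", "title")
    ta = author_tokens(a["authors"])
    tb = author_tokens(b["authors"])
    if jaccard(ta, tb) >= threshold:
        return ("strong", "jaccardauthors")
    return ("ambiguous", "authors")
\end{verbatim}
}

Working on name tokens rather than full strings makes the comparison tolerant
of ordering differences such as ``Yann LeCun'' versus ``LeCun, Yann''.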

With a few schema conversions, fuzzy matching can be applied to Wikipedia
articles and Open Library (edition) records as well. Precision and recall are
addressed by the two stages: we are generous in the match
candidate generation phase in order to improve recall, but we are strict during
verification, in order to control precision.


\section{Fuzzy Matching Approach}
\section{Quality Assurance}


\section{Future Work}

As with other datasets in this field, we expect this dataset to be iterated upon.

\begin{itemize}
    \item The fatcat catalog updates its metadata
        continuously\footnote{A changelog can currently be followed here:
        \url{fatcat.wiki/changelog}} and web crawls are conducted regularly.
        Current processing pipelines cover raw reference snapshot creation and
        derivation of the graph structure.

    \item Metadata extraction from PDFs depends on machine learning
        models, which in turn depend on training sets. With additional crawls and
        metadata available we hope to improve models used for metadata
        extraction, reducing data extraction artifacts in the process.

    \item As of this version, a significant number of raw reference
        docs remain unmatched, which means that neither exact nor fuzzy matching
        can recover a link to a known entity. On the one
        hand, this can hint at missing metadata. On the other hand, parts of
        the data will contain a reference to a catalogued entity, but in a
        specific, dense and harder to recover form.
    \end{itemize}

\section{Acknowledgements}

This work is partially supported by a grant from the \emph{Andrew W. Mellon
Foundation}. We would like to thank various teams at the Internet Archive for
providing necessary infrastructure and data processing expertise. We are
also indebted to various open source software tools and their maintainers, as
well as open scholarly data projects; without them this work would be much
harder or not possible at all.


\section{Appendix A}
\begin{table}[]
    \footnotesize
    \begin{center}
\begin{tabular}{@{}rlll@{}}
\toprule
\textbf{Count} & \textbf{Provenance} & \textbf{Status} & \textbf{Reason} \\ \midrule
934932865                  & crossref                  & exact                 & doi                   \\
151366108                  & fatcat-datacite           & exact                 & doi                   \\
65345275                   & fatcat-pubmed             & exact                 & pmid                  \\
48778607                   & fuzzy                     & strong                & jaccardauthors        \\
42465250                   & grobid                    & exact                 & doi                   \\
29197902                   & fatcat-pubmed             & exact                 & doi                   \\
19996327                   & fatcat-crossref           & exact                 & doi                   \\
11996694                   & fuzzy                     & strong                & slugtitleauthormatch  \\
9157498                    & fuzzy                     & strong                & tokenizedauthors      \\
3547594                    & grobid                    & exact                 & arxiv                 \\
2310025                    & fuzzy                     & exact                 & titleauthormatch      \\
1496515                    & grobid                    & exact                 & pmid                  \\
680722                     & crossref                  & strong                & jaccardauthors        \\
476331                     & fuzzy                     & strong                & versioneddoi          \\
449271                     & grobid                    & exact                 & isbn                  \\
230645                     & fatcat-crossref           & strong                & jaccardauthors        \\
190578                     & grobid                    & strong                & jaccardauthors        \\
156657                     & crossref                  & exact                 & isbn                  \\
123681                     & fatcat-pubmed             & strong                & jaccardauthors        \\
79328                      & crossref                  & exact                 & arxiv                 \\
57414                      & crossref                  & strong                & tokenizedauthors      \\
53480                      & fuzzy                     & strong                & pmiddoipair           \\
52453                      & fuzzy                     & strong                & dataciterelatedid     \\
47119                      & grobid                    & strong                & slugtitleauthormatch  \\
36774                      & fuzzy                     & strong                & arxivversion          \\
35311                      & fuzzy                     & strong                & customieeearxiv       \\
33863                      & grobid                    & exact                 & pmcid                 \\
23504                      & crossref                  & strong                & slugtitleauthormatch  \\
22753                      & fatcat-crossref           & strong                & tokenizedauthors      \\
17720                      & grobid                    & exact                 & titleauthormatch      \\
14656                      & crossref                  & exact                 & titleauthormatch      \\
14438                      & grobid                    & strong                & tokenizedauthors      \\
7682                       & fatcat-crossref           & exact                 & arxiv                 \\
5972                       & fatcat-crossref           & exact                 & isbn                  \\
5525                       & fatcat-pubmed             & exact                 & arxiv                 \\
4290                       & fatcat-pubmed             & strong                & tokenizedauthors      \\
2745                       & fatcat-pubmed             & exact                 & isbn                  \\
2342                       & fatcat-pubmed             & strong                & slugtitleauthormatch  \\
2273                       & fatcat-crossref           & strong                & slugtitleauthormatch  \\
1960                       & fuzzy                     & exact                 & workid                \\
1150                       & fatcat-crossref           & exact                 & titleauthormatch      \\
1041                       & fatcat-pubmed             & exact                 & titleauthormatch      \\
895                        & fuzzy                     & strong                & figshareversion       \\
317                        & fuzzy                     & strong                & titleartifact         \\
82                         & grobid                    & strong                & titleartifact         \\
33                         & crossref                  & strong                & titleartifact         \\
5                          & fuzzy                     & strong                & custombsiundated      \\
1                          & fuzzy                     & strong                & custombsisubdoc       \\
1                          & fatcat                    & exact                 & doi                   \\ \bottomrule
\end{tabular}
    \vspace*{2mm}
	\caption{Table of match counts, reference provenance, match status and
match reason. The match reason identifiers encode specific rules in the domain
dependent verification process and are included for completeness; we do not
include the details of each rule in this report.}
    \label{table:matches}
\end{center}
\end{table}

\bibliographystyle{abbrv}
\bibliography{refs}
\end{document}