Diffstat (limited to 'docs/TR-20210808100000-IA-WDS-REFCAT/main.tex')
 docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 83 ++++++++++++-----------
 1 file changed, 45 insertions(+), 38 deletions(-)
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index b020a47..c950f61 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -39,19 +39,25 @@
\begin{abstract}
- As part of its scholarly data efforts, the Internet Archive releases a first version of a citation
- graph dataset, named \emph{refcat}, derived from scholarly publications and
- additional data sources. It is composed of data gathered by the fatcat
- cataloging project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}}, related web-scale
- crawls targeting primary and secondary scholarly outputs, as well as metadata
- from the Open Library\footnote{\href{https://openlibrary.org}{https://openlibrary.org}} project and
- Wikipedia\footnote{\href{https://wikipedia.org}{https://wikipedia.org}}. This first version of the
- graph consists of over 1.3B citations. We release this dataset under a CC0
- Public Domain Dedication, accessible through an archive
+ As part of its scholarly data efforts, the Internet Archive releases a
+ first version of a citation graph dataset, named \emph{refcat}, derived
+ from scholarly publications and additional data sources. It is composed of
+ data gathered by the fatcat cataloging
+ project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}}, related
+ web-scale crawls targeting primary and secondary scholarly outputs, as well
+ as metadata from the Open
+ Library\footnote{\href{https://openlibrary.org}{https://openlibrary.org}}
+ project and
+ Wikipedia\footnote{\href{https://wikipedia.org}{https://wikipedia.org}}.
+ This first version of the graph consists of over 1.3B citations. We release
+ this dataset under a CC0 Public Domain Dedication, accessible through an
+ archive
item\footnote{\href{https://archive.org/details/refcat\_2021-07-28}{https://archive.org/details/refcat\_2021-07-28}}.
The source code used for the derivation process, including exact and fuzzy
citation matching, is released under an MIT
license\footnote{\href{https://gitlab.com/internetarchive/refcat}{https://gitlab.com/internetarchive/refcat}}.
+ The goal of this report is to describe briefly the current contents and the
+ derivation of the dataset.
\end{abstract}
\keywords{Citation Graph, Web Archiving}
@@ -59,13 +65,11 @@
\section{Introduction}
The Internet Archive releases a first version of a citation graph dataset
-derived from a raw corpus of about 2.5B references gathered from metadata and
-data obtained by PDF extraction and annotation tools such as
+derived from a raw corpus of about 2.5B raw references gathered from metadata
+and data obtained by PDF extraction and annotation tools such as
GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with
-metadata from Open Library and Wikipedia.
-The goal of this report is to describe briefly the current contents and the
-derivation of the dataset. We expect
-this dataset to be iterated upon, with changes both in content and processing.
+metadata from Open Library and Wikipedia. We expect this dataset to be
+iterated upon, with changes both in content and processing.
According to~\citet{jinha_2010}, over 50M scholarly articles have been published
(since 1726) up to 2009, with the rate of publications on the
@@ -86,14 +90,14 @@ Initiative for Open Citations\footnote{\href{https://i4oc.org}{https://i4oc.org}
In 2021, over one billion citations are publicly available, marking a ``tipping point''
for this category of data~\citep{hutchins2021tipping}.
-While a paper will often cite other papers, more citable entities exist such
-as books or web links and within links a variety of targets, such as web
-pages, reference entries, protocols or datasets. References can be extracted
-manually or through more automated methods, by accessing relevant metadata or
-structured data extraction from full text documents. Automated methods offer the
-benefits of scalability. The completeness of bibliographic metadata ranges from
-documents with one or more persistant identifiers to raw, potentially unclean
-strings partially describing a scholarly artifact.
+While a paper will often cite other papers, more citable entities exist, such
+as books or web links, and within links a variety of targets, such as web
+pages, reference entries, protocols or datasets. References can be extracted
+manually or through more automated methods, by accessing relevant metadata or
+by structured data extraction from full-text documents. Automated methods
+offer the benefit of scalability. The completeness of bibliographic metadata
+in references ranges from documents with one or more persistent identifiers
+to raw, potentially unclean strings partially describing a scholarly artifact.
\section{Related Work}
@@ -163,7 +167,7 @@ with \emph{PaperReferences} being one relation among many others.
\section{Dataset}
-We release the first version of the \emph{refcat} dataset in an format used
+We release the first version of the \emph{refcat} dataset in a format used
internally for storage and to serve queries (and which we call \emph{biblioref}
or \emph{bref} for short). The dataset includes metadata from fatcat, the Open
Library project and inbound links from the English Wikipedia. The dataset is
@@ -257,13 +261,16 @@ web links\footnote{The cleaning process is necessary because OCR artifacts and
reference corpus, of which 4,827,688 have been preserved as of August 2021 with an HTTP 200
status code in the Wayback
Machine\footnote{\href{https://archive.org/web/}{https://archive.org/web/}} of
-the Internet Archive. From a sample\footnote{In a sample of 8000 links we find
- only 6138 responding with a HTTP 200, whereas the rest of the links yields a
- variety of http status codes, like 404, 403, 500 and others.} we observe, that
-about 23\% of the links in the reference corpus preserved at the Internet
-Archive are not accessible on the world wide web currently\footnote{We used the \href{https://github.com/miku/clinker}{https://github.com/miku/clinker} command line link checking tool.} - making targeted
-web crawling and preservation of scholarly references a key activity for
-maintaining citation integrity.
+the Internet Archive.
+
+In a random sample of 8000 links we find only 6138 responding with HTTP 200
+OK, whereas the remaining links yield a variety of HTTP status codes, such as
+404, 403 and 500. As a result, about 23\% of the links in the reference corpus
+preserved at the Internet Archive are currently inaccessible on the
+web\footnote{We used the
+  \href{https://github.com/miku/clinker}{https://github.com/miku/clinker}
+  command line link checking tool.}, making targeted web crawling and
+preservation of scholarly references a key activity for maintaining citation
+integrity.
% unpigz -c fatcat-refs-urllist-2021-06-17_lookup-20210714045637.tsv.gz| LC_ALL=C grep -F ')/' | grep -c -E "\W200\W"
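The sample arithmetic above (6138 of 8000 links answering with HTTP 200 OK) can be sketched as a short Go program. The function name is ours for illustration and is not part of the refcat tooling; the actual check was performed with the clinker link checker.

```go
package main

import "fmt"

// inaccessibleShare returns the percentage of sampled links that did not
// respond with HTTP 200 OK.
func inaccessibleShare(total, ok int) float64 {
	return 100.0 * float64(total-ok) / float64(total)
}

func main() {
	// Numbers from the random sample described in the text:
	// 8000 links checked, 6138 responded with HTTP 200 OK.
	fmt.Printf("inaccessible: %.1f%%\n", inaccessibleShare(8000, 6138))
	// prints "inaccessible: 23.3%"
}
```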
@@ -331,7 +338,7 @@ Heritrix\footnote{\href{https://github.com/internetarchive/heritrix3}{https://gi
homepages, repositories, dataset providers, aggregators, web archives and other
venues. A processing pipeline merges catalog data from the primary database and
cached data from the key-value store and generates the set of about 2.5B
-references documents, which currently serve as an input for the citation graph
+references records, which currently serve as an input for the citation graph
derivation pipeline.
\subsection{Methodology}
@@ -340,7 +347,7 @@ Overall, a map-reduce style~\citep{dean2010mapreduce} approach is
followed\footnote{While the operations are similar, the processing is not
distributed but runs on a single machine. For space efficiency, zstd~\citep{collet2018zstandard} is used to compress raw data and derivations.}, which allows
for some
-uniformity in the overall processing. We extract (key, document) tuples (as
+uniformity in the processing. We extract \emph{(key, document)} tuples (as
TSV) from the raw JSON data and sort by key. We then group documents with the
same key and apply a function on each group in order to generate
our target schema or perform
@@ -357,8 +364,8 @@ are similar by various metrics calculated over title and author fields. The fuzz
approach is applied on all reference documents without identifier (a title is
currently required).
-We currently implement performance sensitive parts in
-Go\footnote{\href{https://golang.org/}{https://golang.org/}}, with various processing stages (e.g.
+We currently implement performance-sensitive parts in the
+Go programming language\footnote{\href{https://golang.org/}{https://golang.org/}}, with various processing stages (e.g.
conversion, map, reduce, ...) represented by separate command line tools. A
thin task orchestration layer using the luigi
framework\footnote{\href{https://github.com/spotify/luigi}{https://github.com/spotify/luigi}~\citep{bernhardsson2018rouhani},
@@ -367,7 +374,7 @@ framework\footnote{\href{https://github.com/spotify/luigi}{https://github.com/sp
and others.} allows for experimentation in the pipeline and for single command
derivations, as data dependencies are encoded with the help of the
orchestrator. Within the tasks, we also utilize classic platform tools such as
-\emph{sort}~\citep{mcilroy1971research}.
+GNU \emph{sort}~\citep{mcilroy1971research}.
During a last processing step, we fuse reference matches and unmatched items
into a single, indexable file. This step includes deduplication of different
indexed into a search index and serves both matched and unmatched references
for the web application, allowing for further collection of feedback on match
quality and possible improvements.
-With a few schema conversions, fuzzy matching can be applied to Wikipedia
+With a few schema conversions, fuzzy matching has been applied to Wikipedia
articles and Open Library (edition) records as well. The aspects of precision
and recall are represented by the two stages: we are generous in the match
candidate generation phase in order to improve recall, but we are strict during
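The two-stage recall/precision split can be sketched in Go. The normalized-title key and the author-overlap check below are our illustrative assumptions, not the actual metrics computed over title and author fields in the refcat pipeline.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// candidateKey is the generous first stage: lower-case the title and strip
// everything but letters and digits, so near-identical titles collide and
// become match candidates (favoring recall).
func candidateKey(title string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(title) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// verify is the strict second stage (favoring precision): candidates only
// become a match when their keys agree and at least one author name overlaps.
func verify(titleA, titleB string, authorsA, authorsB []string) bool {
	if candidateKey(titleA) != candidateKey(titleB) {
		return false
	}
	seen := map[string]bool{}
	for _, a := range authorsA {
		seen[strings.ToLower(a)] = true
	}
	for _, b := range authorsB {
		if seen[strings.ToLower(b)] {
			return true
		}
	}
	return false
}

func main() {
	ok := verify("A Research UNIX Reader", "A research UNIX reader!",
		[]string{"McIlroy"}, []string{"mcilroy"})
	fmt.Println(ok) // prints "true"
}
```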
@@ -419,7 +426,7 @@ As with other datasets in this field we expect this dataset to be iterated upon.
\section{Acknowledgements}
-This work is partially supported by a grant from the \emph{Andrew W. Mellon
+This work is partially supported by a grant (1910-07256) from the \emph{Andrew W. Mellon
Foundation}.