Diffstat (limited to 'docs/TR-20210808100000-IA-WDS-REFCAT')
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | bin 140069 -> 140144 bytes
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 49
2 files changed, 25 insertions, 24 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index 830f25f..be9bda0 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index 0543612..7ac8e46 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -17,7 +17,7 @@
\begin{document}
-\title{Refcat: The Fatcat Citation Graph}
+\title{Refcat: The Internet Archive Scholar Citation Graph}
\author{Martin Czygan \\
\\
@@ -39,20 +39,20 @@
\begin{abstract}
- As part of its scholarly data efforts, the Internet Archive releases a
+ As part of its scholarly data efforts, the Internet Archive (IA) releases a
first version of a citation graph dataset, named \emph{refcat}, derived
from scholarly publications and additional data sources. It is composed of
data gathered by the fatcat cataloging
- project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}}, related
+ project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}} (the catalog that underpins IA Scholar), related
web-scale crawls targeting primary and secondary scholarly outputs, as well
as metadata from the Open
Library\footnote{\href{https://openlibrary.org}{https://openlibrary.org}}
project and
Wikipedia\footnote{\href{https://wikipedia.org}{https://wikipedia.org}}.
This first version of the graph consists of over 1.3B citations. We release
- this dataset under a CC0 Public Domain Dedication, accessible through an
- archive
- item\footnote{\href{https://archive.org/details/refcat\_2021-07-28}{https://archive.org/details/refcat\_2021-07-28}}.
+ this dataset under a CC0 Public Domain Dedication, accessible through
+ Internet
+ Archive\footnote{\href{https://archive.org/details/refcat\_2021-07-28}{https://archive.org/details/refcat\_2021-07-28}}.
The source code used for the derivation process, including exact and fuzzy
citation matching, is released under an MIT
license\footnote{\href{https://gitlab.com/internetarchive/refcat}{https://gitlab.com/internetarchive/refcat}}.
@@ -64,7 +64,7 @@
\section{Introduction}
-The Internet Archive releases a first version of a citation graph dataset
+The Internet Archive released a first version of a citation graph dataset
derived from a corpus of about 2.5B raw references gathered from metadata
and data obtained by PDF extraction and annotation tools such as
GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with
@@ -131,10 +131,10 @@ Projects and datasets centered around citations or containing citation data as
a core component are COCI, the ``OpenCitations Index of Crossref open
DOI-to-DOI citations'', which was first released
2018-07-29\footnote{\href{https://opencitations.net/download}{https://opencitations.net/download}}
-and has been regularly updated since~\citep{peroni2020opencitations}. The
+and has been regularly updated~\citep{peroni2020opencitations}. The
WikiCite\footnote{\href{https://meta.wikimedia.org/wiki/WikiCite}{https://meta.wikimedia.org/wiki/WikiCite}}
project, ``a Wikimedia initiative to develop open citations and linked
-bibliographic data to serve free knowledge'' continously adds citations to its
+bibliographic data to serve free knowledge'' continuously adds citations to its
database\footnote{\href{http://wikicite.org/statistics.html}{http://wikicite.org/statistics.html}}.
Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
entities\footnote{\href{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
@@ -171,15 +171,16 @@ with \emph{PaperReferences} being one relation among many others.
\section{Dataset}
-We release the first version of the \emph{refcat} dataset in a format used
+We released the first version of the \emph{refcat} dataset in a format used
internally for storage and to serve queries (and which we call \emph{biblioref}
-or \emph{bref} for short). The dataset includes metadata from fatcat, the Open
-Library project and inbound links from the English Wikipedia. The dataset is
-integrated into the \href{https://fatcat.wiki}{fatcat.wiki website} and allows users
-to explore inbound and outbound
+or \emph{bref} for short). The dataset includes metadata from fatcat (the
+catalog underpinning IA Scholar), the Open Library project and inbound links
+from the English Wikipedia. The dataset is integrated into the
+\href{https://fatcat.wiki}{fatcat.wiki website} and allows users to explore
+inbound and outbound
references\footnote{\href{https://guide.fatcat.wiki/reference\_graph.html}{https://guide.fatcat.wiki/reference\_graph.html}}.
-The format records source and target (fatcat release and work) identifiers, a
+The format records source and target identifiers, a
few metadata attributes (such as year or release stage) as well as
information about the match status and provenance.
@@ -196,16 +197,16 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
\begin{table}[]
\begin{center}
- \begin{tabular}{ll}
+ \begin{tabular}{lll}
\toprule
- \bf{Set} & \bf{Count} \\
+ \bf{Set} & & \bf{Count} \\
\midrule
- COCIv11 (C) & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
- \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
- C $\cap$ R & 1,046,438,515 \\
- C $\setminus$ R & 140,520,382 \\ % 86,854,309 \\
- R $\setminus$ C & 256,985,697 \\ % xxx 295,884,246
+ COCIv11 (C) & & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
+ \emph{refcat-doi} (R) & & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
+ C $\cap$ R & overlap & 1,046,438,515 \\
+ C $\setminus$ R & COCIv11 only & 140,520,382 \\ % 86,854,309 \\
+ R $\setminus$ C & refcat-doi only & 256,985,697 \\ % xxx 295,884,246
\end{tabular}
\vspace*{2mm}
\caption{Comparison between Open Citations COCI corpus (v11,
@@ -251,7 +252,7 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
\end{table}
We started to include non-traditional citations into the graph, such as links
-to books as recorded by the Open Library project and links from the English
+to books included in Open Library and links from the English
Wikipedia to scholarly works. For links between Open Library we employ both
identifier based and fuzzy matching; for Wikipedia references we used a published dataset~\citep{harshdeep_singh_2020_3940692} and we are contributing
to upstream projects related to wikipedia citation extraction, such as
@@ -329,7 +330,7 @@ Reference data comes from two main sources: explicit bibliographic metadata and
PDF extraction. The bibliographic metadata is taken from fatcat, which itself
harvests and imports web accessible sources such as Crossref, Pubmed, Arxiv,
Datacite, DOAJ, dblp and others into its catalog (as the source permits, data
-is processed continously or in batches). Reference data from PDF documents has
+is processed continuously or in batches). Reference data from PDF documents has
been extracted with GROBID\footnote{GROBID
\href{https://github.com/kermitt2/grobid/releases/tag/0.5.5}{v0.5.5}}, with the
TEI-XML results being cached locally in a key-value store accessible with an S3