From 3dede8e214168ec628180f31542ac3132b8e2338 Mon Sep 17 00:00:00 2001
From: Martin Czygan <martin.czygan@gmail.com>
Date: Mon, 6 Sep 2021 14:18:37 +0200
Subject: docs: start addressing feedback from MR#4

---
 docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | Bin 95948 -> 96426 bytes
 docs/TR-20210808100000-IA-WDS-REFCAT/main.tex |  48 +++++++++++++++++---------
 docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib |  31 +++++++++++++++++
 3 files changed, 62 insertions(+), 17 deletions(-)

(limited to 'docs')

diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index 75a5449..9fafcc0 100644
Binary files a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf and b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf differ
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index a5536d8..ab72699 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -47,7 +47,7 @@
 	crawls targeting primary and secondary scholarly outputs, as well as metadata
 	from the Open Library\footnote{\url{https://openlibrary.org}} project and
 	Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
-	graph consists of 1,323,423,672 citations. We release this dataset under a CC0
+	graph consists of over 1.3B citations. We release this dataset under a CC0
 	Public Domain Dedication, accessible through an archive
 	item\footnote{\url{https://archive.org/details/refcat_2021-07-28}}.
 	The source code used for the derivation process, including exact and fuzzy
@@ -59,28 +59,42 @@
 
 \section{Introduction}
 
-
 The Internet Archive releases a first version of a citation graph dataset
 derived from a raw corpus of about 2.5B references gathered from metadata and
-data obtained by PDF extraction tools such as
-GROBID\cite{lopez2009grobid}. Additionally, we consider integration with
+data obtained by PDF extraction and annotation tools such as
+GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with
 metadata from Open Library and Wikipedia.
 The goal of this report is to describe briefly the current contents and the
 derivation of the dataset. We expect
 this dataset to be iterated upon, with changes both in content and processing.
 
+According to~\citet{jinha_2010}, over 50M scholarly articles were published
+between 1726 and 2009, with the rate of publication on the
+rise~\citep{landhuis_2016}. In 2014, a study based on academic search engines
+estimated that at least 114M English-language scholarly documents are
+accessible on the web~\citep{khabsa_giles_2014}.
+
 Modern citation indexes can be traced back to the early computing age, when
-projects like the Science Citation Index (1955)\citep{garfield2007evolution}
+projects like the Science Citation Index (1955)~\citep{garfield2007evolution}
 were first devised, living on in existing commercial knowledge bases today.
 Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
 - the first version of which contained 6,325,178 individual
-references\citep{shotton2013publishing}. Other notable early projects
-include CiteSeerX\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
+references~\citep{shotton2013publishing}. Other notable early projects
+include CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
 decade has seen the emergence of more openly available, large scale
-citation projects like Microsoft Academic\citep{sinha2015overview} or the
-Initiative for Open Citations\footnote{\url{https://i4oc.org}}\citep{shotton2018funders}.
+citation projects like Microsoft Academic~\citep{sinha2015overview} or the
+Initiative for Open Citations\footnote{\url{https://i4oc.org}}~\citep{shotton2018funders}.
 In 2021, over one billion citations are publicly available, marking a ``tipping point''
-for this category of data\citep{hutchins2021tipping}.
+for this category of data~\citep{hutchins2021tipping}.
+
+While a paper will mainly cite other papers, other citable entities exist, such
+as books and web links, and web links in turn point to a variety of targets,
+such as web sites, reference entries, protocols or datasets. References can be
+extracted manually or through more automated methods, such as metadata access
+and structured data extraction from full text documents, the latter offering
+the benefit of scalability. The completeness of bibliographic metadata ranges
+from documents with one or more persistent identifiers to raw, potentially
+unclean strings partially describing a publication.
 
 \section{Related Work}
 
@@ -89,7 +103,7 @@ There are a few large scale citation dataset available today. COCI, the
 released 2018-07-29. As of its most recent release\footnote{\url{https://opencitations.net/download}}, on
 2021-07-29, it contains
 1,094,394,688 citations across 65,835,422 bibliographic
-resources\citep{peroni2020opencitations}.
+resources~\citep{peroni2020opencitations}.
 
 The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
 ``a Wikimedia initiative to develop open citations and linked bibliographic
@@ -97,7 +111,7 @@ data to serve free knowledge'' continously adds citations to its database and
 as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
 publications\footnote{\url{http://wikicite.org/statistics.html}}.
 
-Microsoft Academic Graph\citep{sinha2015overview} is comprised of a number of
+Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
 entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
 with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at
 	\url{https://archive.org/details/mag-2021-06-07}}  the
@@ -106,9 +120,9 @@ bibliographic entities.
 
 Numerous other projects have been or are concerned with various aspects of
 citation discovery and curation as part their feature set, among them Semantic
-Scholar\citep{fricke2018semantic}, CiteSeerX\citep{li2006citeseerx} or Aminer\citep{tang2016aminer}.
+Scholar~\citep{fricke2018semantic}, CiteSeerX~\citep{li2006citeseerx} or AMiner~\citep{tang2016aminer}.
 
-As mentioned in \citep{hutchins2021tipping}, the number of openly available
+As mentioned in~\citep{hutchins2021tipping}, the number of openly available
 citations is not expected to shrink in the future.
 
 
@@ -120,7 +134,7 @@ or \emph{bref} for short). The dataset includes metadata from fatcat, the
 Open Library project and inbound links from the English Wikipedia. The fatcat
 project itself aggregates data from variety of open data sources, such as
 Crossref\footnote{\url{https://crossref.org}}, PubMed\footnote{\url{https://pubmed.ncbi.nlm.nih.gov/}},
-DataCite\footnote{\url{https://datacite.org}}, Directory of Open Access Jourals (DOAJ)\footnote{\url{https://doaj.org}}, dblp\citep{ley2002dblp} and others,
+DataCite\footnote{\url{https://datacite.org}}, Directory of Open Access Journals (DOAJ)\footnote{\url{https://doaj.org}}, dblp~\citep{ley2002dblp} and others,
 as well as metadata generated from analysis of data preserved at the Internet
 Archive and active crawls of publication sites on the web.
 
@@ -214,9 +228,9 @@ Table~\ref{table:fields}.
 	\end{center}
 \end{table}
 
-Overall, a map-reduce style\citep{dean2010mapreduce} approach is
+Overall, a map-reduce style~\citep{dean2010mapreduce} approach is
 followed\footnote{While the operations are similar, the processing is not
-	distributed but runs on a single machine. For space efficiency, zstd\citep{collet2018zstandard} is used to compress raw data and derivations.}, which allows
+	distributed but runs on a single machine. For space efficiency, zstd~\citep{collet2018zstandard} is used to compress raw data and derivations.}, which allows
 for some
 uniformity in the overall processing. We extract (key, document) tuples (as
 TSV) from the raw JSON data and sort by key. We then group documents with the
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib b/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib
index 4dd7f6a..f927ea4 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib
@@ -240,3 +240,34 @@
 	pages={99--108},
 	year={2018}
 }
+
+@article{jinha_2010,
+	title={Article 50 million: an estimate of the number of scholarly articles in existence},
+	volume={23},
+	DOI={10.1087/20100308},
+	publisher={Wiley},
+	author={Jinha},
+	year={2010},
+	month={Jul}
+}
+
+@article{landhuis_2016,
+	title={Scientific literature: Information overload},
+	volume={535},
+	DOI={10.1038/nj7612-457a},
+	number={7612},
+	publisher={Springer Nature},
+	author={Landhuis},
+	year={2016},
+	month={Jul}
+}
+
+@article{khabsa_giles_2014,
+	title={The Number of Scholarly Documents on the Public Web},
+	DOI={10.1371/journal.pone.0093949},
+	publisher={Public Library of Science (PLoS)},
+	author={Khabsa and Giles},
+	editor={Zhang},
+	year={2014},
+	month={May}
+}
-- 
cgit v1.2.3