Diffstat (limited to 'docs/TR-20210808100000-IA-WDS-REFCAT/main.tex')
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 62
1 files changed, 33 insertions, 29 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index 1b834f9..35d73b1 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -65,7 +65,7 @@
\section{Introduction}
The Internet Archive releases a first version of a citation graph dataset
-derived from a raw corpus of about 2.5B raw references gathered from metadata
+derived from a corpus of about 2.5B raw references gathered from metadata
and data obtained by PDF extraction and annotation tools such as
GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with
metadata from Open Library and Wikipedia. We expect this dataset to be
@@ -79,16 +79,18 @@ accessible on the web~\citep{khabsa_giles_2014}.
Modern citation indexes can be traced back to the early computing age, when
projects like the Science Citation Index (1955)~\citep{garfield2007evolution}
-were first devised, living on in commercial knowledge bases today.
-Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
-- the first version of which contained 6,325,178 individual
-references~\citep{shotton2013publishing}. Other notable projects
-include CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\href{https://citec.repec.org}{https://citec.repec.org}}. The last
-decade has seen the emergence of more openly available, large scale
+were first devised, living on in commercial knowledge bases today. Open
+alternatives were started such as the Open Citations Corpus (OCC) in 2010 - the
+first version of which contained 6,325,178 individual
+references~\citep{shotton2013publishing}. Other notable projects include
+CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and
+CitEc\footnote{\href{https://citec.repec.org}{https://citec.repec.org}}. The
+last decade has seen the emergence of more openly available, large scale
citation projects like Microsoft Academic~\citep{sinha2015overview} and the
-Initiative for Open Citations\footnote{\href{https://i4oc.org}{https://i4oc.org}}~\citep{shotton2018funders}.
-In 2021, over one billion citations are publicly available, marking a ``tipping point''
-for this category of data~\citep{hutchins2021tipping}.
+Initiative for Open
+Citations\footnote{\href{https://i4oc.org}{https://i4oc.org}}~\citep{shotton2018funders}.
+In 2021, over one billion citations are publicly available, marking a ``tipping
+point'' for this category of data~\citep{hutchins2021tipping}.
While a paper will often cite other papers, more citable entities exist such as
books or web links and within links a variety of targets, such as web pages,
@@ -101,17 +103,17 @@ unclean strings partially describing a scholarly artifact.
\section{Related Work}
-Two typical problems in citation graph development are related to data acquisition and citation matching. Data acquisition
-itself can take different forms: bibliographic metadata can contain explicit
-reference data as provided by publishers and aggregators; this data can be
-relatively consistent when looked at per source, but may vary in style and
-comprehensiveness when looked at as a whole. Another way of acquiring
-bibliographic metadata is to analyze a source document, such as a PDF (or its
-text), directly. Tools in this category are often based on conditional random
-fields~\citep{lafferty2001conditional} and have been implemented in projects
-such as ParsCit~\citep{councill2008parscit},
-Cermine~\citep{tkaczyk2014cermine}, EXCITE~\citep{hosseini2019excite} or
-GROBID~\citep{lopez2009grobid}.
+Two typical problems in citation graph development are related to data
+acquisition and citation matching. Data acquisition itself can take different
+forms: bibliographic metadata can contain explicit reference data as provided
+by publishers and aggregators; this data can be relatively consistent when
+looked at per source, but may vary in style and comprehensiveness when looked
+at as a whole. Another way of acquiring bibliographic metadata is to analyze a
+source document, such as a PDF (or its text), directly. Tools in this category
+are often based on conditional random fields~\citep{lafferty2001conditional}
+and have been implemented in projects such as
+ParsCit~\citep{councill2008parscit}, Cermine~\citep{tkaczyk2014cermine},
+EXCITE~\citep{hosseini2019excite} or GROBID~\citep{lopez2009grobid}.
The problem of citation matching is relatively simple when common, persistent
identifiers are present in the data. Complications mount when there is
@@ -125,14 +127,16 @@ citation matching process is done at scale~\citep{fedoryszak2013large}. The
problem of heterogeneity has been discussed in the context of datasets
by~\citet{mathiak2015challenges}.
-Projects centered around citations or containing citation data as a core
-component are COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI
-citations'', which was first released
-2018-07-29\footnote{\href{https://opencitations.net/download}{https://opencitations.net/download}} and has been
-regularly updated since~\citep{peroni2020opencitations}. The WikiCite\footnote{\href{https://meta.wikimedia.org/wiki/WikiCite}{https://meta.wikimedia.org/wiki/WikiCite}} project,
-``a Wikimedia initiative to develop open citations and linked bibliographic
-data to serve free knowledge'' continuously adds citations to its
-database\footnote{\href{http://wikicite.org/statistics.html}{http://wikicite.org/statistics.html}}. Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
+Projects and datasets centered around citations or containing citation data as
+a core component are COCI, the ``OpenCitations Index of Crossref open
+DOI-to-DOI citations'', which was first released
+2018-07-29\footnote{\href{https://opencitations.net/download}{https://opencitations.net/download}}
+and has been regularly updated since~\citep{peroni2020opencitations}. The
+WikiCite\footnote{\href{https://meta.wikimedia.org/wiki/WikiCite}{https://meta.wikimedia.org/wiki/WikiCite}}
+project, ``a Wikimedia initiative to develop open citations and linked
+bibliographic data to serve free knowledge'' continuously adds citations to its
+database\footnote{\href{http://wikicite.org/statistics.html}{http://wikicite.org/statistics.html}}.
+Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
entities\footnote{\href{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
with \emph{PaperReferences} being one relation among many others.