aboutsummaryrefslogtreecommitdiffstats
path: root/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
diff options
context:
space:
mode:
Diffstat (limited to 'docs/TR-20210808100000-IA-WDS-REFCAT/main.tex')
-rw-r--r--docs/TR-20210808100000-IA-WDS-REFCAT/main.tex86
1 files changed, 57 insertions, 29 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index ab72699..2a60a77 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -87,9 +87,9 @@ Initiative for Open Citations\footnote{\url{https://i4oc.org}}~\citep{shotton201
In 2021, over one billion citations are publicly available, marking a ``tipping point''
for this category of data~\citep{hutchins2021tipping}.
-While a paper will mainly cite other papers, more citable entities exist such
-as books and web links and within links a variety of targets, such as web
-sites, reference entries, protocols or datasets. References can be extracted
+While a paper will often cite other papers, more citable entities exist such
+as books or web links and within links a variety of targets, such as web
+pages, reference entries, protocols or datasets. References can be extracted
manually or through more automated methods, such as metadata access and
structured data extraction from full text documents; the latter offering the
benefits of scalability. The completeness of bibliographic metadata ranges from
@@ -98,32 +98,60 @@ strings partially describing a publication.
\section{Related Work}
-There are a few large scale citation dataset available today. COCI, the
-``OpenCitations Index of Crossref open DOI-to-DOI citations'' was first
-released 2018-07-29. As of its most recent release\footnote{\url{https://opencitations.net/download}}, on
-2021-07-29, it contains
-1,094,394,688 citations across 65,835,422 bibliographic
-resources~\citep{peroni2020opencitations}.
-
-The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
-``a Wikimedia initiative to develop open citations and linked bibliographic
-data to serve free knowledge'' continously adds citations to its database and
-as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
-publications\footnote{\url{http://wikicite.org/statistics.html}}.
-
-Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
-entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
-with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at
- \url{https://archive.org/details/mag-2021-06-07}} the
-\emph{PaperReferences} relation contains 1,832,226,781 rows (edges) across 123,923,466
-bibliographic entities.
-
-Numerous other projects have been or are concerned with various aspects of
-citation discovery and curation as part their feature set, among them Semantic
-Scholar~\citep{fricke2018semantic}, CiteSeerX~\citep{li2006citeseerx} or Aminer~\citep{tang2016aminer}.
-
-As mentioned in~\citep{hutchins2021tipping}, the number of openly available
-citations is not expected to shrink in the future.
+Typical problems arising in the process of compiling a citation graph dataset
+are data aquisition and citation matching. Data acquisition itself can take
+different forms: bibliographic metadata can contain explicit reference data as
+provided by publishers and aggregators; this data can be relatively consistent
+when looked at per source, but may vary in style and comprehensiveness when
+looked at as a whole. Another way of acquiring bibliographic metadata is to
+analyze a source document, such as a PDF (or its text), directly. Tools in this
+category are often based on conditial random
+fields~\citep{lafferty2001conditional} and have been implemented in projects
+such as ParsCit~\citep{councill2008parscit},
+Cermine~\citep{tkaczyk2014cermine}, EXCITE~\citep{hosseini2019excite}
+or GROBID~\citep{lopez2009grobid}.
+
+The problem of citation matching is relatively simple when common, persistent
+identifiers are present in the data. Complications mount, when there is
+\emph{Identity Uncertainty}, that is ``objects are not labeled with unique
+identifiers or when those identifiers may not be perceived
+perfectly''~\citep{pasula2003identity}. CiteSeer has been an early project
+concerned with citation matching~\citep{giles1998citeseer}. A taxonomy of
+potential issues common in the matching process has been compiled
+by~\citep{olensky2016evaluation}. Additional care is required, when the
+citation matching process is done at scale~\citep{fedoryszak2013large}. The
+problem of heterogenity has been discussed in the context of datasets
+by~\citep{mathiak2015challenges}.
+
+
+
+
+% There are a few large scale citation dataset available today. COCI, the
+% ``OpenCitations Index of Crossref open DOI-to-DOI citations'' was first
+% released 2018-07-29. As of its most recent release\footnote{\url{https://opencitations.net/download}}, on
+% 2021-07-29, it contains
+% 1,094,394,688 citations across 65,835,422 bibliographic
+% resources~\citep{peroni2020opencitations}.
+%
+% The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+% ``a Wikimedia initiative to develop open citations and linked bibliographic
+% data to serve free knowledge'' continously adds citations to its database and
+% as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
+% publications\footnote{\url{http://wikicite.org/statistics.html}}.
+%
+% Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
+% entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
+% with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at
+% \url{https://archive.org/details/mag-2021-06-07}} the
+% \emph{PaperReferences} relation contains 1,832,226,781 rows (edges) across 123,923,466
+% bibliographic entities.
+%
+% Numerous other projects have been or are concerned with various aspects of
+% citation discovery and curation as part their feature set, among them Semantic
+% Scholar~\citep{fricke2018semantic}, CiteSeerX~\citep{li2006citeseerx} or Aminer~\citep{tang2016aminer}.
+%
+% As mentioned in~\citep{hutchins2021tipping}, the number of openly available
+% citations is not expected to shrink in the future.
\section{Dataset}