author    Martin Czygan <martin.czygan@gmail.com>  2021-09-07 13:50:38 +0200
committer Martin Czygan <martin.czygan@gmail.com>  2021-09-07 13:50:38 +0200
commit    e8eb087c34c33b3532c2886ce49564fef9fa30fa (patch)
tree      5c27dda24092ef0dac0ee0f343b9ee64359190a4 /docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
parent    171678d962a49d5ae05e586702d09ccac3f08525 (diff)
doc: tweak tr
Diffstat (limited to 'docs/TR-20210808100000-IA-WDS-REFCAT/main.tex')
-rw-r--r--  docs/TR-20210808100000-IA-WDS-REFCAT/main.tex  78
1 file changed, 63 insertions, 15 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index 2a60a77..e2f59a0 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -79,8 +79,8 @@ projects like the Science Citation Index (1955)~\citep{garfield2007evolution}
were first devised, living on in existing commercial knowledge bases today.
Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
- the first version of which contained 6,325,178 individual
-references~\citep{shotton2013publishing}. Other notable early projects
-include CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
+references~\citep{shotton2013publishing}. Other notable projects
+include CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
decade has seen the emergence of more openly available, large scale
citation projects like Microsoft Academic~\citep{sinha2015overview} or the
Initiative for Open Citations\footnote{\url{https://i4oc.org}}~\citep{shotton2018funders}.
@@ -94,22 +94,22 @@ manually or through more automated methods, such as metadata access and
structured data extraction from full text documents; the latter offering the
benefits of scalability. The completeness of bibliographic metadata ranges from
documents with one or more persistent identifiers to raw, potentially unclean
-strings partially describing a publication.
+strings partially describing a scholarly artifact.
\section{Related Work}
-Typical problems arising in the process of compiling a citation graph dataset
-are data aquisition and citation matching. Data acquisition itself can take
-different forms: bibliographic metadata can contain explicit reference data as
-provided by publishers and aggregators; this data can be relatively consistent
-when looked at per source, but may vary in style and comprehensiveness when
-looked at as a whole. Another way of acquiring bibliographic metadata is to
-analyze a source document, such as a PDF (or its text), directly. Tools in this
-category are often based on conditial random
+Two typical problems that arise in the process of compiling a citation graph
+dataset are related to data acquisition and citation matching. Data acquisition
+itself can take different forms: bibliographic metadata can contain explicit
+reference data as provided by publishers and aggregators; this data can be
+relatively consistent when looked at per source, but may vary in style and
+comprehensiveness when looked at as a whole. Another way of acquiring
+bibliographic metadata is to analyze a source document, such as a PDF (or its
+text), directly. Tools in this category are often based on conditional random
fields~\citep{lafferty2001conditional} and have been implemented in projects
such as ParsCit~\citep{councill2008parscit},
-Cermine~\citep{tkaczyk2014cermine}, EXCITE~\citep{hosseini2019excite}
-or GROBID~\citep{lopez2009grobid}.
+Cermine~\citep{tkaczyk2014cermine}, EXCITE~\citep{hosseini2019excite} or
+GROBID~\citep{lopez2009grobid}.
The problem of citation matching is relatively simple when common, persistent
identifiers are present in the data. Complications mount when there is
@@ -123,7 +123,20 @@ citation matching process is done at scale~\citep{fedoryszak2013large}. The
problem of heterogeneity has been discussed in the context of datasets
by~\citep{mathiak2015challenges}.
+Among the projects centered around citations or containing citation data as a
+core component is COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI
+citations'', which was first released on
+2018-07-29\footnote{\url{https://opencitations.net/download}} and has been
+updated regularly since~\citep{peroni2020opencitations}.
+The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+``a Wikimedia initiative to develop open citations and linked bibliographic
+data to serve free knowledge'', continuously adds citations to its
+database\footnote{\url{http://wikicite.org/statistics.html}}.
+
+Microsoft Academic Graph~\citep{sinha2015overview} comprises a number of
+entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
+with \emph{PaperReferences} being one relation among many others.
% There are a few large scale citation dataset available today. COCI, the
@@ -215,6 +228,8 @@ seen in~Table~\ref{table:cocicmp}.
\section{System Design}
+\subsection{Constraints}
+
The constraints for the system design are informed by the volume and the
variety of the data. The capability to run the whole graph derivation on a
single machine was a minor goal as well. In total, the raw inputs amount to a
@@ -256,6 +271,26 @@ Table~\ref{table:fields}.
\end{center}
\end{table}
+\subsection{Data Sources}
+
+Reference data comes from two main sources: explicit bibliographic metadata and
+PDF extraction. The bibliographic metadata is taken from fatcat, which itself
+harvests and imports web-accessible sources such as Crossref, Pubmed, Arxiv,
+Datacite, DOAJ, dblp and others into its catalog (as the source permits, data
+is processed continuously or in batches). Reference data from PDF documents has
+been extracted with GROBID\footnote{GROBID v0.5.5}, with the TEI-XML results
+being cached locally in a key-value store accessible with an S3 API. Archived
+PDF documents result from dedicated web-scale crawls of scholarly domains
+conducted with
+Heritrix\footnote{\url{https://github.com/internetarchive/heritrix3}} and a
+variety of seed lists targeting journal homepages, repositories, dataset
+providers, aggregators, web archives and other venues. A processing pipeline
+merges catalog data from the primary database and cached values in key-value
+stores and generates the set of about 2.5B reference documents, which
+currently serve as an input for the citation graph derivation pipeline.
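+
+As a minimal illustration (not part of the actual pipeline code), a cached
+TEI-XML result could be retrieved from such an S3-compatible key-value store
+as follows; the endpoint, bucket and key names are hypothetical placeholders:
+
+\begin{verbatim}
+import boto3  # any S3-compatible client works here
+
+# hypothetical endpoint, bucket and key; placeholders only
+s3 = boto3.client("s3", endpoint_url="http://localhost:9000")
+obj = s3.get_object(Bucket="grobid-tei", Key="example-sha1.tei.xml")
+tei_xml = obj["Body"].read().decode("utf-8")
+\end{verbatim}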
+
+\subsection{Methodology}
+
Overall, a map-reduce style~\citep{dean2010mapreduce} approach is
followed\footnote{While the operations are similar, the processing is not
distributed but runs on a single machine. For space efficiency, zstd~\citep{collet2018zstandard} is used to compress raw data and derivations.}, which allows
@@ -264,7 +299,7 @@ uniformity in the overall processing. We extract (key, document) tuples (as
TSV) from the raw JSON data and sort by key. We then group documents with the
same key and apply a function on each group in order to generate
our target schema or perform
-additional operations such as deduplication or fusion of matched and unmatched references.
+additional operations such as deduplication or fusion of matched and unmatched references for indexing.
The key derivation can be exact (via an identifier like DOI, PMID, etc) or
based on a value normalization, like ``slugifying'' a title string. For identifier
@@ -277,6 +312,18 @@ are similar by various metrics calculated over title and author fields. The fuzz
approach is applied on all reference documents without identifier (a title is
currently required).
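+
+For illustration only, the ``slugifying'' key derivation and the
+sort/group/apply scheme described above can be sketched as follows; the key
+function and the output fields are simplified assumptions, and the actual
+implementation uses separate Go command line tools rather than Python:
+
+\begin{verbatim}
+import itertools, json, re, sys
+
+def slug_key(title):
+    # normalize a title into a grouping key; applied in the map
+    # stage that emits (key, document) tuples
+    return re.sub(r"[^a-z0-9]+", "", title.lower())
+
+# stdin: key<TAB>json-document lines, already sorted by key (GNU sort)
+rows = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
+for key, group in itertools.groupby(rows, key=lambda kv: kv[0]):
+    docs = [json.loads(doc) for _, doc in group]
+    # per-group function, e.g. deduplication or fusion of matched
+    # and unmatched references
+    print(json.dumps({"key": key, "size": len(docs)}))
+\end{verbatim}
+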
+We currently implement performance sensitive parts in
+Go\footnote{\url{https://golang.org/}}, with various processing stages (e.g.
+conversion, map, reduce, ...) represented by separate command line tools. A
+thin task orchestration layer using the luigi
+framework\footnote{\url{https://github.com/spotify/luigi}~\citep{bernhardsson2018rouhani},
+ which has been used in various scientific pipeline
+ applications, like~\citep{schulz2016use},~\citep{erdmann2017design},~\citep{lampa2019scipipe},~\citep{czygan2014design}
+ and others.} allows for experimentation in the pipeline and for single command
+derivations, as data dependencies are encoded with the help of the
+orchestrator. Within the tasks, we also utilize classic platform tools such as
+sort~\citep{mcilroy1971research}.
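+
+A minimal, hypothetical sketch of how such a data dependency can be encoded
+with luigi is shown below; the task names, file targets and trivial
+\texttt{run} bodies are placeholders and not the actual pipeline code:
+
+\begin{verbatim}
+import luigi
+
+class ExtractKeys(luigi.Task):
+    # placeholder stage; the real pipeline shells out to a Go tool
+    def output(self):
+        return luigi.LocalTarget("refs_keys.tsv")
+
+    def run(self):
+        with self.output().open("w") as handle:
+            handle.write("examplekey\t{}\n")
+
+class GroupByKey(luigi.Task):
+    # the dependency on ExtractKeys is declared via requires()
+    def requires(self):
+        return ExtractKeys()
+
+    def output(self):
+        return luigi.LocalTarget("refs_grouped.json")
+
+    def run(self):
+        with self.input().open() as src, self.output().open("w") as dst:
+            dst.write(src.read())  # placeholder for the group/reduce step
+
+if __name__ == "__main__":
+    # a single command derivation: luigi runs required tasks first
+    luigi.build([GroupByKey()], local_scheduler=True)
+\end{verbatim}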
+
With a few schema conversions, fuzzy matching can be applied to Wikipedia
articles and Open Library (edition) records as well. The aspects of precision
and recall are represented by the two stages: we are generous in the match
@@ -288,6 +335,7 @@ their expected or desired match status\footnote{The list can be found under:
It is helpful to keep this test suite independent of any specific programming language.}.
+
\section{Limitations and Future Work}
As with other datasets in this field, we expect this dataset to be iterated upon.
@@ -295,7 +343,7 @@ As other dataset in this field we expect this dataset to be iterated upon.
\begin{itemize}
\item The fatcat catalog updates its metadata
continuously\footnote{A changelog can currently be followed here:
- \url{https://fatcat.wiki/changelog}} and web crawls are conducted
+ \url{https://fatcat.wiki/changelog}.} and web crawls are conducted
regularly. Current processing pipelines cover raw reference snapshot
creation and derivation of the graph structure, which allows processing to be
rerun based on updated data as it becomes available.