author    Martin Czygan <martin.czygan@gmail.com>  2021-09-08 00:25:18 +0200
committer Martin Czygan <martin.czygan@gmail.com>  2021-09-08 00:25:18 +0200
commit    517c19160d5f01a326da88174e55f700b83ceb87 (patch)
tree      d23566d23ee587d69d523c86a2c6091a6f7ab5ca /docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
parent    9a75e6d549d36b68e7f58c9c1494a6d89071bf90 (diff)
doc: tweaks
Diffstat (limited to 'docs/TR-20210808100000-IA-WDS-REFCAT/main.tex')
 docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 59 ++++++++++++-----------------
 1 file changed, 26 insertions(+), 33 deletions(-)
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index e99ddc3..e0fbc69 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -11,7 +11,6 @@
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{caption}
-
\usepackage{datetime}
\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
\setlength{\parindent}{0pt}
@@ -105,7 +104,7 @@ reference data as provided by publishers and aggregators; this data can be
relatively consistent when looked at per source, but may vary in style and
comprehensiveness when looked at as a whole. Another way of acquiring
bibliographic metadata is to analyze a source document, such as a PDF (or its
-text), directly. Tools in this category are often based on conditial random
+text), directly. Tools in this category are often based on conditional random
fields~\citep{lafferty2001conditional} and have been implemented in projects
such as ParsCit~\citep{councill2008parscit},
Cermine~\citep{tkaczyk2014cermine}, EXCITE~\citep{hosseini2019excite} or
@@ -127,7 +126,7 @@ Projects centered around citations or containing citation data as a core
component are COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI
citations'', which was first released
on 2018-07-29\footnote{\url{https://opencitations.net/download}} and has been
-regularly updated~\citep{peroni2020opencitations}. The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+regularly updated since then~\citep{peroni2020opencitations}. The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
``a Wikimedia initiative to develop open citations and linked bibliographic
data to serve free knowledge'', continuously adds citations to its
database\footnote{\url{http://wikicite.org/statistics.html}}. Microsoft Academic Graph~\citep{sinha2015overview} comprises a number of
@@ -167,17 +166,11 @@ with \emph{PaperReferences} being one relation among many others.
We release the first version of the \emph{refcat} dataset in a format used
internally for storage and to serve queries (and which we call \emph{biblioref}
-or \emph{bref} for short). The dataset includes metadata from fatcat, the
-Open Library project and inbound links from the English Wikipedia. The fatcat
-project itself aggregates data from variety of open data sources, such as
-Crossref\footnote{\url{https://crossref.org}}, PubMed\footnote{\url{https://pubmed.ncbi.nlm.nih.gov/}},
-DataCite\footnote{\url{https://datacite.org}}, Directory of Open Access Jourals (DOAJ)\footnote{\url{https://doaj.org}}, dblp~\citep{ley2002dblp} and others,
-as well as metadata generated from analysis of data preserved at the Internet
-Archive and active crawls of publication sites on the web.
-
-The dataset is
+or \emph{bref} for short). The dataset includes metadata from fatcat, the Open
+Library project and inbound links from the English Wikipedia. The dataset is
integrated into the \href{https://fatcat.wiki}{fatcat website} and allows users
-to explore inbound and outbound references\footnote{\url{https://guide.fatcat.wiki/reference_graph.html}}.
+to explore inbound and outbound
+references\footnote{\url{https://guide.fatcat.wiki/reference_graph.html}}.
The format records source and target (fatcat release and work) identifiers, a
few attributes from the metadata (such as year or release stage) as well as
@@ -186,13 +179,11 @@ information about the match status and provenance.
The dataset currently contains 1,323,423,672 citations across 76,327,662
entities (55,123,635 unique source and 60,244,206 unique target work
identifiers; for 1,303,424,212, or 98.49\% of all citations, we have a DOI
-for both source and target).
-The majority of matches - 1,250,523,321 - are established through identifier
-based matching (DOI, PMIC, PMCID, ARXIV, ISBN). 72,900,351 citations are
-established through fuzzy matching techniques.
-
-The majority of citations between \emph{refcat} and COCI overlap, as can be
-seen in~Table~\ref{table:cocicmp}.
+for both source and target). The majority of matches (1,250,523,321) are
+established through identifier-based matching (DOI, PMID, PMCID, ARXIV, ISBN);
+72,900,351 citations are established through fuzzy matching techniques. The
+citations in COCI and \emph{refcat} largely overlap, as can be seen
+in~Table~\ref{table:cocicmp}.
\begin{table}[]
\begin{center}
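
As a reading aid, here is a minimal sketch of what a single biblioref (bref) record could look like, given the format description above: source and target identifiers, a few metadata attributes, and match status and provenance. All field names and values are assumptions for illustration; the excerpt does not spell out the actual schema.

    import json

    # Hypothetical bref record; field names are illustrative, not the refcat schema.
    bref = {
        "source_release_ident": "release-0000",  # fatcat release (placeholder value)
        "source_work_ident": "work-0000",        # fatcat work (placeholder value)
        "target_release_ident": "release-0001",
        "target_work_ident": "work-0001",
        "source_year": 2018,                     # metadata attribute
        "source_release_stage": "published",     # metadata attribute
        "match_status": "exact",                 # e.g. an identifier-based match
        "match_provenance": "crossref",          # where the raw reference came from
    }
    print(json.dumps(bref))  # records are stored as newline-delimited JSON
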
@@ -228,11 +219,11 @@ seen in~Table~\ref{table:cocicmp}.
\subsection{Constraints}
-The constraints for the systems design are informed by the volume and the
+The constraints for the system design are informed by the volume and the
variety of the data. The capability to run the whole graph derivation on a
single machine was also a minor goal. In total, the raw inputs amount to a
few terabytes of textual content, mostly newline-delimited JSON. More
-importantly, while the number of data fields is low, certain schemas are very
+importantly, while the number of data fields is low, many documents are only
sparsely populated, with hundreds of different combinations of available field values found
in the raw reference data. This is most likely caused by aggregators passing on
reference data coming from hundreds of sources, each of which not necessarily
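
The field-combination variety described in this hunk is straightforward to measure on newline-delimited JSON. A minimal Python sketch, assuming a hypothetical refs.jsonl dump and treating the set of non-empty keys as a document's field set:

    import json
    from collections import Counter

    combos = Counter()
    with open("refs.jsonl") as f:  # hypothetical raw reference dump
        for line in f:
            doc = json.loads(line)
            # The set of keys with non-empty values defines the "field set".
            fields = frozenset(k for k, v in doc.items() if v not in (None, "", [], {}))
            combos[fields] += 1

    # A handful of field sets should account for most of the data.
    for fields, n in combos.most_common(8):
        print(n, sorted(fields))
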
@@ -242,7 +233,7 @@ machine learning based structured data extraction tools.
Each combination of fields may require a slightly different processing path.
For example, references with an Arxiv identifier can be processed differently
from references with only a title. Over 50\% of the raw reference data comes
-from a set of eight field set manifestations, as listed in
+from just eight field set variants, as listed in
Table~\ref{table:fields}.
\begin{table}[]
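
Routing each field set variant to a suitable processing path, as the hunk above describes, can be sketched as a simple dispatch. Function and field names here are illustrative and not taken from the refcat codebase:

    def match_strategy(ref: dict) -> str:
        """Pick a processing path for one raw reference (illustrative only)."""
        if ref.get("arxiv_id"):
            return "exact:arxiv"   # identifier join, no fuzzy step needed
        if ref.get("doi"):
            return "exact:doi"
        if ref.get("pmid"):
            return "exact:pmid"
        if ref.get("title"):
            return "fuzzy:title"   # requires normalization and verification
        return "skip"              # too little signal to match at all

    print(match_strategy({"title": "Conditional Random Fields"}))  # -> fuzzy:title
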
@@ -276,16 +267,18 @@ PDF extraction. The bibliographic metadata is taken from fatcat, which itself
harvests and imports web-accessible sources such as Crossref, Pubmed, Arxiv,
Datacite, DOAJ, dblp and others into its catalog (as the source permits, data
is processed continuously or in batches). Reference data from PDF documents has
-been extracted with GROBID\footnote{GROBID \href{https://github.com/kermitt2/grobid/releases/tag/0.5.5}{v0.5.5}}, with the TEI-XML results
-being cached locally in a key-value store accessible with an S3 API. Archived
-PDF documents result from dedicated web-scale crawls of scholarly domains
-conducted with
-Heritrix\footnote{\url{https://github.com/internetarchive/heritrix3}} and a
-variety of seed lists targeting journal homepages, repositories, dataset
-providers, aggregators, web archives and other venues. A processing pipeline
-merges catalog data from the primary database and cached values in key-value
-stores and generates the set of about 2.5B references documents, which
-currently serve as an input for the citation graph derivation pipeline.
+been extracted with GROBID\footnote{GROBID
+ \href{https://github.com/kermitt2/grobid/releases/tag/0.5.5}{v0.5.5}}, with the
+TEI-XML results being cached locally in a key-value store accessible with an S3
+API. Archived PDF documents result from dedicated web-scale crawls of scholarly
+domains conducted with
+Heritrix\footnote{\url{https://github.com/internetarchive/heritrix3}} (and
+other crawl technologies) and a variety of seed lists targeting journal
+homepages, repositories, dataset providers, aggregators, web archives and other
+venues. A processing pipeline merges catalog data from the primary database and
+cached data from the key-value store and generates the set of about 2.5B
+reference documents, which currently serve as input for the citation graph
+derivation pipeline.
\subsection{Methodology}
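
The extraction-and-caching step described in the last hunk can be approximated with GROBID's service API (its /api/processReferences endpoint accepts a multipart PDF upload) and any S3-compatible key-value store. The endpoint URLs, bucket name and key scheme below are assumptions, not the refcat pipeline's actual configuration:

    import hashlib

    import boto3
    import requests

    def extract_and_cache(pdf_path: str, bucket: str = "grobid-tei") -> str:
        """Extract references from one PDF with GROBID and cache the TEI-XML."""
        with open(pdf_path, "rb") as f:
            pdf = f.read()
        # GROBID exposes reference extraction as a multipart POST endpoint.
        resp = requests.post(
            "http://localhost:8070/api/processReferences",  # assumed local service
            files={"input": pdf},
            timeout=300,
        )
        resp.raise_for_status()
        # Key the cache by content hash, so re-runs over the same PDF hit the cache.
        key = hashlib.sha1(pdf).hexdigest()
        s3 = boto3.client("s3", endpoint_url="http://localhost:9000")  # S3-compatible store
        s3.put_object(Bucket=bucket, Key=key, Body=resp.content)
        return key

Keying the cache by the PDF's hash rather than its URL is one way to make the pipeline idempotent across repeated crawls of the same document.
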