author     Martin Czygan <martin.czygan@gmail.com>  2021-09-10 19:30:30 +0200
committer  Martin Czygan <martin.czygan@gmail.com>  2021-09-10 19:30:30 +0200
commit     5e351469cead9337293b449b946db9d3c2c49925 (patch)
tree       1c1c08f7df096416c8771f217eb28f381e5dd45a /docs/TR-20210808100000-IA-WDS-REFCAT
parent     dd939b5d8ca7e7ad5fdd2022cd63de674043c234 (diff)
docs: draft version
Diffstat (limited to 'docs/TR-20210808100000-IA-WDS-REFCAT')
-rw-r--r--  docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf  bin  105989 -> 93825 bytes
-rw-r--r--  docs/TR-20210808100000-IA-WDS-REFCAT/main.tex   67
2 files changed, 33 insertions, 34 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index a827f61..f4273e4 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index dc500dc..b278149 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -42,16 +42,16 @@
As part of its scholarly data efforts, the Internet Archive releases a first version of a citation
graph dataset, named \emph{refcat}, derived from scholarly publications and
additional data sources. It is composed of data gathered by the fatcat
- cataloging project\footnote{\url{https://fatcat.wiki}}, related web-scale
+ cataloging project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}}, related web-scale
crawls targeting primary and secondary scholarly outputs, as well as metadata
- from the Open Library\footnote{\url{https://openlibrary.org}} project and
- Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
+ from the Open Library\footnote{\href{https://openlibrary.org}{https://openlibrary.org}} project and
+ Wikipedia\footnote{\href{https://wikipedia.org}{https://wikipedia.org}}. This first version of the
graph consists of over 1.3B citations. We release this dataset under a CC0
Public Domain Dedication, accessible through an archive
- item\footnote{\url{https://archive.org/details/refcat_2021-07-28}}.
+ item\footnote{\href{https://archive.org/details/refcat\_2021-07-28}{https://archive.org/details/refcat\_2021-07-28}}.
The source code used for the derivation process, including exact and fuzzy
citation matching, is released under an MIT
- license\footnote{\url{https://gitlab.com/internetarchive/refcat}}.
+ license\footnote{\href{https://gitlab.com/internetarchive/refcat}{https://gitlab.com/internetarchive/refcat}}.
\end{abstract}
\keywords{Citation Graph, Web Archiving}
@@ -79,26 +79,25 @@ were first devised, living on in existing commercial knowledge bases today.
Open alternatives were started, such as the Open Citations Corpus (OCC) in
2010, the first version of which contained 6,325,178 individual
references~\citep{shotton2013publishing}. Other notable projects
-include CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
+include CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\href{https://citec.repec.org}{https://citec.repec.org}}. The last
decade has seen the emergence of more openly available, large scale
citation projects like Microsoft Academic~\citep{sinha2015overview} and the
-Initiative for Open Citations\footnote{\url{https://i4oc.org}}~\citep{shotton2018funders}.
+Initiative for Open Citations\footnote{\href{https://i4oc.org}{https://i4oc.org}}~\citep{shotton2018funders}.
In 2021, over one billion citations are publicly available, marking a ``tipping point''
for this category of data~\citep{hutchins2021tipping}.
While a paper will often cite other papers, many other citable entities exist,
such as books or web links; links in turn point to a variety of targets, such as
web pages, reference entries, protocols or datasets. References can be extracted
-manually or through more automated methods, such as metadata access and
-structured data extraction from full text documents; the latter offering the
+manually or through more automated methods, by accessing relevant metadata or
+by extracting structured data from full text documents. Automated methods offer the
benefits of scalability. The completeness of bibliographic metadata ranges from
documents with one or more persistent identifiers to raw, potentially unclean
strings partially describing a scholarly artifact.
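To make this spectrum concrete, the following sketch shows the two extremes as
they might appear in practice; the field names are illustrative only, not the
actual refcat schema.

\begin{verbatim}
# One end: a reference carrying a persistent identifier.
identified_ref = {
    "doi": "10.1234/example",  # hypothetical identifier
    "title": "An Example Article",
    "year": 2019,
}

# Other end: nothing but a raw, possibly OCR-damaged string that
# only partially describes the cited artifact.
raw_ref = {
    "unstructured": "Doe J., An exmple article, Some Journal, 2019",
}
\end{verbatim}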
\section{Related Work}
-Two typical problems which arise in the process of compiling a citation graph
-dataset are related to data aquisition and citation matching. Data acquisition
+Two typical problems in citation graph development are data acquisition and citation matching. Data acquisition
itself can take different forms: bibliographic metadata can contain explicit
reference data as provided by publishers and aggregators; this data can be
relatively consistent when looked at per source, but may vary in style and
@@ -125,12 +124,12 @@ by~\citep{mathiak2015challenges}.
Projects centered around citations or containing citation data as a core
component include COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI
citations'', which was first released
-2018-07-29\footnote{\url{https://opencitations.net/download}} and has been
-regularly updated since~\citep{peroni2020opencitations}. The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+2018-07-29\footnote{\href{https://opencitations.net/download}{https://opencitations.net/download}} and has been
+regularly updated since~\citep{peroni2020opencitations}. The WikiCite\footnote{\href{https://meta.wikimedia.org/wiki/WikiCite}{https://meta.wikimedia.org/wiki/WikiCite}} project,
``a Wikimedia initiative to develop open citations and linked bibliographic
data to serve free knowledge'' continuously adds citations to its
-database\footnote{\url{http://wikicite.org/statistics.html}}. Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
-entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
+database\footnote{\href{http://wikicite.org/statistics.html}{http://wikicite.org/statistics.html}}. Microsoft Academic Graph~\citep{sinha2015overview} comprises a number of
+entities\footnote{\href{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
with \emph{PaperReferences} being one relation among many others.
@@ -170,7 +169,7 @@ or \emph{bref} for short). The dataset includes metadata from fatcat, the Open
Library project and inbound links from the English Wikipedia. The dataset is
integrated into the \href{https://fatcat.wiki}{fatcat.wiki website} and allows users
to explore inbound and outbound
-references\footnote{\url{https://guide.fatcat.wiki/reference_graph.html}}.
+references\footnote{\href{https://guide.fatcat.wiki/reference\_graph.html}{https://guide.fatcat.wiki/reference\_graph.html}}.
The format records source and target (fatcat release and work) identifiers, a
few metadata attributes (such as year or release stage) as well as
@@ -182,7 +181,7 @@ identifiers; for 1,303,424,212 - or 98.49\% of all citations - we do have a DOI
for both source and target). The majority of matches - 1,250,523,321 - is
established through identifier based matching (DOI, PMID, PMCID, ARXIV, ISBN).
72,900,351 citations are established through fuzzy matching techniques.
-Citations from the Open Citations COCI corpus\footnote{Reference dataset COCI
+Citations from the OpenCitations COCI corpus\footnote{Reference dataset COCI
v11, released 2021-09-04,
\href{http://opencitations.net/index/coci}{http://opencitations.net/index/coci}}
and \emph{refcat} overlap for the most part, as can be seen in~Table~\ref{table:cocicmp}.
@@ -236,7 +235,9 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
source-wikipedia & 1,386,941 \\
\end{tabular}
\vspace*{2mm}
- \caption{Output structure, e.g. edges between documents that both have a doi (doi-doi).}
+ \caption{Counts of classic DOI-to-DOI references, outbound references
+ matched against Open Library, and inbound references from the English
+ Wikipedia.}
\label{table:structure}
\end{center}
\end{table}
@@ -244,25 +245,24 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
We have started to include non-traditional citations in the graph, such as links
to books as recorded by the Open Library project and links from the English
Wikipedia to scholarly works. For links to Open Library we employ both
-identifier based and fuzzy matching; for Wikipedia references we used an
-existing dataset~\citep{harshdeep_singh_2020_3940692} and we are contributing
+identifier based and fuzzy matching; for Wikipedia references we used a published dataset~\citep{harshdeep_singh_2020_3940692} and we are contributing
to upstream projects related to Wikipedia citation extraction, such as
\emph{wikiciteparser}\footnote{\href{https://github.com/dissemin/wikiciteparser}{https://github.com/dissemin/wikiciteparser}}
-to generate updates to the dataset. Table~\ref{table:structure} lists the
+to generate updates from recent Wikipedia dumps\footnote{Wikipedia dumps are available on a monthly basis from \href{https://dumps.wikimedia.org/}{https://dumps.wikimedia.org/}.}. Table~\ref{table:structure} lists the
counts for these links. Additionally, we are examining web links appearing in
references: after an initial cleaning procedure we currently find 25,405,592
web links\footnote{The cleaning process is necessary because OCR artifacts and
other metadata issues exist in the data. Unfortunately, even after cleaning, not
all links will be in the form originally intended by the authors.} in the
-reference corpus, of which 4,827,688 have been preserved with an HTTP 200
+reference corpus, of which 4,827,688 have been preserved as of August 2021 with an HTTP 200
status code in the Wayback
Machine\footnote{\href{https://archive.org/web/}{https://archive.org/web/}} of
the Internet Archive. From a sample\footnote{In a sample of 8000 links we find
only 6138 responding with an HTTP 200, whereas the rest of the links yield a
variety of HTTP status codes, like 404, 403, 500 and others.} we observe that
-about 23\% of the links in the reference corpus links preserved at the Internet
+about 23\% of the links in the reference corpus preserved at the Internet
Archive are currently not accessible on the world wide web\footnote{We used the \href{https://github.com/miku/clinker}{https://github.com/miku/clinker} command line link checking tool.} - making targeted
-web crawling and preservation of scholarly references an activity for
+web crawling and preservation of scholarly references a key activity for
maintaining citation integrity.
% unpigz -c fatcat-refs-urllist-2021-06-17_lookup-20210714045637.tsv.gz| LC_ALL=C grep -F ')/' | grep -c -E "\W200\W"
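Whether a single link is preserved can be checked against the Wayback Machine
CDX API. The following is a minimal sketch of such a lookup; the parameters
follow the public CDX server documentation, while the counting setup actually
used for the numbers above is more involved.

\begin{verbatim}
import requests

def preserved(url: str) -> bool:
    """Return True if the Wayback Machine holds at least one
    capture of `url` with HTTP status 200."""
    r = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={"url": url, "output": "json",
                "filter": "statuscode:200", "limit": "1"},
        timeout=30,
    )
    r.raise_for_status()
    # An empty result may come back as an empty body.
    rows = r.json() if r.text.strip() else []
    return len(rows) > 1  # row 0 is the field header

print(preserved("https://example.com/"))
\end{verbatim}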
@@ -324,7 +324,7 @@ been extracted with GROBID\footnote{GROBID
TEI-XML results being cached locally in a key-value store accessible with an S3
API. Archived PDF documents result from dedicated web-scale crawls of scholarly
domains conducted with
-Heritrix\footnote{\url{https://github.com/internetarchive/heritrix3}} (and
+Heritrix\footnote{\href{https://github.com/internetarchive/heritrix3}{https://github.com/internetarchive/heritrix3}} (and
other crawl technologies) and a variety of seed lists targeting journal
homepages, repositories, dataset providers, aggregators, web archives and other
venues. A processing pipeline merges catalog data from the primary database and
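The cached TEI-XML mentioned above can be retrieved with any S3-compatible
client. A minimal sketch follows, in which the endpoint, bucket name and
SHA1-based key layout are assumptions for illustration, not the actual
configuration.

\begin{verbatim}
import boto3

# Endpoint, bucket and key layout are assumed, not the real setup.
s3 = boto3.client("s3", endpoint_url="http://localhost:9000")

def fetch_tei(sha1hex: str) -> bytes:
    """Fetch the cached GROBID TEI-XML for a PDF, keyed by its SHA1."""
    obj = s3.get_object(Bucket="grobid", Key=f"{sha1hex}.tei.xml")
    return obj["Body"].read()
\end{verbatim}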
@@ -348,7 +348,7 @@ The key derivation can be exact (via an identifier like DOI, PMID, etc) or
based on a value normalization, like ``slugifying'' a title string. For identifier
based matches we can generate the target schema directly. For fuzzy matching
candidates, we pass possible match pairs through a verification procedure,
-which is implemented for \emph{release entity}\footnote{\url{https://guide.fatcat.wiki/entity_release.html}.} pairs. This procedure is a
+which is implemented for \emph{release entity}\footnote{\href{https://guide.fatcat.wiki/entity\_release.html}{https://guide.fatcat.wiki/entity\_release.html}.} pairs. This procedure is a
domain dependent rule based verification, able to identify different versions
of a publication, preprint-published pairs and documents which are
similar by various metrics calculated over title and author fields. The fuzzy matching
@@ -356,10 +356,10 @@ approach is applied on all reference documents without identifier (a title is
currently required).
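As a rough illustration of the two stages, the following sketch derives a
grouping key by ``slugifying'' a title and applies a few toy rules to a
candidate pair; the actual verification procedure implements many more rules,
and the labels and thresholds here are illustrative only.

\begin{verbatim}
import re
import unicodedata

def slugify_title(title: str) -> str:
    """Derive a grouping key: strip accents, lowercase,
    drop punctuation, collapse whitespace."""
    t = unicodedata.normalize("NFKD", title)
    t = t.encode("ascii", "ignore").decode("ascii").lower()
    return " ".join(re.sub(r"[^a-z0-9]+", " ", t).split())

def verify(a: dict, b: dict) -> str:
    """Toy rule based check for a candidate pair sharing a key."""
    if a.get("doi") and a.get("doi") == b.get("doi"):
        return "exact"
    # Works published far apart are unlikely to be versions of
    # the same publication (illustrative threshold).
    if a.get("year") and b.get("year") and abs(a["year"] - b["year"]) > 2:
        return "different"
    # Require a shared author last name when authors are present.
    last = lambda name: name.split()[-1].lower()
    aa = {last(n) for n in a.get("authors", [])}
    bb = {last(n) for n in b.get("authors", [])}
    if aa and bb and not (aa & bb):
        return "different"
    return "strong"
\end{verbatim}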
We currently implement performance sensitive parts in
-Go\footnote{\url{https://golang.org/}}, with various processing stages (e.g.
+Go\footnote{\href{https://golang.org/}{https://golang.org/}}, with various processing stages (e.g.
conversion, map, reduce, ...) represented by separate command line tools. A
thin task orchestration layer using the luigi
-framework\footnote{\url{https://github.com/spotify/luigi}~\citep{bernhardsson2018rouhani},
+framework\footnote{\href{https://github.com/spotify/luigi}{https://github.com/spotify/luigi}~\citep{bernhardsson2018rouhani},
which has been used in various scientific pipeline
applications, like~\citep{schulz2016use},~\citep{erdmann2017design},~\citep{lampa2019scipipe},~\citep{czygan2014design}
and others.} allows for experimentation in the pipeline and for single command
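A thin orchestration layer of this kind can be sketched as follows; the task
names, parameters and targets are hypothetical stand-ins for the actual
pipeline stages.

\begin{verbatim}
import luigi

class ReferenceSnapshot(luigi.Task):
    """Hypothetical stage: normalize a raw reference dump."""
    date = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"refs-{self.date}.ndjson")

    def run(self):
        with self.output().open("w") as f:
            f.write("")  # stand-in for the actual conversion tool

class CitationGraph(luigi.Task):
    """Hypothetical stage: derive graph edges from the snapshot."""
    date = luigi.Parameter()

    def requires(self):
        return ReferenceSnapshot(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"graph-{self.date}.ndjson")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read())  # stand-in for the derivation step

if __name__ == "__main__":
    # Runs the dependency graph; single tasks can also be invoked
    # from the command line, as described in the text.
    luigi.build([CitationGraph(date="2021-07-28")], local_scheduler=True)
\end{verbatim}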
@@ -374,7 +374,7 @@ candidate generation phase in order to improve recall, but we are strict during
verification, in order to control precision. Quality assurance for verification is
implemented through a growing list of test cases of real examples from the catalog and
their expected or desired match status\footnote{The list can be found under:
- \url{https://gitlab.com/internetarchive/refcat/-/blob/master/skate/testdata/verify.csv}.
+ \href{https://gitlab.com/internetarchive/refcat/-/blob/master/skate/testdata/verify.csv}{https://gitlab.com/internetarchive/refcat/-/blob/master/skate/testdata/verify.csv}.
It is helpful to keep this test suite independent of any specific programming language.}.
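Because the suite is a plain CSV file, it can be run against any
implementation. A sketch follows, in which the column names are assumptions
about the file layout rather than the actual header.

\begin{verbatim}
import csv

def run_cases(path: str, verify_fn):
    """Collect test cases where `verify_fn` disagrees with the
    expected match status; column names are assumed."""
    failures = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            got = verify_fn(row["a"], row["b"])
            if got != row["expected"]:
                failures.append((row, got))
    return failures
\end{verbatim}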
@@ -386,8 +386,8 @@ As with other datasets in this field, we expect this dataset to be iterated upon.
\begin{itemize}
\item The fatcat catalog updates its metadata
continuously\footnote{A changelog can currently be followed here:
- \url{https://fatcat.wiki/changelog}.} and web crawls are conducted
- regularly. Current processing pipelines cover raw reference snapshot
+ \href{https://fatcat.wiki/changelog}{https://fatcat.wiki/changelog}.} and web crawls are conducted
+ regularly. Current processing pipelines cover raw reference snapshot
creation and derivation of the graph structure, which allows us to rerun
processing based on updated data as it becomes available.
@@ -399,11 +399,10 @@ As other dataset in this field we expect this dataset to be iterated upon.
\item As of this version, a number of raw reference
documents remain unmatched, which means that neither exact nor fuzzy matching
- has detected a link to a known entity. On the one
- hand, this can hint at missing metadata. However, parts of the data
+ has detected a link to a known entity. In some cases metadata may simply be missing; however, parts of the data
will contain a reference to a catalogued entity, but in a specific,
dense and harder to recover form.
- This also include improvements to the fuzzy matching approach.
+
\item The reference dataset contains millions of URLs and their integration
into the graph has been implemented as a prototype. A full implementation
requires a few data cleanup and normalization steps.