docs: draft version

author: Martin Czygan <martin.czygan@gmail.com> 2021-09-10 19:30:30 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-09-10 19:30:30 +0200
commit: 5e351469cead9337293b449b946db9d3c2c49925 (patch)
tree: 1c1c08f7df096416c8771f217eb28f381e5dd45a /docs
parent: dd939b5d8ca7e7ad5fdd2022cd63de674043c234 (diff)
2 files changed, 33 insertions, 34 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index a827f61..f4273e4 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index dc500dc..b278149 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -42,16 +42,16 @@
 	As part of its scholarly data efforts, the Internet Archive releases a first version of a citation
 	graph dataset, named \emph{refcat}, derived from scholarly publications and
 	additional data sources. It is composed of data gathered by the fatcat
-	cataloging project\footnote{\url{https://fatcat.wiki}}, related web-scale
+	cataloging project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}}, related web-scale
 	crawls targeting primary and secondary scholarly outputs, as well as metadata
-	from the Open Library\footnote{\url{https://openlibrary.org}} project and
-	Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
+	from the Open Library\footnote{\href{https://openlibrary.org}{https://openlibrary.org}} project and
+	Wikipedia\footnote{\href{https://wikipedia.org}{https://wikipedia.org}}. This first version of the
 	graph consists of over 1.3B citations. We release this dataset under a CC0
 	Public Domain Dedication, accessible through an archive
-	item\footnote{\url{https://archive.org/details/refcat_2021-07-28}}.
+	item\footnote{\href{https://archive.org/details/refcat\_2021-07-28}{https://archive.org/details/refcat\_2021-07-28}}.
 	The source code used for the derivation process, including exact and fuzzy
 	citation matching, is released under an MIT
-	license\footnote{\url{https://gitlab.com/internetarchive/refcat}}.
+	license\footnote{\href{https://gitlab.com/internetarchive/refcat}{https://gitlab.com/internetarchive/refcat}}.
 \end{abstract}
 
 \keywords{Citation Graph, Web Archiving}
@@ -79,26 +79,25 @@ were first devised, living on in existing commercial knowledge bases today.
 Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
 - the first version of which contained 6,325,178 individual
 references~\citep{shotton2013publishing}. Other notable projects
-include CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
+include CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\href{https://citec.repec.org}{https://citec.repec.org}}. The last
 decade has seen the emergence of more openly available, large scale
 citation projects like Microsoft Academic~\citep{sinha2015overview} and the
-Initiative for Open Citations\footnote{\url{https://i4oc.org}}~\citep{shotton2018funders}.
+Initiative for Open Citations\footnote{\href{https://i4oc.org}{https://i4oc.org}}~\citep{shotton2018funders}.
 In 2021, over one billion citations are publicly available, marking a ``tipping point''
 for this category of data~\citep{hutchins2021tipping}.
 
 While a paper will often cite other papers, more citable entities exist such
 as books or web links and within links a variety of targets, such as web
 pages, reference entries, protocols or datasets. References can be extracted
-manually or through more automated methods, such as metadata access and
-structured data extraction from full text documents; the latter offering the
+manually or through more automated methods, by accessing relevant metadata or
+extracting structured data from full text documents. Automated methods offer the
 benefits of scalability. The completeness of bibliographic metadata ranges from
 documents with one or more persistent identifiers to raw, potentially unclean
 strings partially describing a scholarly artifact.
 
 \section{Related Work}
 
-Two typical problems which arise in the process of compiling a citation graph
-dataset are related to data aquisition and citation matching. Data acquisition
+Two typical problems in citation graph development are related to data acquisition and citation matching. Data acquisition
 itself can take different forms: bibliographic metadata can contain explicit
 reference data as provided by publishers and aggregators; this data can be
 relatively consistent when looked at per source, but may vary in style and
@@ -125,12 +124,12 @@ by~\citep{mathiak2015challenges}.
 Projects centered around citations or containing citation data as a core
 component are COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI
 citations'', which was first released
-2018-07-29\footnote{\url{https://opencitations.net/download}} and has been
-regularly updated since~\citep{peroni2020opencitations}. The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+2018-07-29\footnote{\href{https://opencitations.net/download}{https://opencitations.net/download}} and has been
+regularly updated since~\citep{peroni2020opencitations}. The WikiCite\footnote{\href{https://meta.wikimedia.org/wiki/WikiCite}{https://meta.wikimedia.org/wiki/WikiCite}} project,
 ``a Wikimedia initiative to develop open citations and linked bibliographic
 data to serve free knowledge'' continuously adds citations to its
-database\footnote{\url{http://wikicite.org/statistics.html}}. Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
-entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
+database\footnote{\href{http://wikicite.org/statistics.html}{http://wikicite.org/statistics.html}}. Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
+entities\footnote{\href{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
 with \emph{PaperReferences} being one relation among many others.
 
 
@@ -170,7 +169,7 @@ or \emph{bref} for short). The dataset includes metadata from fatcat, the Open
 Library project and inbound links from the English Wikipedia.  The dataset is
 integrated into the \href{https://fatcat.wiki}{fatcat.wiki website} and allows users
 to explore inbound and outbound
-references\footnote{\url{https://guide.fatcat.wiki/reference_graph.html}}.
+references\footnote{\href{https://guide.fatcat.wiki/reference\_graph.html}{https://guide.fatcat.wiki/reference\_graph.html}}.
 
 The format records source and target (fatcat release and work) identifiers, a
 few metadata attributes (such as year or release stage) as well as
@@ -182,7 +181,7 @@ identifiers; for 1,303,424,212 - or 98.49\% of all citations - we do have a DOI
 for both source and target).  The majority of matches - 1,250,523,321 - is
 established through identifier based matching (DOI, PMID, PMCID, ARXIV, ISBN).
 72,900,351 citations are established through fuzzy matching techniques.
-Citations from the Open Citations COCI corpus\footnote{Reference dataset COCI
+Citations from the Open Citations' COCI corpus\footnote{Reference dataset COCI
 	v11, released 2021-09-04,
 	\href{http://opencitations.net/index/coci}{http://opencitations.net/index/coci}}
 and \emph{refcat} overlap for the most part, as can be seen in~Table~\ref{table:cocicmp}.
@@ -236,7 +235,9 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 			source-wikipedia    & 1,386,941     \\
 		\end{tabular}
 		\vspace*{2mm}
-		\caption{Output structure, e.g. edges between documents that both have a doi (doi-doi).}
+		\caption{Counts of classic DOI to DOI references, outbound
+			references matched against Open Library, and inbound references
+			from the English Wikipedia.}
 		\label{table:structure}
 	\end{center}
 \end{table}
@@ -244,25 +245,24 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 We started to include non-traditional citations into the graph, such as links
 to books as recorded by the Open Library project and links from the English
 Wikipedia to scholarly works. For links to Open Library we employ both
-identifier based and fuzzy matching; for Wikipedia references we used an
-existing dataset~\citep{harshdeep_singh_2020_3940692} and we are contributing
+identifier based and fuzzy matching; for Wikipedia references we used a published dataset~\citep{harshdeep_singh_2020_3940692} and we are contributing
 to upstream projects related to Wikipedia citation extraction, such as
 \emph{wikiciteparser}\footnote{\href{https://github.com/dissemin/wikiciteparser}{https://github.com/dissemin/wikiciteparser}}
-to generate updates to the dataset. Table~\ref{table:structure} lists the
+to generate updates from recent Wikipedia dumps\footnote{Wikipedia dumps are available on a monthly basis from \href{https://dumps.wikimedia.org/}{https://dumps.wikimedia.org/}.}. Table~\ref{table:structure} lists the
 counts for these links. Additionally, we are examining web links appearing in
 references: after an initial cleaning procedure we currently find 25,405,592
 web links\footnote{The cleaning process is necessary because OCR artifacts and
 	other metadata issues exist in the data. Unfortunately, even after cleaning not
 	all links will be in the form as originally intended by the authors.} in the
-reference corpus, of which 4,827,688 have been preserved with an HTTP 200
+reference corpus, of which 4,827,688 have been preserved as of August 2021 with an HTTP 200
 status code in the Wayback
 Machine\footnote{\href{https://archive.org/web/}{https://archive.org/web/}} of
 the Internet Archive. From a sample\footnote{In a sample of 8000 links we find
 	only 6138 responding with an HTTP 200, whereas the rest of the links yield a
 	variety of HTTP status codes, like 404, 403, 500 and others.} we observe that
-about 23\% of the links in the reference corpus links preserved at the Internet
+about 23\% of the links in the reference corpus preserved at the Internet
 Archive are not accessible on the world wide web currently\footnote{We used the \href{https://github.com/miku/clinker}{https://github.com/miku/clinker} command line link checking tool.} - making targeted
-web crawling and preservation of scholarly references an activity for
+web crawling and preservation of scholarly references a key activity for
 maintaining citation integrity.
 
 % unpigz -c fatcat-refs-urllist-2021-06-17_lookup-20210714045637.tsv.gz| LC_ALL=C grep -F ')/' | grep -c -E "\W200\W"
@@ -324,7 +324,7 @@ been extracted with GROBID\footnote{GROBID
 TEI-XML results being cached locally in a key-value store accessible with an S3
 API. Archived PDF documents result from dedicated web-scale crawls of scholarly
 domains conducted with
-Heritrix\footnote{\url{https://github.com/internetarchive/heritrix3}} (and
+Heritrix\footnote{\href{https://github.com/internetarchive/heritrix3}{https://github.com/internetarchive/heritrix3}} (and
 other crawl technologies) and a variety of seed lists targeting journal
 homepages, repositories, dataset providers, aggregators, web archives and other
 venues. A processing pipeline merges catalog data from the primary database and
@@ -348,7 +348,7 @@ The key derivation can be exact (via an identifier like DOI, PMID, etc) or
 based on a value normalization, like ``slugifying'' a title string. For identifier
 based matches we can generate the target schema directly.  For fuzzy matching
 candidates, we pass possible match pairs through a verification procedure,
-which is implemented for \emph{release entity}\footnote{\url{https://guide.fatcat.wiki/entity_release.html}.} pairs. This procedure is a
+which is implemented for \emph{release entity}\footnote{\href{https://guide.fatcat.wiki/entity\_release.html}{https://guide.fatcat.wiki/entity\_release.html}.} pairs. This procedure is a
 domain dependent rule based verification, able to identify different versions
 of a publication, preprint-published pairs and documents which
 are similar by various metrics calculated over title and author fields. The fuzzy matching
@@ -356,10 +356,10 @@ approach is applied on all reference documents without identifier (a title is
 currently required).
 
 We currently implement performance sensitive parts in
-Go\footnote{\url{https://golang.org/}}, with various processing stages (e.g.
+Go\footnote{\href{https://golang.org/}{https://golang.org/}}, with various processing stages (e.g.
 conversion, map, reduce, ...) represented by separate command line tools. A
 thin task orchestration layer using the luigi
-framework\footnote{\url{https://github.com/spotify/luigi}~\citep{bernhardsson2018rouhani},
+framework\footnote{\href{https://github.com/spotify/luigi}{https://github.com/spotify/luigi}~\citep{bernhardsson2018rouhani},
 	which has been used in various scientific pipeline
 	applications, like~\citep{schulz2016use},~\citep{erdmann2017design},~\citep{lampa2019scipipe},~\citep{czygan2014design}
 	and others.} allows for experimentation in the pipeline and for single command
@@ -374,7 +374,7 @@ candidate generation phase in order to improve recall, but we are strict during
 verification, in order to control precision. Quality assurance for verification is
 implemented through a growing list of test cases of real examples from the catalog and
 their expected or desired match status\footnote{The list can be found under:
-	\url{https://gitlab.com/internetarchive/refcat/-/blob/master/skate/testdata/verify.csv}.
+	\href{https://gitlab.com/internetarchive/refcat/-/blob/master/skate/testdata/verify.csv}{https://gitlab.com/internetarchive/refcat/-/blob/master/skate/testdata/verify.csv}.
 	It is helpful to keep this test suite independent of any specific programming language.}.
 
 
@@ -386,8 +386,8 @@ As other dataset in this field we expect this dataset to be iterated upon.
 \begin{itemize}
 	\item The fatcat catalog updates its metadata
 	      continuously\footnote{A changelog can currently be followed here:
-		      \url{https://fatcat.wiki/changelog}.} and web crawls are conducted
-	      regularly.  Current processing pipelines cover raw reference snapshot
+		      \href{https://fatcat.wiki/changelog}{https://fatcat.wiki/changelog}.} and web crawls are conducted
+	      regularly. Current processing pipelines cover raw reference snapshot
 	      creation and derivation of the graph structure, which allows us to rerun
 	      processing based on updated data as it becomes available.
 
@@ -399,11 +399,10 @@ As other dataset in this field we expect this dataset to be iterated upon.
 
 	\item As of this version, a number of raw reference
 	      docs remain unmatched, which means that neither exact nor fuzzy matching
-	      has detected a link to a known entity. On the one
-	      hand, this can hint at missing metadata. However, parts of the data
+	      has detected a link to a known entity. Metadata might be missing. However, parts of the data
 	      will contain a reference to a catalogued entity, but in a specific,
 	      dense and harder to recover form.
-	      This also include improvements to the fuzzy matching approach.
+
 	\item The reference dataset contains millions of URLs and their integration
 	      into the graph has been implemented as a prototype. A full implementation
 	      requires a few data cleanup and normalization steps.
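
Editorial sketch (not part of the commit): the paper text in this diff describes key derivation as either exact (via an identifier such as a DOI) or based on value normalization, like ``slugifying'' a title string, with documents sharing a key becoming match candidates. The Go snippet below illustrates that idea only. The names `Doc`, `slugify`, `matchKey` and `candidateGroups`, as well as the normalization rule itself (lowercase, keep only letters and digits), are assumptions made for illustration and are not taken from the refcat code base.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// Doc is a hypothetical, minimal stand-in for a release or reference record.
type Doc struct {
	ID    string
	DOI   string
	Title string
}

// slugify lowercases a title and keeps only letters and digits, so that
// differences in case, punctuation and spacing do not change the key.
// This rule is an assumption; the real normalization may differ.
func slugify(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// matchKey derives an exact key from the DOI if one is present, otherwise a
// normalized key from the title. It returns "" if neither is usable.
func matchKey(d Doc) string {
	if doi := strings.ToLower(strings.TrimSpace(d.DOI)); doi != "" {
		return "doi:" + doi
	}
	if slug := slugify(d.Title); slug != "" {
		return "title:" + slug
	}
	return ""
}

// candidateGroups groups document IDs by key and keeps only groups with at
// least two members, i.e. possible match pairs that still need verification.
func candidateGroups(docs []Doc) map[string][]string {
	groups := make(map[string][]string)
	for _, d := range docs {
		key := matchKey(d)
		if key == "" {
			continue // no identifier and no title: cannot derive a key
		}
		groups[key] = append(groups[key], d.ID)
	}
	for key, ids := range groups {
		if len(ids) < 2 {
			delete(groups, key)
			continue
		}
		sort.Strings(ids)
	}
	return groups
}

func main() {
	docs := []Doc{
		{ID: "a", Title: "Citation Graphs: An Overview"},
		{ID: "b", Title: "citation graphs - an overview"},
		{ID: "c", DOI: "10.1000/XYZ"},
		{ID: "d", DOI: "10.1000/xyz"},
		{ID: "e", Title: "Unrelated"},
	}
	for key, ids := range candidateGroups(docs) {
		fmt.Println(key, ids)
	}
}
```

In this sketch a shared title key only nominates candidates; as the paper notes, fuzzy candidates are still passed through a separate, stricter verification step before a match is recorded. A document with a DOI and a title-only reference to the same work would not meet under this single-key simplification.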