author     Martin Czygan <martin.czygan@gmail.com>  2021-08-08 15:18:29 +0200
committer  Martin Czygan <martin.czygan@gmail.com>  2021-08-08 15:18:29 +0200
commit     bd66b58cded2c2c7e7b7e5d374434d6531dd70de (patch)
tree       00417812b9787ab4492e2c590fcf1bf6f4b576e7 /docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
parent     bb64b3aa62267676302e75f0ca44157b514beec4 (diff)
download   refcat-bd66b58cded2c2c7e7b7e5d374434d6531dd70de.tar.gz  refcat-bd66b58cded2c2c7e7b7e5d374434d6531dd70de.zip
docs: cleanup and naming
Diffstat (limited to 'docs/TR-20210808100000-IA-WDS-REFCAT/main.tex')
-rw-r--r--  docs/TR-20210808100000-IA-WDS-REFCAT/main.tex  362
1 file changed, 362 insertions, 0 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
new file mode 100644
index 0000000..e4febd9
--- /dev/null
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -0,0 +1,362 @@
\documentclass[hidelinks,10pt,twocolumn]{article}
\usepackage{simpleConference}
\usepackage[utf8]{inputenc}
\usepackage{times}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{doi}
\usepackage{amssymb}
\usepackage{url,hyperref}
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{caption}

\usepackage{datetime}
\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
\setlength{\parindent}{0pt}

\begin{document}

\title{Fatcat Reference Dataset}

\author{Martin Czygan \\
  \\
  Internet Archive \\
  San Francisco, California, USA \\
  martin@archive.org \\
  \and
  Bryan Newbold \\
  \\
  Internet Archive \\
  San Francisco, California, USA \\
  bnewbold@archive.org \\
  \\
}


\maketitle
\thispagestyle{empty}


\begin{abstract}
    As part of its scholarly data efforts, the Internet Archive releases a first
    version of a citation graph dataset, named \emph{refcat}, derived from
    scholarly publications and additional data sources. It is composed of data
    gathered by the fatcat cataloging
    project\footnote{\url{https://fatcat.wiki}}, related web-scale crawls
    targeting primary and secondary scholarly outputs, as well as metadata from
    the Open Library\footnote{\url{https://openlibrary.org}} project and
    Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
    graph consists of 1,323,423,672 citations. We release this dataset under a
    CC0 Public Domain Dedication, accessible through an archive
    item\footnote{\url{https://archive.org/details/refcat_2021-07-28}}. All
    code used in the derivation process is released under an MIT
    license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
\end{abstract}

\keywords{Citation Graph, Web Archiving}

\section{Introduction}


The Internet Archive releases a first version of a citation graph dataset
derived from a raw corpus of about 2.5B references gathered from metadata and
from data obtained by PDF extraction tools such as
GROBID\citep{lopez2009grobid}. Additionally, we consider integration with
metadata from Open Library and Wikipedia. The goal of this report is to briefly
describe the current contents and the derivation of the dataset. We expect
this dataset to be iterated upon, with changes both in content and processing.

Modern citation indexes can be traced back to the early computing age, when
projects like the Science Citation Index (1955)\citep{garfield2007evolution}
were first devised; they live on in commercial knowledge bases today. Open
alternatives such as the Open Citations Corpus (OCC) were started in 2010; its
first version contained 6,325,178 individual
references\citep{shotton2013publishing}. Other notable early projects
include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last
decade has seen the emergence of more openly available, large scale
citation projects like Microsoft Academic\citep{sinha2015overview} or the
Initiative for Open Citations\citep{i4oc,shotton2018funders}. In 2021,
according to \citep{hutchins2021tipping}, over 1B citations are publicly
available, marking a tipping point for this category of data.

\section{Related Work}

There are a few large scale citation datasets available today. COCI, the
``OpenCitations Index of Crossref open DOI-to-DOI citations'', was first
released on 2018-07-29. As of its most recent
release\footnote{\url{https://opencitations.net/download}}, on 2021-07-29, it
contains 1,094,394,688 citations across 65,835,422 bibliographic
resources\citep{peroni2020opencitations}.

The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
``a Wikimedia initiative to develop open citations and linked bibliographic
data to serve free knowledge'', continuously adds citations to its database and
as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
publications\footnote{\url{http://wikicite.org/statistics.html}}.

Microsoft Academic Graph\citep{sinha2015overview} comprises a number of
entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
with \emph{PaperReferences} being one relation among many others. As of
2021-06-07\footnote{A recent copy has been preserved at
  \url{https://archive.org/details/mag-2021-06-07}} the
\emph{PaperReferences} relation contains 1,832,226,781 rows (edges) across
123,923,466 bibliographic entities.

Numerous other projects have been or are concerned with various aspects of
citation discovery and curation as part of their feature set, among them
Semantic Scholar\citep{fricke2018semantic}, CiteSeerX\citep{li2006citeseerx}
and Aminer\citep{tang2016aminer}.

As mentioned in \citep{hutchins2021tipping}, the number of openly available
citations is not expected to shrink in the future.


\section{Dataset}

We release the first version of the \emph{refcat} dataset in a format used
internally for storage and for serving queries, which we call \emph{biblioref}
or \emph{bref} for short. The dataset includes metadata from fatcat, the
Open Library project and inbound links from the English Wikipedia. The fatcat
project itself aggregates data from a variety of open data sources, such as
Crossref\citep{crossref}, PubMed\citep{canese2013pubmed},
DataCite\citep{brase2009datacite}, DOAJ\citep{doaj}, dblp\citep{ley2002dblp} and others,
as well as metadata generated from analysis of data preserved at the Internet
Archive and active crawls of publication sites on the web.

The dataset is
integrated into the \href{https://fatcat.wiki}{fatcat website} and allows users
to explore inbound and outbound references\citep{fatcatguidereferencegraph}.

The format records source and target (fatcat release and work) identifiers, a
few attributes from the metadata (such as year or release stage), as well as
information about the match status and provenance.

The dataset currently contains 1,323,423,672 citations across 76,327,662
entities (55,123,635 unique source and 60,244,206 unique target work
identifiers); for 1,303,424,212 citations, or 98.49\% of the total, we have a
DOI for both source and target. The majority of matches, 1,250,523,321, are
established through identifier based matching (DOI, PMID, PMCID, ARXIV, ISBN);
72,900,351 citations are established through fuzzy matching techniques.

The majority of citations between \emph{refcat} and COCI overlap, as can be
seen in~Table~\ref{table:cocicmp}.
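
To make the \emph{bref} format description above more concrete, the following
sketch shows roughly what a single record could look like. This is an
illustration only: the field names and values are assumptions made for this
example and do not necessarily match the released schema exactly.

\begin{verbatim}
# Illustrative sketch of a single "bref"
# record; field names are assumptions, not
# the released schema.
example_bref = {
    "source_release_ident": "rel-aaaa",
    "source_work_ident": "work-aaaa",
    "target_release_ident": "rel-bbbb",
    "target_work_ident": "work-bbbb",
    "source_year": 2018,
    "source_release_stage": "published",
    "match_provenance": "crossref",
    "match_status": "exact",
    "match_reason": "doi",
}
\end{verbatim}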

\begin{table}[]
  \begin{center}
  \begin{tabular}{ll}
  \toprule
  \textbf{Set} & \textbf{Count} \\
  \midrule
  COCI (C)              & 1,094,394,688 \\
  \emph{refcat-doi} (R) & 1,303,424,212 \\
  C $\cap$ R            & 1,007,539,966 \\
  C $\setminus$ R       & 86,854,309 \\
  R $\setminus$ C       & 295,884,246 \\
  \bottomrule
  \end{tabular}
  \vspace*{2mm}
  \caption{Comparison between COCI and \emph{refcat-doi}, a subset of
  \emph{refcat} where entities have a known DOI. At least 50\% of the
  295,884,246 references found only in \emph{refcat-doi} come from links
  recorded within a specific dataset provider (GBIF, DOI prefix:
  10.15468).}
  \label{table:cocicmp}
  \end{center}
\end{table}

% zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
% zstdcat -T0 uniq_34.tsv.zst | pv -l | LC_ALL=C cut -f3,4 | zstd -c -T0 > uniq_34_doi.tsv.zst
% find . -name "*.csv" | parallel -j 16 "LC_ALL=C grep -v ^oci, {} | LC_ALL=C cut -d, -f2,3" | pv -l | zstd -c -T0 > ../6741422v10_doi_only.csv.zst


\section{System Design}

The constraints for the system design are informed by the volume and the
variety of the data. The capability to run the whole graph derivation on a
single machine was a minor goal as well. In total, the raw inputs amount to a
few terabytes of textual content, mostly newline delimited JSON. More
importantly, while the number of data fields is low, many records are only
partially populated, with hundreds of different combinations of available
field values found in the raw reference data. This is most likely caused by
aggregators passing on reference data from hundreds of sources, which do not
necessarily agree on a common granularity for citation data, as well as by
artifacts of machine learning based structured data extraction tools.

Each combination of fields may require a slightly different processing path.
For example, references with an arXiv identifier can be processed differently
from references with only a title. Over 50\% of the raw reference data falls
into one of eight field combinations, as listed in
Table~\ref{table:fields}.

\begin{table}[]
  \begin{center}
    \begin{tabular}{ll}
    \toprule
    \textbf{Fields} & \textbf{Percentage} \\
    \midrule
    \multicolumn{1}{l}{CN $\cdot$ RN $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 14\% \\
    \multicolumn{1}{l}{\textbf{DOI}} & 14\% \\
    \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 5\% \\
    \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ U $\cdot$ V $\cdot$ Y} & 4\% \\
    \multicolumn{1}{l}{\textbf{PMID} $\cdot$ U} & 4\% \\
    \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ T $\cdot$ V $\cdot$ Y} & 4\% \\
    \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ Y} & 4\% \\
    \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ V $\cdot$ Y} & 4\% \\
    \bottomrule
    \end{tabular}
    \vspace*{2mm}
    \caption{Top 8 combinations of available fields in raw reference data,
    accounting for about 53\% of the total data (CN = container name, CRN =
    contrib raw name, P = pages, T = title, U = unstructured, V = volume, IS =
    issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain
    any value. Identifiers are set in bold.}
    \label{table:fields}
  \end{center}
\end{table}

Overall, a map-reduce style\citep{dean2010mapreduce} approach is
followed\footnote{While the operations are similar, the processing is not
  distributed but runs on a single machine. For space efficiency,
  zstd\citep{collet2018zstandard} is used to compress raw data and
  derivations.}, which allows for some uniformity in the overall processing.
We extract (key, document) tuples (as TSV) from the raw JSON data and sort by
key. We then group documents with the same key and apply a function on each
group in order to generate our target schema or perform additional operations
such as deduplication or fusion of matched and unmatched references.
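
The following is a minimal, self-contained sketch of this sort and group step,
using Python for illustration; the actual pipeline operates on compressed TSV
files with a set of custom tools, which are not shown here.

\begin{verbatim}
import itertools

# Toy stand-ins for extracted reference
# documents.
docs = [
    {"doi": "10.123/abc", "title": "A"},
    {"doi": "10.123/abc", "title": "A, v2"},
    {"doi": "10.456/xyz", "title": "B"},
]

def key(doc):
    # Exact key derivation via an identifier.
    return doc["doi"]

# Map-reduce style: sort by key, group, then
# apply a function to each group (here, a
# trivial deduplication).
docs.sort(key=key)
deduped = []
for k, group in itertools.groupby(docs, key=key):
    deduped.append(next(group))

print(len(deduped))  # -> 2
\end{verbatim}

Sorting by a derived key keeps memory usage bounded, since only one group of
documents needs to be held at a time, which fits the single machine constraint
mentioned above.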

The key derivation can be exact (via an identifier like DOI, PMID, etc.) or
based on a value normalization, like slugifying a title string. For identifier
based matches we can generate the target schema directly. For fuzzy matching
candidates, we pass possible match pairs through a verification procedure,
which is implemented for \emph{release
entity}\footnote{\url{https://guide.fatcat.wiki/entity_release.html}.} pairs.
This procedure is a domain dependent, rule based verification, able to identify
different versions of a publication, preprint-published pairs, and documents
that are similar according to various metrics calculated over title and author
fields. The fuzzy matching approach is applied to all reference documents
without an identifier (a title is currently required).

With a few schema conversions, fuzzy matching can be applied to Wikipedia
articles and Open Library (edition) records as well. Precision and recall
concerns are split across the two stages: we are generous during match
candidate generation in order to improve recall, but strict during
verification, in order to control precision. Quality assurance for verification
is implemented through a growing list of test cases of real examples from the
catalog and their expected or desired match status\footnote{The list can be
  found under:
  \url{https://gitlab.com/internetarchive/cgraph/-/blob/master/skate/testdata/verify.csv}.
  It is helpful to keep this test suite independent of any specific programming
  language.}.
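
As an illustration of the two stages, the sketch below shows a deliberately
simplified slug based key derivation and a verification check over title and
author fields. The normalization and rules used in the actual verification
procedure are more involved; this example only conveys the general shape of the
approach.

\begin{verbatim}
import re
import unicodedata

def slug(title):
    # Simplified normalization: strip accents,
    # lowercase, keep only letters and digits.
    t = unicodedata.normalize("NFKD", title)
    t = t.encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]", "", t.lower())

def plausible(a, b):
    # Simplified verification: identical title
    # slug plus overlapping author name tokens.
    if slug(a["title"]) != slug(b["title"]):
        return False
    ta = set(a["authors"].lower().split())
    tb = set(b["authors"].lower().split())
    return bool(ta & tb)

ref = {"title": "On Citations!",
       "authors": "J. Doe"}
rel = {"title": "On citations",
       "authors": "Jane Doe"}
print(plausible(ref, rel))  # -> True
\end{verbatim}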

\section{Limitations and Future Work}

As with other datasets in this field, we expect this dataset to be iterated
upon.

\begin{itemize}
    \item The fatcat catalog updates its metadata
      continuously\footnote{A changelog can currently be followed here:
      \url{https://fatcat.wiki/changelog}} and web crawls are conducted
      regularly. Current processing pipelines cover raw reference snapshot
      creation and derivation of the graph structure, which allows us to rerun
      processing as updated data becomes available.

    \item Metadata extraction from PDFs depends on supervised machine learning
      models, which in turn depend on available training datasets. With
      additional crawls and metadata becoming available, we hope to improve the
      models used for metadata extraction, improving yield and reducing data
      extraction artifacts in the process.

    \item As of this version, a number of raw reference documents remain
      unmatched, which means that neither exact nor fuzzy matching has detected
      a link to a known entity. This can hint at missing metadata; however,
      parts of the data will contain a reference to a catalogued entity, just
      in a specific, dense and harder to recover form. Addressing these cases
      will also require improvements to the fuzzy matching approach.

    \item The reference dataset contains millions of URLs, and their
      integration into the graph has been implemented as a prototype. A full
      implementation requires a few data cleanup and normalization steps.
\end{itemize}

\section{Acknowledgements}

This work is partially supported by a grant from the \emph{Andrew W. Mellon
  Foundation}.


\section{Appendix A}


A note on data quality: while we implement various data quality measures,
real-world data, especially data coming from many different sources, will
contain issues. Among other measures, we keep track of match reasons,
especially for fuzzy matching, to be able to zoom in on systematic errors
more easily (see~Table~\ref{table:matches}).
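
As a sketch of how such a breakdown can be produced, the following example
tallies (provenance, status, reason) combinations. The records are toy
stand-ins, assuming each released citation record carries these three fields
(as in the illustrative record shown earlier); the real accounting is done over
the full dataset.

\begin{verbatim}
import collections

# Tally (provenance, status, reason) tuples to
# spot systematic matching errors.
records = [
    {"provenance": "crossref",
     "status": "exact", "reason": "doi"},
    {"provenance": "fuzzy",
     "status": "strong",
     "reason": "jaccardauthors"},
    {"provenance": "crossref",
     "status": "exact", "reason": "doi"},
]

counts = collections.Counter(
    (r["provenance"], r["status"], r["reason"])
    for r in records)

for combo, n in counts.most_common():
    print(n, *combo)
\end{verbatim}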

\begin{table}[]
  \footnotesize
  \captionsetup{font=normalsize}
  \begin{center}
    \begin{tabular}{@{}rlll@{}}
    \toprule
    \textbf{Count} & \textbf{Provenance} & \textbf{Status} & \textbf{Reason} \\ \midrule
    934932865 & crossref & exact & doi \\
    151366108 & fatcat-datacite & exact & doi \\
    65345275 & fatcat-pubmed & exact & pmid \\
    48778607 & fuzzy & strong & jaccardauthors \\
    42465250 & grobid & exact & doi \\
    29197902 & fatcat-pubmed & exact & doi \\
    19996327 & fatcat-crossref & exact & doi \\
    11996694 & fuzzy & strong & slugtitleauthormatch \\
    9157498 & fuzzy & strong & tokenizedauthors \\
    3547594 & grobid & exact & arxiv \\
    2310025 & fuzzy & exact & titleauthormatch \\
    1496515 & grobid & exact & pmid \\
    680722 & crossref & strong & jaccardauthors \\
    476331 & fuzzy & strong & versioneddoi \\
    449271 & grobid & exact & isbn \\
    230645 & fatcat-crossref & strong & jaccardauthors \\
    190578 & grobid & strong & jaccardauthors \\
    156657 & crossref & exact & isbn \\
    123681 & fatcat-pubmed & strong & jaccardauthors \\
    79328 & crossref & exact & arxiv \\
    57414 & crossref & strong & tokenizedauthors \\
    53480 & fuzzy & strong & pmiddoipair \\
    52453 & fuzzy & strong & dataciterelatedid \\
    47119 & grobid & strong & slugtitleauthormatch \\
    36774 & fuzzy & strong & arxivversion \\
    % 35311 & fuzzy & strong & customieeearxiv \\
    % 33863 & grobid & exact & pmcid \\
    % 23504 & crossref & strong & slugtitleauthormatch \\
    % 22753 & fatcat-crossref & strong & tokenizedauthors \\
    % 17720 & grobid & exact & titleauthormatch \\
    % 14656 & crossref & exact & titleauthormatch \\
    % 14438 & grobid & strong & tokenizedauthors \\
    % 7682 & fatcat-crossref & exact & arxiv \\
    % 5972 & fatcat-crossref & exact & isbn \\
    % 5525 & fatcat-pubmed & exact & arxiv \\
    % 4290 & fatcat-pubmed & strong & tokenizedauthors \\
    % 2745 & fatcat-pubmed & exact & isbn \\
    % 2342 & fatcat-pubmed & strong & slugtitleauthormatch \\
    % 2273 & fatcat-crossref & strong & slugtitleauthormatch \\
    % 1960 & fuzzy & exact & workid \\
    % 1150 & fatcat-crossref & exact & titleauthormatch \\
    % 1041 & fatcat-pubmed & exact & titleauthormatch \\
    % 895 & fuzzy & strong & figshareversion \\
    % 317 & fuzzy & strong & titleartifact \\
    % 82 & grobid & strong & titleartifact \\
    % 33 & crossref & strong & titleartifact \\
    % 5 & fuzzy & strong & custombsiundated \\
    % 1 & fuzzy & strong & custombsisubdoc \\
    % 1 & fatcat & exact & doi \\
    \bottomrule
    \end{tabular}
    \vspace*{2mm}
    \caption{Match counts (top 25) by reference provenance, match status and
    match reason. The match reason identifiers encode specific rules in the
    domain dependent verification process and are included for completeness;
    we do not describe each rule in detail in this report.}
    \label{table:matches}
  \end{center}
\end{table}

\bibliographystyle{abbrv}
% \bibliographystyle{plainnat}
\bibliography{refs}
\end{document}