\documentclass[hidelinks,10pt,twocolumn]{article}
\usepackage{simpleConference}
\usepackage[utf8]{inputenc}
\usepackage{times}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{doi}
\usepackage{amssymb}
\usepackage{url,hyperref}
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{caption}
\usepackage{datetime}

\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
\setlength{\parindent}{0pt}

\begin{document}

\title{Fatcat Reference Dataset}

\author{Martin Czygan \\ \\ Internet Archive \\ San Francisco, California, USA \\ martin@archive.org \\ \and Bryan Newbold \\ \\ Internet Archive \\ San Francisco, California, USA \\ bnewbold@archive.org \\ \\ }

\maketitle
\thispagestyle{empty}

\begin{abstract}
As part of its scholarly data efforts, the Internet Archive releases a first version of a citation graph dataset, named \emph{refcat}, derived from scholarly publications and additional data sources. It is composed of data gathered by the fatcat cataloging project\footnote{\url{https://fatcat.wiki}}, related web-scale crawls targeting primary and secondary scholarly outputs, as well as metadata from the Open Library\footnote{\url{https://openlibrary.org}} project and Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the graph consists of 1,323,423,672 citations. We release this dataset under a CC0 Public Domain Dedication, accessible through an archive item\footnote{\url{https://archive.org/details/refcat_2021-07-28}}. All code used in the derivation process is released under an MIT license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
\end{abstract}

\keywords{Citation Graph, Web Archiving}

\section{Introduction}

The Internet Archive releases a first version of a citation graph dataset derived from a raw corpus of about 2.5B references gathered from metadata and data obtained by PDF extraction tools such as GROBID\cite{lopez2009grobid}. Additionally, we consider integration with metadata from Open Library and Wikipedia. The goal of this report is to briefly describe the current contents and the derivation of the dataset. We expect this dataset to be iterated upon, with changes both in content and processing.

Modern citation indexes can be traced back to the early computing age, when projects like the Science Citation Index (1955)\citep{garfield2007evolution} were first devised, living on in existing commercial knowledge bases today. Open alternatives followed, such as the Open Citations Corpus (OCC) in 2010, the first version of which contained 6,325,178 individual references\citep{shotton2013publishing}. Other notable early projects include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last decade has seen the emergence of more openly available, large-scale citation projects like Microsoft Academic\citep{sinha2015overview} or the Initiative for Open Citations\citep{i4oc}\citep{shotton2018funders}. According to \citep{hutchins2021tipping}, over 1B citations are publicly available as of 2021, marking a tipping point for this category of data.

\section{Related Work}

There are a few large-scale citation datasets available today. COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI citations'', was first released on 2018-07-29. As of its most recent release\footnote{\url{https://opencitations.net/download}}, on 2021-07-29, it contains 1,094,394,688 citations across 65,835,422 bibliographic resources\citep{peroni2020opencitations}.
The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project, ``a Wikimedia initiative to develop open citations and linked bibliographic data to serve free knowledge'', continuously adds citations to its database and as of 2021-06-28 tracks 253,719,394 citations across 39,994,937 publications\footnote{\url{http://wikicite.org/statistics.html}}.

Microsoft Academic Graph\citep{sinha2015overview} comprises a number of entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}, with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at \url{https://archive.org/details/mag-2021-06-07}} the \emph{PaperReferences} relation contains 1,832,226,781 rows (edges) across 123,923,466 bibliographic entities.

Numerous other projects have been or are concerned with various aspects of citation discovery and curation as part of their feature set, among them Semantic Scholar\citep{fricke2018semantic}, CiteSeerX\citep{li2006citeseerx} or Aminer\citep{tang2016aminer}. As mentioned in \citep{hutchins2021tipping}, the number of openly available citations is not expected to shrink in the future.

\section{Dataset}

We release the first version of the \emph{refcat} dataset in a format that we use internally for storage and to serve queries, and which we call \emph{biblioref} (or \emph{bref} for short). The dataset includes metadata from fatcat, the Open Library project and inbound links from the English Wikipedia. The fatcat project itself aggregates data from a variety of open data sources, such as Crossref\citep{crossref}, PubMed\citep{canese2013pubmed}, DataCite\citep{brase2009datacite}, DOAJ\citep{doaj}, dblp\citep{ley2002dblp} and others, as well as metadata generated from analysis of data preserved at the Internet Archive and active crawls of publication sites on the web.
The dataset is integrated into the \href{https://fatcat.wiki}{fatcat website} and allows users to explore inbound and outbound references\cite{fatcatguidereferencegraph}. The format records source and target (fatcat release and work) identifiers, a few attributes from the metadata (such as year or release stage), as well as information about the match status and provenance.

The dataset currently contains 1,323,423,672 citations across 76,327,662 entities (55,123,635 unique source and 60,244,206 unique target work identifiers; for 1,303,424,212 citations, or 98.49\% of the total, we have a DOI for both source and target). The majority of matches, 1,250,523,321, are established through identifier-based matching (DOI, PMID, PMCID, arXiv, ISBN). The remaining 72,900,351 citations are established through fuzzy matching techniques. The majority of citations in \emph{refcat} and COCI overlap, as can be seen in~Table~\ref{table:cocicmp}.

\begin{table}[]
\begin{center}
\begin{tabular}{ll}
\toprule
\textbf{Set} & \textbf{Count} \\
\midrule
COCI (C) & 1,094,394,688 \\
\emph{refcat-doi} (R) & 1,303,424,212 \\
% zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
C $\cap$ R & 1,007,539,966 \\
C $\setminus$ R & 86,854,309 \\
R $\setminus$ C & 295,884,246
\end{tabular}
\vspace*{2mm}
\caption{Comparison between COCI and \emph{refcat-doi}, a subset of \emph{refcat} where entities have a known DOI.
At least 50\% of the 295,884,246 references found only in \emph{refcat-doi} come from links recorded by a specific dataset provider (GBIF, DOI prefix: 10.15468).}
\label{table:cocicmp}
\end{center}
\end{table}

% zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
% zstdcat -T0 uniq_34.tsv.zst | pv -l | LC_ALL=C cut -f3,4 | zstd -c -T0 > uniq_34_doi.tsv.zst
% find . -name "*.csv" | parallel -j 16 "LC_ALL=C grep -v ^oci, {} | LC_ALL=C cut -d, -f2,3" | pv -l | zstd -c -T0 > ../6741422v10_doi_only.csv.zst

\section{System Design}

The constraints for the system design are informed by the volume and the variety of the data. The capability to run the whole graph derivation on a single machine was a minor goal as well. In total, the raw inputs amount to a few terabytes of textual content, mostly newline-delimited JSON. More importantly, while the number of data fields is low, certain schemas are sparsely populated, with hundreds of different combinations of available field values found in the raw reference data. This is most likely caused by aggregators passing on reference data coming from hundreds of sources, which do not necessarily agree on a common granularity for citation data, and by artifacts of machine-learning-based structured data extraction tools.

Each combination of fields may require a slightly different processing path. For example, references with an arXiv identifier can be processed differently from references with only a title. Over 50\% of the raw reference data comes from a set of eight field combinations, as listed in Table~\ref{table:fields}.
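
A tally of field combinations like the one described above can be sketched in a few lines of Python. This is only an illustration and not the code used to derive the dataset: the JSON field names, the abbreviation mapping and the sample records are assumptions made for this example.

```python
import json
from collections import Counter

# Hypothetical mapping from raw field names to abbreviations;
# the real reference schema has more fields.
ABBREVIATIONS = {
    "container_name": "CN",
    "doi": "DOI",
    "title": "T",
    "year": "Y",
}


def field_signature(ref):
    """Return the sorted abbreviations of all non-empty known fields."""
    present = [abbr for name, abbr in ABBREVIATIONS.items() if ref.get(name)]
    return " ".join(sorted(present))


def count_signatures(lines):
    """Count how often each field combination occurs in JSON lines."""
    counts = Counter()
    for line in lines:
        counts[field_signature(json.loads(line))] += 1
    return counts


# Made-up sample records (placeholder DOIs, not real data).
sample = [
    '{"doi": "10.1/a"}',
    '{"doi": "10.1/b"}',
    '{"container_name": "J", "year": 2001}',
    '{"title": "Some title", "year": 1999, "doi": ""}',
]
counts = count_signatures(sample)
```

Empty values are treated as absent, so the last sample record counts toward the title and year combination rather than a combination that includes a DOI.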
\begin{table}[]
\begin{center}
\begin{tabular}{ll}
\toprule
\textbf{Fields} & \textbf{Percentage} \\
\midrule
\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 14\% \\
\multicolumn{1}{l}{\textbf{DOI}} & 14\% \\
\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 5\% \\
\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ U $\cdot$ V $\cdot$ Y} & 4\% \\
\multicolumn{1}{l}{\textbf{PMID} $\cdot$ U} & 4\% \\
\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ T $\cdot$ V $\cdot$ Y} & 4\% \\
\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ Y} & 4\% \\
\multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ V $\cdot$ Y} & 4\% \\
\end{tabular}
\vspace*{2mm}
\caption{Top 8 combinations of available fields in raw reference data, accounting for about 53\% of the total data (CN = container name, CRN = contrib raw name, P = pages, T = title, U = unstructured, V = volume, IS = issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any value. Identifiers are emphasized.}
\label{table:fields}
\end{center}
\end{table}

Overall, a map-reduce style\citep{dean2010mapreduce} approach is followed\footnote{While the operations are similar, the processing is not distributed but runs on a single machine. For space efficiency, zstd\citep{collet2018zstandard} is used to compress raw data and derivations.}, which allows for some uniformity in the overall processing. We extract (key, document) tuples (as TSV) from the raw JSON data and sort by key. We then group documents with the same key and apply a function to each group in order to generate our target schema or perform additional operations such as deduplication or fusion of matched and unmatched references.

The key derivation can be exact (via an identifier like DOI, PMID, etc.) or based on a value normalization, like slugifying a title string. For identifier-based matches we can generate the target schema directly.
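
The extract, sort, group and reduce steps can be illustrated with a minimal in-memory Python sketch. It is not the released implementation: the document layout (\texttt{kind}, \texttt{id}, \texttt{doi}, \texttt{title}), the helper names and the simple slug rule are assumptions made for this example, and sorting happens in memory here rather than over compressed TSV files. Groups keyed by a normalized title would only be match candidates in practice and would still need verification, which the sketch omits.

```python
import itertools
import json
import re


def slugify(title):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def derive_key(doc):
    """Prefer an exact identifier; fall back to a normalized title."""
    if doc.get("doi"):
        return "doi:" + doc["doi"].lower()
    if doc.get("title"):
        return "title:" + slugify(doc["title"])
    return None


def group_by_key(lines):
    """Map JSON lines to (key, doc) pairs, sort by key, yield groups."""
    pairs = []
    for line in lines:
        doc = json.loads(line)
        key = derive_key(doc)
        if key is not None:
            pairs.append((key, doc))
    pairs.sort(key=lambda pair: pair[0])
    for key, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        yield key, [doc for _, doc in group]


def reduce_group(docs):
    """Pair every reference with every release sharing the same key."""
    refs = [d for d in docs if d["kind"] == "ref"]
    releases = [d for d in docs if d["kind"] == "release"]
    return [(r["id"], t["id"]) for r in refs for t in releases]


# Made-up input documents (hypothetical schema, placeholder DOIs).
lines = [
    '{"kind": "release", "id": "rel1", "doi": "10.1/X"}',
    '{"kind": "ref", "id": "ref1", "doi": "10.1/x"}',
    '{"kind": "release", "id": "rel2", "title": "Open Citations!"}',
    '{"kind": "ref", "id": "ref2", "title": "open citations"}',
    '{"kind": "ref", "id": "ref3", "title": "Unmatched work"}',
]
edges = [
    edge
    for _, docs in group_by_key(lines)
    for edge in reduce_group(docs)
]
```

In this toy input, \texttt{ref1} is linked through its DOI and \texttt{ref2} through its slugified title, while \texttt{ref3} shares a key with no release and yields no edge.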
For fuzzy matching candidates, we pass possible match pairs through a verification procedure, which is implemented for \emph{release entity}\footnote{\url{https://guide.fatcat.wiki/entity_release.html}.} pairs. This procedure is a domain-dependent, rule-based verification, able to identify different versions of a publication, preprint-published pairs and documents that are similar by various metrics calculated over title and author fields. The fuzzy matching approach is applied to all reference documents without an identifier (a title is currently required). With a few schema conversions, fuzzy matching can be applied to Wikipedia articles and Open Library (edition) records as well.

The trade-off between precision and recall is reflected in the two stages: we are generous in the match candidate generation phase in order to improve recall, but we are strict during verification in order to control precision. Quality assurance for verification is implemented through a growing list of test cases of real examples from the catalog and their expected or desired match status\footnote{The list can be found under: \url{https://gitlab.com/internetarchive/cgraph/-/blob/master/skate/testdata/verify.csv}. It is helpful to keep this test suite independent of any specific programming language.}.

\section{Limitations and Future Work}

As with other datasets in this field, we expect this dataset to be iterated upon.

\begin{itemize}
\item The fatcat catalog updates its metadata continuously\footnote{A changelog can currently be followed here: \url{https://fatcat.wiki/changelog}} and web crawls are conducted regularly. Current processing pipelines cover raw reference snapshot creation and derivation of the graph structure, which allows us to rerun processing based on updated data as it becomes available.
\item Metadata extraction from PDFs depends on supervised machine learning models, which in turn depend on available training datasets.
With additional crawls and metadata available, we hope to improve the models used for metadata extraction, improving yield and reducing data extraction artifacts in the process.
\item As of this version, a number of raw reference docs remain unmatched, which means that neither exact nor fuzzy matching has detected a link to a known entity. On the one hand, this can hint at missing metadata. On the other hand, parts of the data will contain a reference to a catalogued entity, but in a specific, dense and harder-to-recover form. Future work here also includes improvements to the fuzzy matching approach.
\item The reference dataset contains millions of URLs, and their integration into the graph has been implemented as a prototype. A full implementation requires a few data cleanup and normalization steps.
\end{itemize}

\section{Acknowledgements}

This work is partially supported by a grant from the \emph{Andrew W. Mellon Foundation}.

\section{Appendix A}

A note on data quality: while we implement various data quality measures, real-world data, especially when coming from many different sources, will contain issues. Among other measures, we keep track of match reasons, especially for fuzzy matching, to be able to zoom in on systematic errors more easily (see~Table~\ref{table:matches}).
\begin{table}[]
\footnotesize
\captionsetup{font=normalsize}
\begin{center}
\begin{tabular}{@{}rlll@{}}
\toprule
\textbf{Count} & \textbf{Provenance} & \textbf{Status} & \textbf{Reason} \\
\midrule
934932865 & crossref & exact & doi \\
151366108 & fatcat-datacite & exact & doi \\
65345275 & fatcat-pubmed & exact & pmid \\
48778607 & fuzzy & strong & jaccardauthors \\
42465250 & grobid & exact & doi \\
29197902 & fatcat-pubmed & exact & doi \\
19996327 & fatcat-crossref & exact & doi \\
11996694 & fuzzy & strong & slugtitleauthormatch \\
9157498 & fuzzy & strong & tokenizedauthors \\
3547594 & grobid & exact & arxiv \\
2310025 & fuzzy & exact & titleauthormatch \\
1496515 & grobid & exact & pmid \\
680722 & crossref & strong & jaccardauthors \\
476331 & fuzzy & strong & versioneddoi \\
449271 & grobid & exact & isbn \\
230645 & fatcat-crossref & strong & jaccardauthors \\
190578 & grobid & strong & jaccardauthors \\
156657 & crossref & exact & isbn \\
123681 & fatcat-pubmed & strong & jaccardauthors \\
79328 & crossref & exact & arxiv \\
57414 & crossref & strong & tokenizedauthors \\
53480 & fuzzy & strong & pmiddoipair \\
52453 & fuzzy & strong & dataciterelatedid \\
47119 & grobid & strong & slugtitleauthormatch \\
36774 & fuzzy & strong & arxivversion \\
% 35311 & fuzzy & strong & customieeearxiv \\
% 33863 & grobid & exact & pmcid \\
% 23504 & crossref & strong & slugtitleauthormatch \\
% 22753 & fatcat-crossref & strong & tokenizedauthors \\
% 17720 & grobid & exact & titleauthormatch \\
% 14656 & crossref & exact & titleauthormatch \\
% 14438 & grobid & strong & tokenizedauthors \\
% 7682 & fatcat-crossref & exact & arxiv \\
% 5972 & fatcat-crossref & exact & isbn \\
% 5525 & fatcat-pubmed & exact & arxiv \\
% 4290 & fatcat-pubmed & strong & tokenizedauthors \\
% 2745 & fatcat-pubmed & exact & isbn \\
% 2342 & fatcat-pubmed & strong & slugtitleauthormatch \\
% 2273 & fatcat-crossref & strong & slugtitleauthormatch \\
% 1960 & fuzzy & exact & workid \\
% 1150 & fatcat-crossref & exact & titleauthormatch \\
% 1041 & fatcat-pubmed & exact & titleauthormatch \\
% 895 & fuzzy & strong & figshareversion \\
% 317 & fuzzy & strong & titleartifact \\
% 82 & grobid & strong & titleartifact \\
% 33 & crossref & strong & titleartifact \\
% 5 & fuzzy & strong & custombsiundated \\
% 1 & fuzzy & strong & custombsisubdoc \\
% 1 & fatcat & exact & doi \\
\bottomrule
\end{tabular}
\vspace*{2mm}
\caption{Table of match counts (top 25), reference provenance, match status and match reason. The match reason identifiers encode specific rules in the domain-dependent verification process and are included for completeness; we do not include the details of each rule in this report.}
\label{table:matches}
\end{center}
\end{table}

\bibliographystyle{abbrv}
% \bibliographystyle{plainnat}
\bibliography{refs}

\end{document}