author     Martin Czygan <martin.czygan@gmail.com>  2021-08-08 15:18:29 +0200
committer  Martin Czygan <martin.czygan@gmail.com>  2021-08-08 15:18:29 +0200
commit     bd66b58cded2c2c7e7b7e5d374434d6531dd70de (patch)
tree       00417812b9787ab4492e2c590fcf1bf6f4b576e7 /docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
parent     bb64b3aa62267676302e75f0ca44157b514beec4 (diff)
download   refcat-bd66b58cded2c2c7e7b7e5d374434d6531dd70de.tar.gz  refcat-bd66b58cded2c2c7e7b7e5d374434d6531dd70de.zip
docs: cleanup and naming
Diffstat (limited to 'docs/TR-20210808100000-IA-WDS-REFCAT/main.tex')
-rw-r--r--  docs/TR-20210808100000-IA-WDS-REFCAT/main.tex  362
1 file changed, 362 insertions, 0 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
new file mode 100644
index 0000000..e4febd9
--- /dev/null
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -0,0 +1,362 @@
\documentclass[hidelinks,10pt,twocolumn]{article}
\usepackage{simpleConference}
\usepackage[utf8]{inputenc}
\usepackage{times}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{doi}
\usepackage{amssymb}
\usepackage{url,hyperref}
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{caption}

\usepackage{datetime}
\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
\setlength{\parindent}{0pt}

\begin{document}

\title{Fatcat Reference Dataset}

\author{Martin Czygan \\
  \\
  Internet Archive \\
  San Francisco, California, USA \\
  martin@archive.org \\
  \and
  Bryan Newbold \\
  \\
  Internet Archive \\
  San Francisco, California, USA \\
  bnewbold@archive.org \\
  \\
}


\maketitle
\thispagestyle{empty}


\begin{abstract}
    As part of its scholarly data efforts, the Internet Archive releases a first
    version of a citation graph dataset, named \emph{refcat}, derived from
    scholarly publications and additional data sources. It is composed of data
    gathered by the fatcat cataloging
    project\footnote{\url{https://fatcat.wiki}}, related web-scale crawls
    targeting primary and secondary scholarly outputs, as well as metadata from
    the Open Library\footnote{\url{https://openlibrary.org}} project and
    Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
    graph consists of 1,323,423,672 citations. We release this dataset under a
    CC0 Public Domain Dedication, accessible through an archive
    item\footnote{\url{https://archive.org/details/refcat_2021-07-28}}. All
    code used in the derivation process is released under an MIT
    license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
\end{abstract}

\keywords{Citation Graph, Web Archiving}

\section{Introduction}


The Internet Archive releases a first version of a citation graph dataset
derived from a raw corpus of about 2.5B references gathered from metadata and
from data obtained by PDF extraction tools such as
GROBID\citep{lopez2009grobid}. Additionally, we consider integration with
metadata from Open Library and Wikipedia. The goal of this report is to briefly
describe the current contents and the derivation of the dataset. We expect
this dataset to be iterated upon, with changes both in content and processing.

Modern citation indexes can be traced back to the early computing age, when
projects like the Science Citation Index (1955)\citep{garfield2007evolution}
were first devised; they live on in commercial knowledge bases today. Open
alternatives such as the Open Citations Corpus (OCC) were started in 2010; its
first version contained 6,325,178 individual
references\citep{shotton2013publishing}. Other notable early projects
include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last
decade has seen the emergence of more openly available, large scale
citation projects like Microsoft Academic\citep{sinha2015overview} or the
Initiative for Open Citations\citep{i4oc,shotton2018funders}. In 2021,
according to \citep{hutchins2021tipping}, over 1B citations are publicly
available, marking a tipping point for this category of data.

\section{Related Work}

There are a few large scale citation datasets available today. COCI, the
``OpenCitations Index of Crossref open DOI-to-DOI citations'', was first
released on 2018-07-29. As of its most recent
release\footnote{\url{https://opencitations.net/download}}, on 2021-07-29, it
contains 1,094,394,688 citations across 65,835,422 bibliographic
resources\citep{peroni2020opencitations}.

The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
``a Wikimedia initiative to develop open citations and linked bibliographic
data to serve free knowledge'', continuously adds citations to its database and
as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
publications\footnote{\url{http://wikicite.org/statistics.html}}.

Microsoft Academic Graph\citep{sinha2015overview} comprises a number of
entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
with \emph{PaperReferences} being one relation among many others. As of
2021-06-07\footnote{A recent copy has been preserved at
  \url{https://archive.org/details/mag-2021-06-07}} the
\emph{PaperReferences} relation contains 1,832,226,781 rows (edges) across
123,923,466 bibliographic entities.

Numerous other projects have been or are concerned with various aspects of
citation discovery and curation as part of their feature set, among them
Semantic Scholar\citep{fricke2018semantic}, CiteSeerX\citep{li2006citeseerx}
and Aminer\citep{tang2016aminer}.

As mentioned in \citep{hutchins2021tipping}, the number of openly available
citations is not expected to shrink in the future.


\section{Dataset}

We release the first version of the \emph{refcat} dataset in a format used
internally for storage and for serving queries, which we call \emph{biblioref}
or \emph{bref} for short. The dataset includes metadata from fatcat, the
Open Library project and inbound links from the English Wikipedia. The fatcat
project itself aggregates data from a variety of open data sources, such as
Crossref\citep{crossref}, PubMed\citep{canese2013pubmed},
DataCite\citep{brase2009datacite}, DOAJ\citep{doaj}, dblp\citep{ley2002dblp} and others,
as well as metadata generated from analysis of data preserved at the Internet
Archive and active crawls of publication sites on the web.

The dataset is
integrated into the \href{https://fatcat.wiki}{fatcat website} and allows users
to explore inbound and outbound references\citep{fatcatguidereferencegraph}.

The format records source and target (fatcat release and work) identifiers, a
few attributes from the metadata (such as year or release stage), as well as
information about the match status and provenance.

The dataset currently contains 1,323,423,672 citations across 76,327,662
entities (55,123,635 unique source and 60,244,206 unique target work
identifiers); for 1,303,424,212 citations, or 98.49\% of the total, we have a
DOI for both source and target. The majority of matches, 1,250,523,321, are
established through identifier based matching (DOI, PMID, PMCID, ARXIV, ISBN);
72,900,351 citations are established through fuzzy matching techniques.

The majority of citations between \emph{refcat} and COCI overlap, as can be
seen in~Table~\ref{table:cocicmp}.
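
To make the \emph{bref} format description above more concrete, the following
sketch shows roughly what a single record could look like. This is an
illustration only: the field names and values are assumptions made for this
example and do not necessarily match the released schema exactly.

\begin{verbatim}
# Illustrative sketch of a single "bref"
# record; field names are assumptions, not
# the released schema.
example_bref = {
    "source_release_ident": "rel-aaaa",
    "source_work_ident": "work-aaaa",
    "target_release_ident": "rel-bbbb",
    "target_work_ident": "work-bbbb",
    "source_year": 2018,
    "source_release_stage": "published",
    "match_provenance": "crossref",
    "match_status": "exact",
    "match_reason": "doi",
}
\end{verbatim}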

\begin{table}[]
  \begin{center}
  \begin{tabular}{ll}
  \toprule
  \textbf{Set} & \textbf{Count} \\
  \midrule
  COCI (C)              & 1,094,394,688 \\
  \emph{refcat-doi} (R) & 1,303,424,212 \\
  C $\cap$ R            & 1,007,539,966 \\
  C $\setminus$ R       & 86,854,309 \\
  R $\setminus$ C       & 295,884,246 \\
  \bottomrule
  \end{tabular}
  \vspace*{2mm}
  \caption{Comparison between COCI and \emph{refcat-doi}, a subset of
  \emph{refcat} where entities have a known DOI. At least 50\% of the
  295,884,246 references found only in \emph{refcat-doi} come from links
  recorded within a specific dataset provider (GBIF, DOI prefix:
  10.15468).}
  \label{table:cocicmp}
  \end{center}
\end{table}

% zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst
% zstdcat -T0 uniq_34.tsv.zst | pv -l | LC_ALL=C cut -f3,4 | zstd -c -T0 > uniq_34_doi.tsv.zst
% find . -name "*.csv" | parallel -j 16 "LC_ALL=C grep -v ^oci, {} | LC_ALL=C cut -d, -f2,3" | pv -l | zstd -c -T0 > ../6741422v10_doi_only.csv.zst


\section{System Design}

The constraints for the system design are informed by the volume and the
variety of the data. The capability to run the whole graph derivation on a
single machine was a minor goal as well. In total, the raw inputs amount to a
few terabytes of textual content, mostly newline delimited JSON. More
importantly, while the number of data fields is low, many records are only
partially populated, with hundreds of different combinations of available
field values found in the raw reference data. This is most likely caused by
aggregators passing on reference data from hundreds of sources, which do not
necessarily agree on a common granularity for citation data, as well as by
artifacts of machine learning based structured data extraction tools.

Each combination of fields may require a slightly different processing path.
For example, references with an arXiv identifier can be processed differently
from references with only a title. Over 50\% of the raw reference data falls
into one of eight field combinations, as listed in
Table~\ref{table:fields}.

\begin{table}[]
  \begin{center}
    \begin{tabular}{ll}
    \toprule
    \textbf{Fields} & \textbf{Percentage} \\
    \midrule
    \multicolumn{1}{l}{CN $\cdot$ RN $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 14\% \\
    \multicolumn{1}{l}{\textbf{DOI}} & 14\% \\
    \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 5\% \\
    \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ U $\cdot$ V $\cdot$ Y} & 4\% \\
    \multicolumn{1}{l}{\textbf{PMID} $\cdot$ U} & 4\% \\
    \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ T $\cdot$ V $\cdot$ Y} & 4\% \\
    \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ Y} & 4\% \\
    \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ V $\cdot$ Y} & 4\% \\
    \bottomrule
    \end{tabular}
    \vspace*{2mm}
    \caption{Top 8 combinations of available fields in raw reference data,
    accounting for about 53\% of the total data (CN = container name, CRN =
    contrib raw name, P = pages, T = title, U = unstructured, V = volume, IS =
    issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain
    any value. Identifiers are set in bold.}
    \label{table:fields}
  \end{center}
\end{table}

Overall, a map-reduce style\citep{dean2010mapreduce} approach is
followed\footnote{While the operations are similar, the processing is not
  distributed but runs on a single machine. For space efficiency,
  zstd\citep{collet2018zstandard} is used to compress raw data and
  derivations.}, which allows for some uniformity in the overall processing.
We extract (key, document) tuples (as TSV) from the raw JSON data and sort by
key. We then group documents with the same key and apply a function on each
group in order to generate our target schema or perform additional operations
such as deduplication or fusion of matched and unmatched references.
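
The following is a minimal, self-contained sketch of this sort and group step,
using Python for illustration; the actual pipeline operates on compressed TSV
files with a set of custom tools, which are not shown here.

\begin{verbatim}
import itertools

# Toy stand-ins for extracted reference
# documents.
docs = [
    {"doi": "10.123/abc", "title": "A"},
    {"doi": "10.123/abc", "title": "A, v2"},
    {"doi": "10.456/xyz", "title": "B"},
]

def key(doc):
    # Exact key derivation via an identifier.
    return doc["doi"]

# Map-reduce style: sort by key, group, then
# apply a function to each group (here, a
# trivial deduplication).
docs.sort(key=key)
deduped = []
for k, group in itertools.groupby(docs, key=key):
    deduped.append(next(group))

print(len(deduped))  # -> 2
\end{verbatim}

Sorting by a derived key keeps memory usage bounded, since only one group of
documents needs to be held at a time, which fits the single machine constraint
mentioned above.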

The key derivation can be exact (via an identifier like DOI, PMID, etc.) or
based on a value normalization, like slugifying a title string. For identifier
based matches we can generate the target schema directly. For fuzzy matching
candidates, we pass possible match pairs through a verification procedure,
which is implemented for \emph{release
entity}\footnote{\url{https://guide.fatcat.wiki/entity_release.html}.} pairs.
This procedure is a domain dependent, rule based verification, able to identify
different versions of a publication, preprint-published pairs, and documents
that are similar according to various metrics calculated over title and author
fields. The fuzzy matching approach is applied to all reference documents
without an identifier (a title is currently required).

With a few schema conversions, fuzzy matching can be applied to Wikipedia
articles and Open Library (edition) records as well. Precision and recall
concerns are split across the two stages: we are generous during match
candidate generation in order to improve recall, but strict during
verification, in order to control precision. Quality assurance for verification
is implemented through a growing list of test cases of real examples from the
catalog and their expected or desired match status\footnote{The list can be
  found under:
  \url{https://gitlab.com/internetarchive/cgraph/-/blob/master/skate/testdata/verify.csv}.
  It is helpful to keep this test suite independent of any specific programming
  language.}.
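
As an illustration of the two stages, the sketch below shows a deliberately
simplified slug based key derivation and a verification check over title and
author fields. The normalization and rules used in the actual verification
procedure are more involved; this example only conveys the general shape of the
approach.

\begin{verbatim}
import re
import unicodedata

def slug(title):
    # Simplified normalization: strip accents,
    # lowercase, keep only letters and digits.
    t = unicodedata.normalize("NFKD", title)
    t = t.encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]", "", t.lower())

def plausible(a, b):
    # Simplified verification: identical title
    # slug plus overlapping author name tokens.
    if slug(a["title"]) != slug(b["title"]):
        return False
    ta = set(a["authors"].lower().split())
    tb = set(b["authors"].lower().split())
    return bool(ta & tb)

ref = {"title": "On Citations!",
       "authors": "J. Doe"}
rel = {"title": "On citations",
       "authors": "Jane Doe"}
print(plausible(ref, rel))  # -> True
\end{verbatim}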

\section{Limitations and Future Work}

As with other datasets in this field, we expect this dataset to be iterated
upon.

\begin{itemize}
    \item The fatcat catalog updates its metadata
      continuously\footnote{A changelog can currently be followed here:
      \url{https://fatcat.wiki/changelog}} and web crawls are conducted
      regularly. Current processing pipelines cover raw reference snapshot
      creation and derivation of the graph structure, which allows us to rerun
      processing as updated data becomes available.

    \item Metadata extraction from PDFs depends on supervised machine learning
      models, which in turn depend on available training datasets. With
      additional crawls and metadata becoming available, we hope to improve the
      models used for metadata extraction, improving yield and reducing data
      extraction artifacts in the process.

    \item As of this version, a number of raw reference documents remain
      unmatched, which means that neither exact nor fuzzy matching has detected
      a link to a known entity. This can hint at missing metadata; however,
      parts of the data will contain a reference to a catalogued entity, just
      in a specific, dense and harder to recover form. Addressing these cases
      will also require improvements to the fuzzy matching approach.

    \item The reference dataset contains millions of URLs, and their
      integration into the graph has been implemented as a prototype. A full
      implementation requires a few data cleanup and normalization steps.
\end{itemize}

\section{Acknowledgements}

This work is partially supported by a grant from the \emph{Andrew W. Mellon
  Foundation}.


\section{Appendix A}


A note on data quality: while we implement various data quality measures,
real-world data, especially data coming from many different sources, will
contain issues. Among other measures, we keep track of match reasons,
especially for fuzzy matching, to be able to zoom in on systematic errors
more easily (see~Table~\ref{table:matches}).
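
As a sketch of how such a breakdown can be produced, the following example
tallies (provenance, status, reason) combinations. The records are toy
stand-ins, assuming each released citation record carries these three fields
(as in the illustrative record shown earlier); the real accounting is done over
the full dataset.

\begin{verbatim}
import collections

# Tally (provenance, status, reason) tuples to
# spot systematic matching errors.
records = [
    {"provenance": "crossref",
     "status": "exact", "reason": "doi"},
    {"provenance": "fuzzy",
     "status": "strong",
     "reason": "jaccardauthors"},
    {"provenance": "crossref",
     "status": "exact", "reason": "doi"},
]

counts = collections.Counter(
    (r["provenance"], r["status"], r["reason"])
    for r in records)

for combo, n in counts.most_common():
    print(n, *combo)
\end{verbatim}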

\begin{table}[]
  \footnotesize
  \captionsetup{font=normalsize}
  \begin{center}
    \begin{tabular}{@{}rlll@{}}
    \toprule
    \textbf{Count} & \textbf{Provenance} & \textbf{Status} & \textbf{Reason} \\ \midrule
    934932865 & crossref & exact & doi \\
    151366108 & fatcat-datacite & exact & doi \\
    65345275 & fatcat-pubmed & exact & pmid \\
    48778607 & fuzzy & strong & jaccardauthors \\
    42465250 & grobid & exact & doi \\
    29197902 & fatcat-pubmed & exact & doi \\
    19996327 & fatcat-crossref & exact & doi \\
    11996694 & fuzzy & strong & slugtitleauthormatch \\
    9157498 & fuzzy & strong & tokenizedauthors \\
    3547594 & grobid & exact & arxiv \\
    2310025 & fuzzy & exact & titleauthormatch \\
    1496515 & grobid & exact & pmid \\
    680722 & crossref & strong & jaccardauthors \\
    476331 & fuzzy & strong & versioneddoi \\
    449271 & grobid & exact & isbn \\
    230645 & fatcat-crossref & strong & jaccardauthors \\
    190578 & grobid & strong & jaccardauthors \\
    156657 & crossref & exact & isbn \\
    123681 & fatcat-pubmed & strong & jaccardauthors \\
    79328 & crossref & exact & arxiv \\
    57414 & crossref & strong & tokenizedauthors \\
    53480 & fuzzy & strong & pmiddoipair \\
    52453 & fuzzy & strong & dataciterelatedid \\
    47119 & grobid & strong & slugtitleauthormatch \\
    36774 & fuzzy & strong & arxivversion \\
    % 35311 & fuzzy & strong & customieeearxiv \\
    % 33863 & grobid & exact & pmcid \\
    % 23504 & crossref & strong & slugtitleauthormatch \\
    % 22753 & fatcat-crossref & strong & tokenizedauthors \\
    % 17720 & grobid & exact & titleauthormatch \\
    % 14656 & crossref & exact & titleauthormatch \\
    % 14438 & grobid & strong & tokenizedauthors \\
    % 7682 & fatcat-crossref & exact & arxiv \\
    % 5972 & fatcat-crossref & exact & isbn \\
    % 5525 & fatcat-pubmed & exact & arxiv \\
    % 4290 & fatcat-pubmed & strong & tokenizedauthors \\
    % 2745 & fatcat-pubmed & exact & isbn \\
    % 2342 & fatcat-pubmed & strong & slugtitleauthormatch \\
    % 2273 & fatcat-crossref & strong & slugtitleauthormatch \\
    % 1960 & fuzzy & exact & workid \\
    % 1150 & fatcat-crossref & exact & titleauthormatch \\
    % 1041 & fatcat-pubmed & exact & titleauthormatch \\
    % 895 & fuzzy & strong & figshareversion \\
    % 317 & fuzzy & strong & titleartifact \\
    % 82 & grobid & strong & titleartifact \\
    % 33 & crossref & strong & titleartifact \\
    % 5 & fuzzy & strong & custombsiundated \\
    % 1 & fuzzy & strong & custombsisubdoc \\
    % 1 & fatcat & exact & doi \\
    \bottomrule
    \end{tabular}
    \vspace*{2mm}
    \caption{Match counts (top 25) by reference provenance, match status and
    match reason. The match reason identifiers encode specific rules in the
    domain dependent verification process and are included for completeness;
    we do not describe each rule in detail in this report.}
    \label{table:matches}
  \end{center}
\end{table}

\bibliographystyle{abbrv}
% \bibliographystyle{plainnat}
\bibliography{refs}
\end{document}