\documentclass[10pt,twocolumn]{article}
\usepackage{simpleConference}
\usepackage{times}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{doi}
\usepackage{amssymb}
\usepackage{url,hyperref}
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{datetime}

\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}

\begin{document}

\title{Archive Scholar Citation Dataset}

\author{Martin Czygan \\ \\ Internet Archive \\ San Francisco, California, USA \\ martin@archive.org \\ \and Bryan Newbold \\ \\ Internet Archive \\ San Francisco, California, USA \\ bnewbold@archive.org \\ \\ }

\maketitle
\thispagestyle{empty}

\begin{abstract}
As part of its scholarly data efforts, the Internet Archive releases a citation graph dataset derived from scholarly publications and additional data sources. It is composed of data gathered by the \href{https://fatcat.wiki}{fatcat cataloging project} and related web-scale crawls targeting primary and secondary scholarly outputs. In addition, relations are worked out between scholarly publications, web pages and their archived copies, books from the Open Library project as well as Wikipedia articles. This first version of the graph consists of over X nodes and over Y edges. We release this dataset under a Z open license under the collection at \href{https://archive.org/details/TODO-citation\_graph}{https://archive.org/details/TODO-citation\_graph}, as well as all code used for derivation under an MIT license.
\end{abstract}

\keywords{Citation Graph, Web Archiving}

\section{Introduction}

The Internet Archive releases a first version of a citation graph dataset derived from a raw corpus of about 2.5B references gathered from metadata and from data obtained by PDF extraction tools such as GROBID\cite{lopez2009grobid}.
The goal of this report is to briefly describe the current contents and the derivation of the Archive Scholar Citations Dataset (ASC). We expect this dataset to be iterated upon, with changes both in content and processing.
Modern citation indexes can be traced back to the early computing age, when projects like the Science Citation Index (1955)\citep{garfield2007evolution} were first devised, which live on in commercial knowledge bases today. Open alternatives were started, such as the Open Citations Corpus (OCC) in 2010, the first version of which contained 6,325,178 individual references\citep{shotton2013publishing}. Other notable sources from that time include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last decade has seen an increase in openly available reference datasets and citation projects, like Microsoft Academic\citep{sinha2015overview} and the Initiative for Open Citations\citep{i4oc,shotton2018funders}. In 2021, according to \citep{hutchins2021tipping}, over 1B citations are publicly available, marking a tipping point for open citations.

\section{Related Work}

\section{Citation Dataset}

\section{System Design}

The constraints for the system design are informed by the volume and the variety of the data. In total, the raw inputs amount to a few TB of textual content, mostly newline-delimited JSON. More importantly, while the number of data fields is low, certain schemas are very partial, with hundreds of different combinations of available field values found in the raw reference data. This is most likely caused by aggregators passing on reference data from hundreds of sources, which do not necessarily agree on a common granularity for citation data, and by artifacts of machine-learning-based structured data extraction tools. Each combination of fields may require a slightly different processing path.
For example, references with an arXiv identifier can be processed differently from references with only a title. Over 50\% of the raw reference data comes from a set of eight field combinations, as listed in Table~\ref{table:fields}.

\begin{table}[]
\begin{center}
\begin{tabular}{ll}
\toprule
\textbf{Fields} & \textbf{Share} \\
\midrule
\multicolumn{1}{l}{CN|CRN|P|T|U|V|Y} & 14\% \\
\multicolumn{1}{l}{DOI} & 14\% \\
\multicolumn{1}{l}{CN|CRN|IS|P|T|U|V|Y} & 5\% \\
\multicolumn{1}{l}{CN|CRN|DOI|U|V|Y} & 4\% \\
\multicolumn{1}{l}{PMID|U} & 4\% \\
\multicolumn{1}{l}{CN|CRN|DOI|T|V|Y} & 4\% \\
\multicolumn{1}{l}{CN|CRN|Y} & 4\% \\
\multicolumn{1}{l}{CN|CRN|DOI|V|Y} & 4\% \\
\bottomrule
\end{tabular}
\vspace*{2mm}
\caption{Top 8 combinations of available fields in raw reference data, accounting for about 53\% of the total data (CN = container name, CRN = contrib raw name, P = pages, T = title, U = unstructured, V = volume, IS = issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any value.}
\label{table:fields}
\end{center}
\end{table}

Overall, a map-reduce style approach is followed, which allows for some uniformity in the overall processing. We extract (key, document) tuples (as TSV) from the raw JSON data and sort by key. Then we group documents with the same key and apply a function to each group in order to generate our target schema (currently named biblioref, or bref for short) or perform additional operations (such as deduplication). The key derivation can be exact (using an identifier such as DOI or PMID) or based on a normalization procedure, like a slugified title string. For identifier-based matches we can generate the target biblioref schema directly. For fuzzy matching candidates, we pass possible match pairs through a verification procedure, which is implemented for release entity schema pairs.
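
The grouping and matching steps described above can be summarized in a simplified sketch. The symbols $D$, $k$, $G_\kappa$, $v$ and $M$ are shorthand introduced here for illustration only and do not correspond to names in the code. Let $D$ be the set of raw documents and $k$ a key derivation function, which is either exact (an identifier) or normalizing (for example a slugified title). Sorting and grouping by key yields, for each key value $\kappa$, the group
\[
G_\kappa = \{\, d \in D : k(d) = \kappa \,\}.
\]
For an exact key, every group can be turned into biblioref documents directly. For a normalized key, a group only provides match candidates, and under this sketch the accepted matches are the candidate pairs that pass a verification predicate $v$:
\[
M = \bigcup_{\kappa} \{\, (a, b) : a, b \in G_\kappa,\ a \neq b,\ v(a, b) = \mathrm{true} \,\}.
\]
In this reading, the choice of $k$ determines how many candidates are generated, while $v$ decides which of them are kept.
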
The current verification procedure is a domain-dependent, rule-based verification, able to identify different versions of a publication, preprint-published pairs, or other kinds of similar documents by calculating similarity metrics across title and authors. The fuzzy matching approach is applied to all reference documents that have only a title but no identifier. With a few schema conversions, fuzzy matching can be applied to Wikipedia articles and Open Library (edition) records as well. The aspects of precision and recall are represented by the two stages: we are generous in the match candidate generation phase in order to improve recall, but we are strict during verification in order to control precision.

\section{Fuzzy Matching Approach}

\section{Quality Assurance}

\section{Future Work}

\section{Acknowledgements}

\section{Citations} \section{Appendix A} \bibliographystyle{abbrv} \bibliography{refs} \end{document}