Diffstat (limited to 'docs/Simple')
-rw-r--r-- | docs/Simple/main.pdf | bin | 89394 -> 95379 bytes
-rw-r--r-- | docs/Simple/main.tex | 70
2 files changed, 53 insertions, 17 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
Binary files differ
index 067d829..399f5a2 100644
--- a/docs/Simple/main.pdf
+++ b/docs/Simple/main.pdf
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index 920b3ac..ca26e19 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -1,5 +1,6 @@
 \documentclass[10pt,twocolumn]{article}
 \usepackage{simpleConference}
+\usepackage[utf8]{inputenc}
 \usepackage{times}
 \usepackage{graphicx}
 \usepackage{natbib}
@@ -12,10 +13,11 @@
 \usepackage{datetime}
 \providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
+\setlength{\parindent}{0pt}
 \begin{document}
-\title{Archive Scholar Citation Dataset}
+\title{Archive Scholar Reference Dataset}
 \author{Martin Czygan \\
 \\
@@ -38,15 +40,17 @@ bnewbold@archive.org \\
 \begin{abstract}
 As part of its scholarly data efforts, the Internet Archive releases a citation
-graph dataset derived from scholarly publications and additional data sources. It is
-composed of data gathered by the \href{https://fatcat.wiki}{fatcat cataloging project} and related
-web-scale crawls targeting primary and secondary scholarly outputs. In
-addition, relations are worked out between scholarly publications, web pages
-and their archived copies, books from the Open Library project as well as
-Wikipedia articles. This first version of the graph consists of over X nodes
-and over Y edges. We release this dataset under a Z open license under the
-collection at \href{https://archive.org/details/TODO-citation\_graph}{https://archive.org/details/TODO-citation\_graph}, as well as all code
-used for derivation under an MIT license.
+graph dataset (ASREF) derived from scholarly publications and additional data
+sources. It is composed of data gathered by the fatcat cataloging
+project\footnote{\url{https://fatcat.wiki}} and related web-scale crawls
+targeting primary and secondary scholarly outputs. In addition, relations are
+worked out between scholarly publications, web pages and their archived copies,
+books from the Open Library project as well as Wikipedia articles. This first
+version of the graph consists of over X nodes and over Y edges. We release this
+dataset under a Z open license under the collection as an archive
+item\footnote{\url{https://archive.org/details/fatcat-asref-todo}}. All code
+used in the derivation process is releases under an MIT
+license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
 \end{abstract}
 \keywords{Citation Graph, Web Archiving}
@@ -76,8 +80,40 @@ available, marking a tipping point for open citations.
 \section{Related Work}
+There are a few large scale citation dataset available today. COCI, the
+``OpenCitations Index of Crossref open DOI-to-DOI citations'' was first
+released 2018-07-29. As of its most recent release on 2021-07-29, it contains
+1,094,394,688 citations across 65,835,422 bibliographic resources.
+
+The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+``a Wikimedia initiative to develop open citations and linked bibliographic
+data to serve free knowledge'' continously adds citations to its data base and
+as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
+publications\footnote{\url{http://wikicite.org/statistics.html}}.
+
+Microsoft Academic Graph\footnote{A recent copy has been preserved at
+\url{https://archive.org/details/mag-2021-06-07}} is comprised of a number of
+entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
+with PaperReferences being one relation among many others. As of 2021-06-07 the
+PaperReferences relation contains 1,832,226,781 edges across YYY bibliographic
+entities.
+
+TODO: COCI MAG Wikicite Citeseer, Parsecit, Aminer, Semantic Scholar
+
+
 \section{Citation Dataset}
+We release the first version of the ASREF dataset in an format used internally
+for storage and display (and which we call \emph{biblioref}). The format
+contains source and target fatcat release and work identifiers, as well as few
+attributes from the metadata (such as year or release stage) as well as
+information about the match provenance (like match status or reason). For ease
+of use, we include DOI as well, if available.
+
+The dataset currently contains X unique bibliographic entities and Y citations.
+
+
+
 \section{System Design}
@@ -103,14 +139,14 @@ Table~\ref{table:fields}.
 \toprule
 \bf{Fields} & \bf{Share} \\
 \midrule
- \multicolumn{1}{l}{CN CRN|P|T| U| V| Y} & 14\% \\
+ \multicolumn{1}{l}{CN $\cdot$ RN $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 14\% \\
  \multicolumn{1}{l}{DOI} & 14\% \\
- \multicolumn{1}{l}{CN|CRN|IS|P|T|U|V|Y} & 5\% \\
- \multicolumn{1}{l}{CN|CRN|DOI|U|V|Y} & 4\% \\
- \multicolumn{1}{l}{PMID|U} & 4\% \\
- \multicolumn{1}{l}{CN|CRN|DOI|T|V|Y} & 4\% \\
- \multicolumn{1}{l}{CN|CRN|Y} & 4\% \\
- \multicolumn{1}{l}{CN|CRN|DOI|V|Y} & 4\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 5\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ U $\cdot$ V $\cdot$ Y} & 4\% \\
+ \multicolumn{1}{l}{PMID $\cdot$ U} & 4\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ T $\cdot$ V $\cdot$ Y} & 4\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ Y} & 4\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ V $\cdot$ Y} & 4\% \\
 \end{tabular}
 \vspace*{2mm}
 \caption{Top 8 combinations of available fields in raw reference data