author Martin Czygan <martin.czygan@gmail.com> 2021-08-05 14:53:59 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-08-05 14:53:59 +0200
commit 7991ef04e4a21fd680bec71c04cca4d47e651ecd
tree c72595f936bee64f7220b3162b9d5443f4b35ec9
parent af34fb3565f50ff4d034731aca4489c6690a8197
wip: paper
-rw-r--r-- docs/Simple/main.pdf | bin 89394 -> 95379 bytes
-rw-r--r-- docs/Simple/main.tex | 70
2 files changed, 53 insertions, 17 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index 067d829..399f5a2 100644
--- a/docs/Simple/main.pdf
+++ b/docs/Simple/main.pdf
Binary files differ
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index 920b3ac..ca26e19 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -1,5 +1,6 @@
\documentclass[10pt,twocolumn]{article}
\usepackage{simpleConference}
+\usepackage[utf8]{inputenc}
\usepackage{times}
\usepackage{graphicx}
\usepackage{natbib}
@@ -12,10 +13,11 @@
\usepackage{datetime}
\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
+\setlength{\parindent}{0pt}
\begin{document}
-\title{Archive Scholar Citation Dataset}
+\title{Archive Scholar Reference Dataset}
\author{Martin Czygan \\
\\
@@ -38,15 +40,17 @@ bnewbold@archive.org \\
\begin{abstract}
As part of its scholarly data efforts, the Internet Archive releases a citation
-graph dataset derived from scholarly publications and additional data sources. It is
-composed of data gathered by the \href{https://fatcat.wiki}{fatcat cataloging project} and related
-web-scale crawls targeting primary and secondary scholarly outputs. In
-addition, relations are worked out between scholarly publications, web pages
-and their archived copies, books from the Open Library project as well as
-Wikipedia articles. This first version of the graph consists of over X nodes
-and over Y edges. We release this dataset under a Z open license under the
-collection at \href{https://archive.org/details/TODO-citation\_graph}{https://archive.org/details/TODO-citation\_graph}, as well as all code
-used for derivation under an MIT license.
+graph dataset (ASREF) derived from scholarly publications and additional data
+sources. It is composed of data gathered by the fatcat cataloging
+project\footnote{\url{https://fatcat.wiki}} and related web-scale crawls
+targeting primary and secondary scholarly outputs. In addition, relations are
+worked out between scholarly publications, web pages and their archived copies,
+books from the Open Library project as well as Wikipedia articles. This first
+version of the graph consists of over X nodes and over Y edges. We release this
+dataset under a Z open license as an archive
+item\footnote{\url{https://archive.org/details/fatcat-asref-todo}}. All code
+used in the derivation process is released under an MIT
+license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
\end{abstract}
\keywords{Citation Graph, Web Archiving}
@@ -76,8 +80,40 @@ available, marking a tipping point for open citations.
\section{Related Work}
+There are a few large-scale citation datasets available today. COCI, the
+``OpenCitations Index of Crossref open DOI-to-DOI citations'', was first
+released on 2018-07-29. As of its most recent release on 2021-07-29, it contains
+1,094,394,688 citations across 65,835,422 bibliographic resources.
+
+The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+``a Wikimedia initiative to develop open citations and linked bibliographic
+data to serve free knowledge'', continuously adds citations to its database and
+as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
+publications\footnote{\url{http://wikicite.org/statistics.html}}.
+
+Microsoft Academic Graph\footnote{A recent copy has been preserved at
+\url{https://archive.org/details/mag-2021-06-07}} comprises a number of
+entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
+with PaperReferences being one relation among many others. As of 2021-06-07, the
+PaperReferences relation contains 1,832,226,781 edges across YYY bibliographic
+entities.
+
+TODO: COCI MAG Wikicite Citeseer, Parsecit, Aminer, Semantic Scholar
+
+
\section{Citation Dataset}
+We release the first version of the ASREF dataset in a format used internally
+for storage and display (which we call \emph{biblioref}). The format
+contains source and target fatcat release and work identifiers, a few
+attributes from the metadata (such as year or release stage), and
+information about the match provenance (such as match status or reason). For ease
+of use, we include the DOI as well, if available.
+
+The dataset currently contains X unique bibliographic entities and Y citations.
+
+
+
\section{System Design}
@@ -103,14 +139,14 @@ Table~\ref{table:fields}.
\toprule
\bf{Fields} & \bf{Share} \\
\midrule
- \multicolumn{1}{l}{CN CRN|P|T| U| V| Y} & 14\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 14\% \\
\multicolumn{1}{l}{DOI} & 14\% \\
- \multicolumn{1}{l}{CN|CRN|IS|P|T|U|V|Y} & 5\% \\
- \multicolumn{1}{l}{CN|CRN|DOI|U|V|Y} & 4\% \\
- \multicolumn{1}{l}{PMID|U} & 4\% \\
- \multicolumn{1}{l}{CN|CRN|DOI|T|V|Y} & 4\% \\
- \multicolumn{1}{l}{CN|CRN|Y} & 4\% \\
- \multicolumn{1}{l}{CN|CRN|DOI|V|Y} & 4\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 5\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ U $\cdot$ V $\cdot$ Y} & 4\% \\
+ \multicolumn{1}{l}{PMID $\cdot$ U} & 4\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ T $\cdot$ V $\cdot$ Y} & 4\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ Y} & 4\% \\
+ \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ V $\cdot$ Y} & 4\% \\
\end{tabular}
\vspace*{2mm}
\caption{Top 8 combinations of available fields in raw reference data