2 files changed, 53 insertions, 17 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index 067d829..399f5a2 100644
--- a/docs/Simple/main.pdf
+++ b/docs/Simple/main.pdf
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index 920b3ac..ca26e19 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -1,5 +1,6 @@
 \documentclass[10pt,twocolumn]{article}
 \usepackage{simpleConference}
+\usepackage[utf8]{inputenc}
 \usepackage{times}
 \usepackage{graphicx}
 \usepackage{natbib}
@@ -12,10 +13,11 @@
 
 \usepackage{datetime}
 \providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
+\setlength{\parindent}{0pt}
 
 \begin{document}
 
-\title{Archive Scholar Citation Dataset}
+\title{Archive Scholar Reference Dataset}
 
 \author{Martin Czygan \\
 \\
@@ -38,15 +40,17 @@ bnewbold@archive.org  \\
 
 \begin{abstract}
 As part of its scholarly data efforts, the Internet Archive releases a citation
-graph dataset derived from scholarly publications and additional data sources. It is
-composed of data gathered by the \href{https://fatcat.wiki}{fatcat cataloging project} and related
-web-scale crawls targeting primary and secondary scholarly outputs. In
-addition, relations are worked out between scholarly publications, web pages
-and their archived copies, books from the Open Library project as well as
-Wikipedia articles. This first version of the graph consists of over X nodes
-and over Y edges. We release this dataset under a Z open license under the
-collection at \href{https://archive.org/details/TODO-citation\_graph}{https://archive.org/details/TODO-citation\_graph}, as well as all code
-used for derivation under an MIT license.
+graph dataset (ASREF) derived from scholarly publications and additional data
+sources. It is composed of data gathered by the fatcat cataloging
+project\footnote{\url{https://fatcat.wiki}} and related web-scale crawls
+targeting primary and secondary scholarly outputs. In addition, relations are
+worked out between scholarly publications, web pages and their archived copies,
+books from the Open Library project as well as Wikipedia articles. This first
+version of the graph consists of over X nodes and over Y edges. We release this
+dataset under a Z open license as an archive
+item\footnote{\url{https://archive.org/details/fatcat-asref-todo}}. All code
+used in the derivation process is released under an MIT
+license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
 \end{abstract}
 
 \keywords{Citation Graph, Web Archiving}
@@ -76,8 +80,40 @@ available, marking a tipping point for open citations.
 
 \section{Related Work}
 
+There are a few large-scale citation datasets available today. COCI, the
+``OpenCitations Index of Crossref open DOI-to-DOI citations'', was first
+released on 2018-07-29. As of its most recent release on 2021-07-29, it contains
+1,094,394,688 citations across 65,835,422 bibliographic resources.
+
+The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+``a Wikimedia initiative to develop open citations and linked bibliographic
+data to serve free knowledge'', continuously adds citations to its database and
+as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
+publications\footnote{\url{http://wikicite.org/statistics.html}}.
+
+Microsoft Academic Graph\footnote{A recent copy has been preserved at
+\url{https://archive.org/details/mag-2021-06-07}} is comprised of a number of
+entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
+with PaperReferences being one relation among many others. As of 2021-06-07 the
+PaperReferences relation contains 1,832,226,781 edges across YYY bibliographic
+entities.
+
+TODO: COCI MAG Wikicite Citeseer, Parsecit, Aminer, Semantic Scholar
+
+
 \section{Citation Dataset}
 
+We release the first version of the ASREF dataset in a format used internally
+for storage and display (which we call \emph{biblioref}). The format
+contains source and target fatcat release and work identifiers, a few
+attributes from the metadata (such as year or release stage), and
+information about the match provenance (like match status or reason). For ease
+of use, we include the DOI as well, if available.
+
+The dataset currently contains X unique bibliographic entities and Y citations.
+
+
+
 
 \section{System Design}
 
@@ -103,14 +139,14 @@ Table~\ref{table:fields}.
 \toprule
         \bf{Fields}                                    & \bf{Share} \\
 \midrule
-        \multicolumn{1}{l}{CN  CRN|P|T| U| V| Y}    & 14\%                              \\
+        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y}    & 14\%                              \\
         \multicolumn{1}{l}{DOI}                 & 14\%                              \\
-        \multicolumn{1}{l}{CN|CRN|IS|P|T|U|V|Y} & 5\%                               \\
-        \multicolumn{1}{l}{CN|CRN|DOI|U|V|Y}    & 4\%                               \\
-        \multicolumn{1}{l}{PMID|U}              & 4\%                               \\
-        \multicolumn{1}{l}{CN|CRN|DOI|T|V|Y}    & 4\%                               \\
-        \multicolumn{1}{l}{CN|CRN|Y}            & 4\%                               \\
-        \multicolumn{1}{l}{CN|CRN|DOI|V|Y}      & 4\%                               \\
+        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 5\%                               \\
+        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ U $\cdot$ V $\cdot$ Y}    & 4\%                               \\
+        \multicolumn{1}{l}{PMID $\cdot$ U}              & 4\%                               \\
+        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ T $\cdot$ V $\cdot$ Y}    & 4\%                               \\
+        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ Y}            & 4\%                               \\
+        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ V $\cdot$ Y}      & 4\%                               \\
     \end{tabular}
     \vspace*{2mm}
     \caption{Top 8 combinations of available fields in raw reference data