From 4ef80e0424b25b700597a604686c5f794e8b36d2 Mon Sep 17 00:00:00 2001
From: Martin Czygan <martin.czygan@gmail.com>
Date: Thu, 5 Aug 2021 18:22:01 +0200
Subject: wip: paper

---
 docs/Simple/main.pdf | Bin 97848 -> 99828 bytes
 docs/Simple/main.tex | 107 +++++++++++++++++++++++++++++++++------------------
 2 files changed, 70 insertions(+), 37 deletions(-)

(limited to 'docs')

diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index 8fe89a9..469fa95 100644
Binary files a/docs/Simple/main.pdf and b/docs/Simple/main.pdf differ
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index e88e6fd..36f2074 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -42,13 +42,14 @@ bnewbold@archive.org  \\
 As part of its scholarly data efforts, the Internet Archive releases a citation
 graph dataset (ASREF) derived from scholarly publications and additional data
 sources. It is composed of data gathered by the fatcat cataloging
-project\footnote{\url{https://fatcat.wiki}} and related web-scale crawls
-targeting primary and secondary scholarly outputs. In addition, relations are
-worked out between scholarly publications, web pages and their archived copies,
-books from the Open Library project as well as Wikipedia articles. This first
-version of the graph consists of over X nodes and over Y edges. We release this
-dataset under a Z open license under the collection as an archive
-item\footnote{\url{https://archive.org/details/fatcat-asref-todo}}. All code
+project\footnote{\url{https://fatcat.wiki}}, related web-scale crawls targeting
+primary and secondary scholarly outputs, as well as metadata from the Open
+Library\footnote{\url{https://openlibrary.org}} project, information about
+archived web pages found in the Wayback
+Machine\footnote{\url{https://web.archive.org}} and
+Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
+graph consists of 1,323,423,672 citations. We release this dataset under a CC0 Public Domain Dedication, accessible through an
+archive collection\footnote{\url{https://archive.org/details/fatcat-asref-todo}}. All code
 used in the derivation process is releases under an MIT
 license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
 \end{abstract}
@@ -60,9 +61,11 @@ license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
 
 The Internet Archive releases a first version of a citation graph dataset
 derived from a raw corpus of about 2.5B references gathered from metadata and
-from data obtained by PDF extraction tools such as GROBID\cite{lopez2009grobid}.
+from data obtained by PDF extraction tools such as
+GROBID\cite{lopez2009grobid}. Additionally, we consider integration with
+metadata from Open Library, the Wayback Machine and Wikipedia.
 The goal of this report is to describe briefly the current contents and the
-derivation of the Archive Scholar Citations Dataset (ASC). We expect
+derivation of the Archive Scholar Reference Dataset (ASREF). We expect
 this dataset to be iterated upon, with changes both in content and processing.
 
 Modern citation indexes can be traced back to the early computing age, when
@@ -94,9 +97,9 @@ publications\footnote{\url{http://wikicite.org/statistics.html}}.
 Microsoft Academic Graph\footnote{A recent copy has been preserved at
 \url{https://archive.org/details/mag-2021-06-07}} is comprised of a number of
 entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
-with PaperReferences being one relation among many others. As of 2021-06-07 the
-PaperReferences relation contains 1,832,226,781 edges across YYY bibliographic
-entities.
+with \emph{PaperReferences} being one relation among many others. As of 2021-06-07 the
+\emph{PaperReferences} relation contains 1,832,226,781 edges across 123,923,466
+bibliographic entities.
 
 Numerous other projects have been or are concerned with various aspects of
 citation discovery and curation, among them Semantic Scholar, CiteSeerX or
@@ -108,15 +111,45 @@ citations is not expected to shrink in the future.
 
 \section{Citation Dataset}
 
-We release the first version of the ASREF dataset in an format used internally
-for storage and display (and which we call \emph{biblioref}). The format
-contains source and target fatcat release and work identifiers, as well as few
-attributes from the metadata (such as year or release stage) as well as
+We release the first version of the Archive Scholar Reference (ASREF) dataset
+in a format used internally for storage and to serve queries (and which we
+call \emph{biblioref} or \emph{bref} for short). The dataset includes metadata
+from fatcat and the Open Library Project, links to archived pages in
+the Wayback Machine as well as inbound links from the English Wikipedia.
+
+The format contains source and target (fatcat release and work) identifiers, a
+few attributes from the metadata (such as year or release stage) as well as
 information about the match provenance (like match status or reason). For ease
 of use, we include DOI as well, if available.
 
-The dataset currently contains X unique bibliographic entities and Y citations.
+The dataset currently contains 1,323,423,672 citations across 76,327,662
+entities (55,123,635 unique source and 60,244,206 unique target work identifiers).
+The majority of matches (1,250,523,321) are established through
+identifier-based matching (DOI, PMID, PMCID, ARXIV, ISBN). The remaining
+72,900,351 citations are established through fuzzy matching.
+
+The majority of DOI-based matches in ASREF and COCI overlap, as can be
+seen in Table~\ref{table:cocicmp}.
+
+\begin{table}[]
+    \begin{center}
+    \begin{tabular}{ll}
+\toprule
+\bf{Dataset}          & \bf{Count} \\
 
+\midrule
+        COCI (C)        &   1,094,394,688     \\
+        ASREF-DOI (A)   &   1,303,589,144    \\
+        C $\cap$ A      &       \\
+        C $\cup$ A      &       \\
+        C $\setminus$ A &       \\
+        A $\setminus$ C &       \\
+    \end{tabular}
+    \vspace*{2mm}
+    \caption{Comparison between COCI and ASREF-DOI, a subset of ASREF with DOI.}
+     \label{table:cocicmp}
+    \end{center}
+\end{table}
 
 TODO: how matches are established and a short note on overlap with COCI DOI.
 
@@ -144,22 +177,22 @@ Table~\ref{table:fields}.
     \begin{center}
     \begin{tabular}{ll}
 \toprule
-        \bf{Fields}                                    & \bf{Share} \\
+        \bf{Fields}                                    & \bf{Percentage} \\
 \midrule
     \multicolumn{1}{l}{CN $\cdot$ RN $\cdot$ P $\cdot$ T $\cdot$  U $\cdot$  V $\cdot$ Y}    & 14\%                              \\
-        \multicolumn{1}{l}{DOI}                 & 14\%                              \\
+    \multicolumn{1}{l}{\textbf{DOI}}                 & 14\%                              \\
         \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ IS $\cdot$ P $\cdot$ T $\cdot$ U $\cdot$ V $\cdot$ Y} & 5\%                               \\
-        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ U $\cdot$ V $\cdot$ Y}    & 4\%                               \\
-        \multicolumn{1}{l}{PMID $\cdot$ U}              & 4\%                               \\
-        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ T $\cdot$ V $\cdot$ Y}    & 4\%                               \\
+        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ U $\cdot$ V $\cdot$ Y}    & 4\%                               \\
+        \multicolumn{1}{l}{\textbf{PMID} $\cdot$ U}              & 4\%                               \\
+        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ T $\cdot$ V $\cdot$ Y}    & 4\%                               \\
         \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ Y}            & 4\%                               \\
-        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ DOI $\cdot$ V $\cdot$ Y}      & 4\%                               \\
+        \multicolumn{1}{l}{CN $\cdot$ CRN $\cdot$ \textbf{DOI} $\cdot$ V $\cdot$ Y}      & 4\%                               \\
     \end{tabular}
     \vspace*{2mm}
     \caption{Top 8 combinations of available fields in raw reference data
         accounting for about 53\% of the total data (CN = container name, CRN =
 contrib raw name, P = pages, T = title, U = unstructured, V = volume, IS =
-issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any value.}
+issue, Y = year, DOI = doi, PMID = pmid). Unstructured fields may contain any value. Identifiers emphasized.}
     \label{table:fields}
 \end{center}
 \end{table}
@@ -189,10 +222,6 @@ candidate generation phase in order to improve recall, but we are strict during
 verification, in order to control precision.
 
 
-\section{Fuzzy Matching Approach}
-\section{Quality Assurance}
-
-
 \section{Future Work}
 
 As other dataset in this field we expect this dataset to be iterated upon.
@@ -200,23 +229,27 @@ As other dataset in this field we expect this dataset to be iterated upon.
 \begin{itemize}
     \item The fatcat catalog updates its metadata
         continously\footnote{A changelog can currenly be followed here:
-        \url{fatcat.wiki/changelog}} and web crawls are conducted regularly.
-        Current processing pipelines cover raw reference snapshot creation and
-        derivation the graph structure.
+        \url{https://fatcat.wiki/changelog}} and web crawls are conducted
+        regularly.  Current processing pipelines cover raw reference snapshot
+        creation and derivation of the graph structure, which allows us to rerun
+        processing based on updated data as it becomes available.
 
-    \item Metadata extraction from PDFs depends on machine learning
-        models, which in turn depend training sets. With additional crawls and
+    \item Metadata extraction from PDFs depends on supervised machine learning
+        models, which in turn depend on training sets. With additional crawls and
         metadata available we hope to improve models used for metadata
-        extraction, reducing data extraction artifacts in the process.
+        extraction, improving yield and reducing data extraction artifacts in
+        the process.
 
-    \item As of this version, a significant number of raw reference
-        docs remain unmatched, which means that neither exact or fuzzy matching
-        can recover a link to a known entity. On the one
+    \item As of this version, a number of raw reference
+        docs remain unmatched, which means that neither exact nor fuzzy matching
+        can detect a link to a known entity. On the one
         hand, this can hint at missing metadata. However, parts of the data
         will contain a reference to a catalogued entity, but in a specific,
         dense and harder to recover form.
+        This also includes improvements to the fuzzy matching code.
     \end{itemize}
 
+
 \section{Acknowledgements}
 
 This work is partially supported by a grant from the \emph{Andrew W. Mellon
-- 
cgit v1.2.3