docs: first round on report review corrections

author: Martin Czygan <martin.czygan@gmail.com> 2021-10-01 18:54:03 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-10-01 18:54:03 +0200
commit: 31cd17cf2a1e5611935cc86dc89a752f581e1a16 (patch)
tree: ea4834767e177a2e6c69e8ed35cac301e0b197d5
parent: 0a8cda0af54255de7bd0dbb029e3959248e3fe95 (diff)
2 files changed, 25 insertions, 24 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index 830f25f..be9bda0 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index 0543612..7ac8e46 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -17,7 +17,7 @@
 
 \begin{document}
 
-\title{Refcat: The Fatcat Citation Graph}
+\title{Refcat: The Internet Archive Scholar Citation Graph}
 
 \author{Martin Czygan \\
 	\\
@@ -39,20 +39,20 @@
 
 
 \begin{abstract}
-	As part of its scholarly data efforts, the Internet Archive releases a
+	As part of its scholarly data efforts, the Internet Archive (IA) releases a
 	first version of a citation graph dataset, named \emph{refcat}, derived
 	from scholarly publications and additional data sources. It is composed of
 	data gathered by the fatcat cataloging
-	project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}}, related
+	project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}} (the catalog that underpins IA Scholar), related
 	web-scale crawls targeting primary and secondary scholarly outputs, as well
 	as metadata from the Open
 	Library\footnote{\href{https://openlibrary.org}{https://openlibrary.org}}
 	project and
 	Wikipedia\footnote{\href{https://wikipedia.org}{https://wikipedia.org}}.
 	This first version of the graph consists of over 1.3B citations. We release
-	this dataset under a CC0 Public Domain Dedication, accessible through an
-	archive
-	item\footnote{\href{https://archive.org/details/refcat\_2021-07-28}{https://archive.org/details/refcat\_2021-07-28}}.
+	this dataset under a CC0 Public Domain Dedication, accessible through
+	Internet
+	Archive\footnote{\href{https://archive.org/details/refcat\_2021-07-28}{https://archive.org/details/refcat\_2021-07-28}}.
 	The source code used for the derivation process, including exact and fuzzy
 	citation matching, is released under an MIT
 	license\footnote{\href{https://gitlab.com/internetarchive/refcat}{https://gitlab.com/internetarchive/refcat}}.
@@ -64,7 +64,7 @@
 
 \section{Introduction}
 
-The Internet Archive releases a first version of a citation graph dataset
+The Internet Archive released a first version of a citation graph dataset
 derived from a corpus of about 2.5B raw references gathered from metadata
 and data obtained by PDF extraction and annotation tools such as
 GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with
@@ -131,10 +131,10 @@ Projects and datasets centered around citations or containing citation data as
 a core component are COCI, the ``OpenCitations Index of Crossref open
 DOI-to-DOI citations'', which was first released
 2018-07-29\footnote{\href{https://opencitations.net/download}{https://opencitations.net/download}}
-and has been regularly updated since~\citep{peroni2020opencitations}. The
+and has been regularly updated~\citep{peroni2020opencitations}. The
 WikiCite\footnote{\href{https://meta.wikimedia.org/wiki/WikiCite}{https://meta.wikimedia.org/wiki/WikiCite}}
 project, ``a Wikimedia initiative to develop open citations and linked
-bibliographic data to serve free knowledge'' continously adds citations to its
+bibliographic data to serve free knowledge'' continuously adds citations to its
 database\footnote{\href{http://wikicite.org/statistics.html}{http://wikicite.org/statistics.html}}.
 Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
 entities\footnote{\href{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
@@ -171,15 +171,16 @@ with \emph{PaperReferences} being one relation among many others.
 
 \section{Dataset}
 
-We release the first version of the \emph{refcat} dataset in a format used
+We released the first version of the \emph{refcat} dataset in a format used
 internally for storage and to serve queries (and which we call \emph{biblioref}
-or \emph{bref} for short). The dataset includes metadata from fatcat, the Open
-Library project and inbound links from the English Wikipedia.  The dataset is
-integrated into the \href{https://fatcat.wiki}{fatcat.wiki website} and allows users
-to explore inbound and outbound
+or \emph{bref} for short). The dataset includes metadata from fatcat (the
+catalog underpinning IA Scholar), the Open Library project and inbound links
+from the English Wikipedia.  The dataset is integrated into the
+\href{https://fatcat.wiki}{fatcat.wiki website} and allows users to explore
+inbound and outbound
 references\footnote{\href{https://guide.fatcat.wiki/reference\_graph.html}{https://guide.fatcat.wiki/reference\_graph.html}}.
 
-The format records source and target (fatcat release and work) identifiers, a
+The format records source and target identifiers, a
 few metadata attributes (such as year or release stage) as well as
 information about the match status and provenance.
 
@@ -196,16 +197,16 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 
 \begin{table}[]
 	\begin{center}
-		\begin{tabular}{ll}
+		\begin{tabular}{lll}
 			\toprule
-			\bf{Set}              & \bf{Count}    \\
+			\bf{Set}              &                 & \bf{Count}    \\
 
 			\midrule
-			COCIv11 (C)           & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
-			\emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
-			C $\cap$ R            & 1,046,438,515 \\
-			C $\setminus$ R       & 140,520,382   \\ %  86,854,309    \\
-			R $\setminus$ C       & 256,985,697   \\ % xxx 295,884,246
+			COCIv11 (C)           &                 & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
+			\emph{refcat-doi} (R) &                 & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
+			C $\cap$ R            & overlap         & 1,046,438,515 \\
+			C $\setminus$ R       & COCIv11 only    & 140,520,382   \\ %  86,854,309    \\
+			R $\setminus$ C       & refcat-doi only & 256,985,697   \\ % xxx 295,884,246
 		\end{tabular}
 		\vspace*{2mm}
 		\caption{Comparison between Open Citations COCI corpus (v11,
@@ -251,7 +252,7 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 \end{table}
 
 We started to include non-traditional citations into the graph, such as links
-to books as recorded by the Open Library project and links from the English
+to books included in Open Library and links from the English
 Wikipedia to scholarly works. For links between Open Library we employ both
 identifier based and fuzzy matching; for Wikipedia references we used a published dataset~\citep{harshdeep_singh_2020_3940692} and we are contributing
 to upstream projects related to wikipedia citation extraction, such as
@@ -329,7 +330,7 @@ Reference data comes from two main sources: explicit bibliographic metadata and
 PDF extraction. The bibliographic metadata is taken from fatcat, which itself
 harvests and imports web accessible sources such as Crossref, Pubmed, Arxiv,
 Datacite, DOAJ, dblp and others into its catalog (as the source permits, data
-is processed continously or in batches). Reference data from PDF documents has
+is processed continuously or in batches). Reference data from PDF documents has
 been extracted with GROBID\footnote{GROBID
 	\href{https://github.com/kermitt2/grobid/releases/tag/0.5.5}{v0.5.5}}, with the
 TEI-XML results being cached locally in a key-value store accessible with an S3