doc: update tables in tr

author: Martin Czygan <martin.czygan@gmail.com> 2021-09-10 18:47:42 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-09-10 18:47:42 +0200
commit: dd939b5d8ca7e7ad5fdd2022cd63de674043c234
tree: fba1a423694c446440520ec27ecde68d7e762c8a /docs/TR-20210808100000-IA-WDS-REFCAT
parent: 3b398716e69a56be9229b14ca7428a5a79de70b7
2 files changed, 22 insertions, 21 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index e6e1331..a827f61 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index a9a3776..dc500dc 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -168,13 +168,13 @@ We release the first version of the \emph{refcat} dataset in an format used
 internally for storage and to serve queries (and which we call \emph{biblioref}
 or \emph{bref} for short). The dataset includes metadata from fatcat, the Open
 Library project and inbound links from the English Wikipedia.  The dataset is
-integrated into the \href{https://fatcat.wiki}{fatcat website} and allows users
+integrated into the \href{https://fatcat.wiki}{fatcat.wiki website} and allows users
 to explore inbound and outbound
 references\footnote{\url{https://guide.fatcat.wiki/reference_graph.html}}.
 
 The format records source and target (fatcat release and work) identifiers, a
-few attributes from the metadata (such as year or release stage) as well as
-information about the match status and provanance.
+few metadata attributes (such as year or release stage) as well as
+information about the match status and provenance.
 
 The dataset currently contains 1,323,423,672 citations across 76,327,662
 entities (55,123,635 unique source and 60,244,206 unique target work
@@ -191,20 +191,21 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 	\begin{center}
 		\begin{tabular}{ll}
 			\toprule
-			\bf{Set}              & \bf{Count}        \\
+			\bf{Set}              & \bf{Count}    \\
 
 			\midrule
-			COCIv11 (C)           & 1,186,958,897     \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
-			\emph{refcat-doi} (R) & 1,303,424,212     \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
-			C $\cap$ R            & xxx 1,007,539,966 \\
-			C $\setminus$ R       & xxx 86,854,309    \\
-			R $\setminus$ C       & xxx 295,884,246
+			COCIv11 (C)           & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
+			\emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
+			C $\cap$ R            & 1,046,438,515 \\
+			C $\setminus$ R       & 140,520,382   \\ %  86,854,309    \\
+			R $\setminus$ C       & 256,985,697   \\ % xxx 295,884,246
 		\end{tabular}
 		\vspace*{2mm}
-		\caption{Comparison between Open Citations COCI corpus (v11, 2021-09-04)
-			and \emph{refcat-doi}, a subset of \emph{refcat} where entities
-			have a known DOI. At least 50\% of the 295,884,246 references only
-			in \emph{refcat-doi} come from links recorded within a specific dataset provider (GBIF, DOI prefix: 10.15468).}
+		\caption{Comparison between Open Citations COCI corpus (v11,
+			2021-09-04) and \emph{refcat-doi}, a subset of \emph{refcat} where
+			entities have a known DOI. At least 150,727,673 (58.7\%) of the 256,985,697 references in
+			\emph{refcat-doi} only record links within a specific dataset provider;
+			here GBIF with DOI prefix: 10.15468.}
 		\label{table:cocicmp}
 	\end{center}
 \end{table}
@@ -228,11 +229,11 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 	\begin{center}
 		\begin{tabular}{ll}
 			\toprule
-			\bf{Edge type}      & \bf{Count}        \\
+			\bf{Edge type}      & \bf{Count}    \\
 			\midrule
-			doi-doi             & xxx 1,178,488,264 \\
-			target-open-library & 20,307,064        \\
-			source-wikipedia    & 1,386,941         \\
+			doi-doi             & 1,303,424,212 \\
+			target-open-library & 20,307,064    \\
+			source-wikipedia    & 1,386,941     \\
 		\end{tabular}
 		\vspace*{2mm}
 		\caption{Output structure, e.g. edges between documents that both have a doi (doi-doi).}
@@ -241,7 +242,7 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 \end{table}
 
 We started to include non-traditional citations into the graph, such as links
-to books as recorded by the Open Library Project and links from the English
+to books as recorded by the Open Library project and links from the English
 Wikipedia to scholarly works. For links between Open Library we employ both
 identifier based and fuzzy matching; for Wikipedia references we used an
 existing dataset~\citep{harshdeep_singh_2020_3940692} and we are contributing
@@ -259,9 +260,9 @@ Machine\footnote{\href{https://archive.org/web/}{https://archive.org/web/}} of
 the Internet Archive. From a sample\footnote{In a sample of 8000 links we find
 	only 6138 responding with a HTTP 200, whereas the rest of the links yields a
 	variety of http status codes, like 404, 403, 500 and others.} we observe, that
-about 23\% of the links reference corpus links preserved at the Internet
-Archive are not accessible on the world wide web currently - making targeted
-web crawling and preservation of scholarly references an essential tool for
+about 23\% of the links in the reference corpus links preserved at the Internet
+Archive are not accessible on the world wide web currently\footnote{We used the \href{https://github.com/miku/clinker}{https://github.com/miku/clinker} command line link checking tool.} - making targeted
+web crawling and preservation of scholarly references an activity for
 maintaining citation integrity.
 
 % unpigz -c fatcat-refs-urllist-2021-06-17_lookup-20210714045637.tsv.gz| LC_ALL=C grep -F ')/' | grep -c -E "\W200\W"
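
The updated counts in this patch can be cross-checked with a few lines of arithmetic. The sketch below is not part of the commit; it copies the figures from the two changed tables, the caption, and the link-sample footnote, and the variable names are illustrative assumptions. It checks that the intersection and the two set differences add up to the COCI and \emph{refcat-doi} totals, and recomputes the two percentages quoted in the text.

```python
# Sanity check for the numbers introduced by this patch.
# All counts are copied from the updated tables, caption and footnote;
# the variable names are hypothetical and only used for this illustration.

coci = 1_186_958_897          # COCIv11 (C)
refcat_doi = 1_303_424_212    # refcat-doi (R)
both = 1_046_438_515          # C intersect R
only_coci = 140_520_382       # C minus R
only_refcat = 256_985_697     # R minus C
gbif_links = 150_727_673      # GBIF-internal links (DOI prefix 10.15468)

# Each set must equal the intersection plus its own set difference.
assert both + only_coci == coci
assert both + only_refcat == refcat_doi

# Share of refcat-doi-only references attributed to GBIF, in percent.
gbif_share = round(100 * gbif_links / only_refcat, 1)

# Link sample from the footnote: 8000 links checked, 6138 returned HTTP 200.
sample_total, sample_ok = 8000, 6138
dead_share = round(100 * (sample_total - sample_ok) / sample_total, 1)

print(f"GBIF share of R minus C: {gbif_share}%")
print(f"Links not answering with HTTP 200: {dead_share}%")
```

Both assertions hold for the new table values, and the recomputed shares agree with the 58.7\% in the caption and the "about 23\%" in the running text.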