docs: tr tweaks

author: Martin Czygan <martin.czygan@gmail.com> 2021-09-10 14:50:31 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-09-10 14:50:31 +0200
commit: 3b398716e69a56be9229b14ca7428a5a79de70b7 (patch)
tree: 7930a94c2db04a71c43c0cdff219216a5cb0e737
parent: 0e205c80d21c806b2779c2c3bc293e84a38b57b1 (diff)
download: refcat-3b398716e69a56be9229b14ca7428a5a79de70b7.tar.gz refcat-3b398716e69a56be9229b14ca7428a5a79de70b7.zip
2 files changed, 11 insertions, 11 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index 11230c5..e6e1331 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index c95a7d6..a9a3776 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -183,8 +183,8 @@ for both source and target).  The majority of matches - 1,250,523,321 - is
 established through identifier based matching (DOI, PMIC, PMCID, ARXIV, ISBN).
 72,900,351 citations are established through fuzzy matching techniques.
 Citations from the Open Citations COCI corpus\footnote{Reference dataset COCI
-v11, released 2021-09-04,
-\href{http://opencitations.net/index/coci}{http://opencitations.net/index/coci}}
+	v11, released 2021-09-04,
+	\href{http://opencitations.net/index/coci}{http://opencitations.net/index/coci}}
 and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:cocicmp}.
 
 \begin{table}[]
@@ -194,8 +194,8 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 			\bf{Set}              & \bf{Count}        \\
 
 			\midrule
-			COCIv11 (C)              & 1,186,958,897     \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
-			\emph{refcat-doi} (R) &  1,303,424,212    \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
+			COCIv11 (C)           & 1,186,958,897     \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
+			\emph{refcat-doi} (R) & 1,303,424,212     \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
 			C $\cap$ R            & xxx 1,007,539,966 \\
 			C $\setminus$ R       & xxx 86,854,309    \\
 			R $\setminus$ C       & xxx 295,884,246
@@ -228,11 +228,11 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 	\begin{center}
 		\begin{tabular}{ll}
 			\toprule
-			\bf{Edge type}      & \bf{Count}    \\
+			\bf{Edge type}      & \bf{Count}        \\
 			\midrule
 			doi-doi             & xxx 1,178,488,264 \\
-			target-open-library & 20,307,064     \\
-			source-wikipedia    & 1,386,941     \\
+			target-open-library & 20,307,064        \\
+			source-wikipedia    & 1,386,941         \\
 		\end{tabular}
 		\vspace*{2mm}
 		\caption{Output structure, e.g. edges between documents that both have a doi (doi-doi).}
@@ -251,14 +251,14 @@ to generate updates to the dataset. Table~\ref{table:structure} lists the
 counts for these links. Additionally, we are examining web links appearing in
 references: after an initial cleaning procedure we currently find 25,405,592
 web links\footnote{The cleaning process is necessary because OCR artifacts and
-other metadata issues exist in the data. Unfortunately, even after cleaning not
-all links will be in the form as originally intended by the authors.} in the
+	other metadata issues exist in the data. Unfortunately, even after cleaning not
+	all links will be in the form as originally intended by the authors.} in the
 reference corpus, of which 4,827,688 have been preserved with an HTTP 200
 status code in the Wayback
 Machine\footnote{\href{https://archive.org/web/}{https://archive.org/web/}} of
 the Internet Archive. From a sample\footnote{In a sample of 8000 links we find
-only 6138 responding with a HTTP 200, whereas the rest of the links yields a
-variety of http status codes, like 404, 403, 500 and others.} we observe, that
+	only 6138 responding with a HTTP 200, whereas the rest of the links yields a
+	variety of http status codes, like 404, 403, 500 and others.} we observe, that
 about 23\% of the links reference corpus links preserved at the Internet
 Archive are not accessible on the world wide web currently - making targeted
 web crawling and preservation of scholarly references an essential tool for