author Martin Czygan <martin.czygan@gmail.com> 2021-09-10 14:50:31 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-09-10 14:50:31 +0200
commit 3b398716e69a56be9229b14ca7428a5a79de70b7
tree 7930a94c2db04a71c43c0cdff219216a5cb0e737
parent 0e205c80d21c806b2779c2c3bc293e84a38b57b1
docs: tr tweaks
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | bin 105719 -> 105712 bytes
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 22
2 files changed, 11 insertions, 11 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index 11230c5..e6e1331 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index c95a7d6..a9a3776 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -183,8 +183,8 @@ for both source and target). The majority of matches - 1,250,523,321 - is
established through identifier based matching (DOI, PMID, PMCID, ARXIV, ISBN).
72,900,351 citations are established through fuzzy matching techniques.
Citations from the Open Citations COCI corpus\footnote{Reference dataset COCI
-v11, released 2021-09-04,
-\href{http://opencitations.net/index/coci}{http://opencitations.net/index/coci}}
+ v11, released 2021-09-04,
+ \href{http://opencitations.net/index/coci}{http://opencitations.net/index/coci}}
and \emph{refcat} overlap for the most part, as can be seen in~Table~\ref{table:cocicmp}.
\begin{table}[]
@@ -194,8 +194,8 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
\bf{Set} & \bf{Count} \\
\midrule
- COCIv11 (C) & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
- \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
+ COCIv11 (C) & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
+ \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
C $\cap$ R & xxx 1,007,539,966 \\
C $\setminus$ R & xxx 86,854,309 \\
R $\setminus$ C & xxx 295,884,246
@@ -228,11 +228,11 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
\begin{center}
\begin{tabular}{ll}
\toprule
- \bf{Edge type} & \bf{Count} \\
+ \bf{Edge type} & \bf{Count} \\
\midrule
doi-doi & xxx 1,178,488,264 \\
- target-open-library & 20,307,064 \\
- source-wikipedia & 1,386,941 \\
+ target-open-library & 20,307,064 \\
+ source-wikipedia & 1,386,941 \\
\end{tabular}
\vspace*{2mm}
\caption{Output structure, e.g. edges between documents that both have a doi (doi-doi).}
@@ -251,14 +251,14 @@ to generate updates to the dataset. Table~\ref{table:structure} lists the
counts for these links. Additionally, we are examining web links appearing in
references: after an initial cleaning procedure we currently find 25,405,592
web links\footnote{The cleaning process is necessary because OCR artifacts and
-other metadata issues exist in the data. Unfortunately, even after cleaning not
-all links will be in the form as originally intended by the authors.} in the
+ other metadata issues exist in the data. Unfortunately, even after cleaning not
+ all links will be in the form as originally intended by the authors.} in the
reference corpus, of which 4,827,688 have been preserved with an HTTP 200
status code in the Wayback
Machine\footnote{\href{https://archive.org/web/}{https://archive.org/web/}} of
the Internet Archive. From a sample\footnote{In a sample of 8000 links we find
-only 6138 responding with a HTTP 200, whereas the rest of the links yields a
-variety of http status codes, like 404, 403, 500 and others.} we observe, that
+ only 6138 responding with a HTTP 200, whereas the rest of the links yields a
+ variety of http status codes, like 404, 403, 500 and others.} we observe, that
about 23\% of the reference corpus links preserved at the Internet
Archive are not accessible on the world wide web currently - making targeted
web crawling and preservation of scholarly references an essential tool for