author | Martin Czygan <martin.czygan@gmail.com> | 2021-09-10 14:50:31 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-09-10 14:50:31 +0200 |
commit | 3b398716e69a56be9229b14ca7428a5a79de70b7 (patch) | |
tree | 7930a94c2db04a71c43c0cdff219216a5cb0e737 | |
parent | 0e205c80d21c806b2779c2c3bc293e84a38b57b1 (diff) | |
docs: tr tweaks
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | bin | 105719 -> 105712 bytes |
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 22 |
2 files changed, 11 insertions, 11 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
index 11230c5..e6e1331 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index c95a7d6..a9a3776 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -183,8 +183,8 @@ for both source and target). The majority of matches - 1,250,523,321 - is
 established through identifier based matching (DOI, PMIC, PMCID, ARXIV, ISBN).
 72,900,351 citations are established through fuzzy matching techniques.
 Citations from the Open Citations COCI corpus\footnote{Reference dataset COCI
-v11, released 2021-09-04,
-\href{http://opencitations.net/index/coci}{http://opencitations.net/index/coci}}
+ v11, released 2021-09-04,
+ \href{http://opencitations.net/index/coci}{http://opencitations.net/index/coci}}
 and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:cocicmp}.
 \begin{table}[]
@@ -194,8 +194,8 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 \bf{Set} & \bf{Count} \\
 \midrule
- COCIv11 (C) & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
- \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
+ COCIv11 (C) & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
+ \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
 C $\cap$ R & xxx 1,007,539,966 \\
 C $\setminus$ R & xxx 86,854,309 \\
 R $\setminus$ C & xxx 295,884,246
@@ -228,11 +228,11 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 \begin{center}
 \begin{tabular}{ll}
 \toprule
- \bf{Edge type} & \bf{Count} \\
+ \bf{Edge type} & \bf{Count} \\
 \midrule
 doi-doi & xxx 1,178,488,264 \\
- target-open-library & 20,307,064 \\
- source-wikipedia & 1,386,941 \\
+ target-open-library & 20,307,064 \\
+ source-wikipedia & 1,386,941 \\
 \end{tabular}
 \vspace*{2mm}
 \caption{Output structure, e.g. edges between documents that both have a doi (doi-doi).}
@@ -251,14 +251,14 @@ to generate updates to the dataset. Table~\ref{table:structure} lists the
 counts for these links. Additionally, we are examining web links appearing in
 references: after an initial cleaning procedure we currently find 25,405,592
 web links\footnote{The cleaning process is necessary because OCR artifacts and
-other metadata issues exist in the data. Unfortunately, even after cleaning not
-all links will be in the form as originally intended by the authors.} in the
+ other metadata issues exist in the data. Unfortunately, even after cleaning not
+ all links will be in the form as originally intended by the authors.} in the
 reference corpus, of which 4,827,688 have been preserved with an HTTP 200
 status code in the Wayback
 Machine\footnote{\href{https://archive.org/web/}{https://archive.org/web/}} of
 the Internet Archive. From a sample\footnote{In a sample of 8000 links we find
-only 6138 responding with a HTTP 200, whereas the rest of the links yields a
-variety of http status codes, like 404, 403, 500 and others.} we observe, that
+ only 6138 responding with a HTTP 200, whereas the rest of the links yields a
+ variety of http status codes, like 404, 403, 500 and others.} we observe, that
 about 23\% of the links reference corpus links preserved at the Internet
 Archive are not accessible on the world wide web currently - making targeted
 web crawling and preservation of scholarly references an essential tool for