Diffstat (limited to 'docs')
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf bin 105712 -> 105989 bytes
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.tex 43
2 files changed, 22 insertions, 21 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index e6e1331..a827f61 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index a9a3776..dc500dc 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -168,13 +168,13 @@ We release the first version of the \emph{refcat} dataset in an format used
internally for storage and to serve queries (and which we call \emph{biblioref}
or \emph{bref} for short). The dataset includes metadata from fatcat, the Open
Library project and inbound links from the English Wikipedia. The dataset is
-integrated into the \href{https://fatcat.wiki}{fatcat website} and allows users
+integrated into the \href{https://fatcat.wiki}{fatcat.wiki website} and allows users
to explore inbound and outbound
references\footnote{\url{https://guide.fatcat.wiki/reference_graph.html}}.
The format records source and target (fatcat release and work) identifiers, a
-few attributes from the metadata (such as year or release stage) as well as
-information about the match status and provanance.
+few metadata attributes (such as year or release stage) as well as
+information about the match status and provenance.
The dataset currently contains 1,323,423,672 citations across 76,327,662
entities (55,123,635 unique source and 60,244,206 unique target work
@@ -191,20 +191,21 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
\begin{center}
\begin{tabular}{ll}
\toprule
- \bf{Set} & \bf{Count} \\
+ \bf{Set} & \bf{Count} \\
\midrule
- COCIv11 (C) & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
- \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
- C $\cap$ R & xxx 1,007,539,966 \\
- C $\setminus$ R & xxx 86,854,309 \\
- R $\setminus$ C & xxx 295,884,246
+ COCIv11 (C) & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
+ \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
+ C $\cap$ R & 1,046,438,515 \\
+ C $\setminus$ R & 140,520,382 \\ % 86,854,309 \\
+ R $\setminus$ C & 256,985,697 \\ % xxx 295,884,246
\end{tabular}
\vspace*{2mm}
- \caption{Comparison between Open Citations COCI corpus (v11, 2021-09-04)
- and \emph{refcat-doi}, a subset of \emph{refcat} where entities
- have a known DOI. At least 50\% of the 295,884,246 references only
- in \emph{refcat-doi} come from links recorded within a specific dataset provider (GBIF, DOI prefix: 10.15468).}
+ \caption{Comparison between Open Citations COCI corpus (v11,
+ 2021-09-04) and \emph{refcat-doi}, a subset of \emph{refcat} where
+ entities have a known DOI. At least 150,727,673 (58.7\%) of the 256,985,697 references in
+ \emph{refcat-doi} only record links within a specific dataset provider;
+ here GBIF with DOI prefix: 10.15468.}
\label{table:cocicmp}
\end{center}
\end{table}
@@ -228,11 +229,11 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
\begin{center}
\begin{tabular}{ll}
\toprule
- \bf{Edge type} & \bf{Count} \\
+ \bf{Edge type} & \bf{Count} \\
\midrule
- doi-doi & xxx 1,178,488,264 \\
- target-open-library & 20,307,064 \\
- source-wikipedia & 1,386,941 \\
+ doi-doi & 1,303,424,212 \\
+ target-open-library & 20,307,064 \\
+ source-wikipedia & 1,386,941 \\
\end{tabular}
\vspace*{2mm}
\caption{Output structure, e.g. edges between documents that both have a doi (doi-doi).}
@@ -241,7 +242,7 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
\end{table}
We started to include non-traditional citations into the graph, such as links
-to books as recorded by the Open Library Project and links from the English
+to books as recorded by the Open Library project and links from the English
Wikipedia to scholarly works. For links between Open Library we employ both
identifier based and fuzzy matching; for Wikipedia references we used an
existing dataset~\citep{harshdeep_singh_2020_3940692} and we are contributing
@@ -259,9 +260,9 @@ Machine\footnote{\href{https://archive.org/web/}{https://archive.org/web/}} of
the Internet Archive. From a sample\footnote{In a sample of 8000 links we find
only 6138 responding with a HTTP 200, whereas the rest of the links yields a
variety of http status codes, like 404, 403, 500 and others.} we observe, that
-about 23\% of the links reference corpus links preserved at the Internet
-Archive are not accessible on the world wide web currently - making targeted
-web crawling and preservation of scholarly references an essential tool for
+about 23\% of the links in the reference corpus links preserved at the Internet
+Archive are not accessible on the world wide web currently\footnote{We used the \href{https://github.com/miku/clinker}{https://github.com/miku/clinker} command line link checking tool.} - making targeted
+web crawling and preservation of scholarly references an activity for
maintaining citation integrity.
% unpigz -c fatcat-refs-urllist-2021-06-17_lookup-20210714045637.tsv.gz| LC_ALL=C grep -F ')/' | grep -c -E "\W200\W"
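
The revised figures in this diff can be cross-checked with plain arithmetic. The following is a minimal sketch and not part of the commit: the variable names are illustrative assumptions, and the numbers are copied from the added lines of the COCI comparison table, its caption, and the footnote on the link sample. It checks that the intersection and the two set differences add up to the sizes of C and R, and recomputes the two percentages quoted in the text.

```python
# Counts copied from the "+" lines of the diff; names are illustrative only.
coci = 1_186_958_897          # COCIv11 (C)
refcat_doi = 1_303_424_212    # refcat-doi (R)
both = 1_046_438_515          # C intersected with R
only_coci = 140_520_382       # C without R
only_refcat = 256_985_697     # R without C
gbif_links = 150_727_673      # links within DOI prefix 10.15468 (caption)

# Link sample from the footnote: 8000 links checked, 6138 returned HTTP 200.
sample_size = 8000
sample_ok = 6138


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# A set equals its intersection with another set plus the remainder.
coci_consistent = both + only_coci == coci
refcat_consistent = both + only_refcat == refcat_doi

gbif_share = percent(gbif_links, only_refcat)
not_ok_share = percent(sample_size - sample_ok, sample_size)

print("C consistent:", coci_consistent)
print("R consistent:", refcat_consistent)
print(f"GBIF share of R without C: {gbif_share:.1f}%")
print(f"Sample links not returning HTTP 200: {not_ok_share:.1f}%")
```

With the new values both identities hold, which the removed "xxx" placeholder values did not satisfy. Note that the doi-doi edge count in the second table now equals the refcat-doi total in the first.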