-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | bin | 105712 -> 105989 bytes
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 43
2 files changed, 22 insertions, 21 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
index e6e1331..a827f61 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index a9a3776..dc500dc 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -168,13 +168,13 @@ We release the first version of the \emph{refcat} dataset in an format used
 internally for storage and to serve queries (and which we call \emph{biblioref}
 or \emph{bref} for short). The dataset includes metadata from fatcat, the Open
 Library project and inbound links from the English Wikipedia. The dataset is
-integrated into the \href{https://fatcat.wiki}{fatcat website} and allows users
+integrated into the \href{https://fatcat.wiki}{fatcat.wiki website} and allows users
 to explore inbound and outbound
 references\footnote{\url{https://guide.fatcat.wiki/reference_graph.html}}.
 The format records source and target (fatcat release and work) identifiers, a
-few attributes from the metadata (such as year or release stage) as well as
-information about the match status and provanance.
+few metadata attributes (such as year or release stage) as well as
+information about the match status and provenance.
 The dataset currently contains 1,323,423,672 citations across 76,327,662
 entities (55,123,635 unique source and 60,244,206 unique target work
@@ -191,20 +191,21 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 \begin{center}
 \begin{tabular}{ll}
 \toprule
- \bf{Set} & \bf{Count} \\
+ \bf{Set} & \bf{Count} \\
 \midrule
- COCIv11 (C) & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
- \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
- C $\cap$ R & xxx 1,007,539,966 \\
- C $\setminus$ R & xxx 86,854,309 \\
- R $\setminus$ C & xxx 295,884,246
+ COCIv11 (C) & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
+ \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
+ C $\cap$ R & 1,046,438,515 \\
+ C $\setminus$ R & 140,520,382 \\ % 86,854,309 \\
+ R $\setminus$ C & 256,985,697 \\ % xxx 295,884,246
 \end{tabular}
 \vspace*{2mm}
- \caption{Comparison between Open Citations COCI corpus (v11, 2021-09-04)
- and \emph{refcat-doi}, a subset of \emph{refcat} where entities
- have a known DOI. At least 50\% of the 295,884,246 references only
- in \emph{refcat-doi} come from links recorded within a specific dataset provider (GBIF, DOI prefix: 10.15468).}
+ \caption{Comparison between Open Citations COCI corpus (v11,
+ 2021-09-04) and \emph{refcat-doi}, a subset of \emph{refcat} where
+ entities have a known DOI. At least 150,727,673 (58.7\%) of the 256,985,697 references in
+ \emph{refcat-doi} only record links within a specific dataset provider;
+ here GBIF with DOI prefix: 10.15468.}
 \label{table:cocicmp}
 \end{center}
 \end{table}
@@ -228,11 +229,11 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 \begin{center}
 \begin{tabular}{ll}
 \toprule
- \bf{Edge type} & \bf{Count} \\
+ \bf{Edge type} & \bf{Count} \\
 \midrule
- doi-doi & xxx 1,178,488,264 \\
- target-open-library & 20,307,064 \\
- source-wikipedia & 1,386,941 \\
+ doi-doi & 1,303,424,212 \\
+ target-open-library & 20,307,064 \\
+ source-wikipedia & 1,386,941 \\
 \end{tabular}
 \vspace*{2mm}
 \caption{Output structure, e.g. edges between documents that both have a doi (doi-doi).}
@@ -241,7 +242,7 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 \end{table}
 We started to include non-traditional citations into the graph, such as links
-to books as recorded by the Open Library Project and links from the English
+to books as recorded by the Open Library project and links from the English
 Wikipedia to scholarly works. For links between Open Library we employ both
 identifier based and fuzzy matching; for Wikipedia references we used an
 existing dataset~\citep{harshdeep_singh_2020_3940692} and we are contributing
@@ -259,9 +260,9 @@ Machine\footnote{\href{https://archive.org/web/}{https://archive.org/web/}}
 of the Internet Archive. From a sample\footnote{In a sample of 8000 links we
 find only 6138 responding with a HTTP 200, whereas the rest of the links yields
 a variety of http status codes, like 404, 403, 500 and others.} we observe, that
-about 23\% of the links reference corpus links preserved at the Internet
-Archive are not accessible on the world wide web currently - making targeted
-web crawling and preservation of scholarly references an essential tool for
+about 23\% of the links in the reference corpus links preserved at the Internet
+Archive are not accessible on the world wide web currently\footnote{We used the \href{https://github.com/miku/clinker}{https://github.com/miku/clinker} command line link checking tool.} - making targeted
+web crawling and preservation of scholarly references an activity for
 maintaining citation integrity.
 % unpigz -c fatcat-refs-urllist-2021-06-17_lookup-20210714045637.tsv.gz| LC_ALL=C grep -F ')/' | grep -c -E "\W200\W" |
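
Review note: the revised figures in this diff can be checked for internal consistency. The sketch below is not part of the commit. It copies the counts from the added lines and verifies that the overlap and difference rows add up to the two set sizes, and that the quoted percentages follow from the raw numbers. The variable and function names (`check_partition`, `percent`) are hypothetical and chosen only for illustration.

```python
# Counts copied from the "+" lines of the diff above (not recomputed from data).
coci = 1_186_958_897         # COCIv11 (C)
refcat_doi = 1_303_424_212   # refcat-doi (R)
both = 1_046_438_515         # C intersect R
only_coci = 140_520_382      # C minus R
only_refcat = 256_985_697    # R minus C
gbif_only = 150_727_673      # references attributed to GBIF in the caption

# Link sample from the footnote: 8000 links checked, 6138 returned HTTP 200.
sample_size = 8000
sample_ok = 6138


def check_partition(total, shared, exclusive):
    """A set splits into its shared part and its exclusive part."""
    return shared + exclusive == total


def percent(part, whole, digits=1):
    """Share of `part` in `whole`, as a rounded percentage."""
    return round(100 * part / whole, digits)


if __name__ == "__main__":
    print("C = (C and R) + (C minus R):", check_partition(coci, both, only_coci))
    print("R = (C and R) + (R minus C):", check_partition(refcat_doi, both, only_refcat))
    print("GBIF share of R minus C (%):", percent(gbif_only, only_refcat))
    print("Links not answering 200 (%):", percent(sample_size - sample_ok, sample_size, 0))
```

Both partitions hold for the new numbers, which the placeholder values marked `xxx` on the removed lines did not satisfy. The doi-doi edge count in the second table also matches the size of R.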