From 31cd17cf2a1e5611935cc86dc89a752f581e1a16 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Fri, 1 Oct 2021 18:54:03 +0200
Subject: docs: first round on report review corrections

---
 docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | Bin 140069 -> 140144 bytes
 docs/TR-20210808100000-IA-WDS-REFCAT/main.tex |  49 +++++++++++++-------------
 2 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index 830f25f..be9bda0 100644
Binary files a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf and b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf differ
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index 0543612..7ac8e46 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -17,7 +17,7 @@

 \begin{document}

-\title{Refcat: The Fatcat Citation Graph}
+\title{Refcat: The Internet Archive Scholar Citation Graph}

 \author{Martin Czygan \\
         \\
@@ -39,20 +39,20 @@


 \begin{abstract}
-    As part of its scholarly data efforts, the Internet Archive releases a
+    As part of its scholarly data efforts, the Internet Archive (IA) releases a
     first version of a citation graph dataset, named \emph{refcat}, derived
     from scholarly publications and additional data sources. It is composed of
     data gathered by the fatcat cataloging
-    project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}}, related
+    project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}} (the catalog that underpins IA Scholar), related
     web-scale crawls targeting primary and secondary scholarly outputs, as well
     as metadata from the Open
     Library\footnote{\href{https://openlibrary.org}{https://openlibrary.org}}
     project and
     Wikipedia\footnote{\href{https://wikipedia.org}{https://wikipedia.org}}.
     This first version of the graph consists of over 1.3B citations. We release
-    this dataset under a CC0 Public Domain Dedication, accessible through an
-    archive
-    item\footnote{\href{https://archive.org/details/refcat\_2021-07-28}{https://archive.org/details/refcat\_2021-07-28}}.
+    this dataset under a CC0 Public Domain Dedication, accessible through
+    Internet
+    Archive\footnote{\href{https://archive.org/details/refcat\_2021-07-28}{https://archive.org/details/refcat\_2021-07-28}}.
     The source code used for the derivation process, including exact and fuzzy
     citation matching, is released under an MIT
     license\footnote{\href{https://gitlab.com/internetarchive/refcat}{https://gitlab.com/internetarchive/refcat}}.
@@ -64,7 +64,7 @@

 \section{Introduction}

-The Internet Archive releases a first version of a citation graph dataset
+The Internet Archive released a first version of a citation graph dataset
 derived from a corpus of about 2.5B raw references gathered from metadata and
 data obtained by PDF extraction and annotation tools such as
 GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with
@@ -131,10 +131,10 @@ Projects and datasets centered around citations or containing citation data as
 a core component are COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI
 citations'', which was first released
 2018-07-29\footnote{\href{https://opencitations.net/download}{https://opencitations.net/download}}
-and has been regularly updated since~\citep{peroni2020opencitations}. The
+and has been regularly updated~\citep{peroni2020opencitations}. The
 WikiCite\footnote{\href{https://meta.wikimedia.org/wiki/WikiCite}{https://meta.wikimedia.org/wiki/WikiCite}}
 project, ``a Wikimedia initiative to develop open citations and linked
-bibliographic data to serve free knowledge'' continously adds citations to its
+bibliographic data to serve free knowledge'' continuously adds citations to its
 database\footnote{\href{http://wikicite.org/statistics.html}{http://wikicite.org/statistics.html}}.
 Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
 entities\footnote{\href{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
@@ -171,15 +171,16 @@ with \emph{PaperReferences} being one relation among many others.

 \section{Dataset}

-We release the first version of the \emph{refcat} dataset in a format used
+We released the first version of the \emph{refcat} dataset in a format used
 internally for storage and to serve queries (and which we call \emph{biblioref}
-or \emph{bref} for short). The dataset includes metadata from fatcat, the Open
-Library project and inbound links from the English Wikipedia. The dataset is
-integrated into the \href{https://fatcat.wiki}{fatcat.wiki website} and allows users
-to explore inbound and outbound
+or \emph{bref} for short). The dataset includes metadata from fatcat (the
+catalog underpinning IA Scholar), the Open Library project and inbound links
+from the English Wikipedia. The dataset is integrated into the
+\href{https://fatcat.wiki}{fatcat.wiki website} and allows users to explore
+inbound and outbound
 references\footnote{\href{https://guide.fatcat.wiki/reference\_graph.html}{https://guide.fatcat.wiki/reference\_graph.html}}.

-The format records source and target (fatcat release and work) identifiers, a
+The format records source and target identifiers, a
 few metadata attributes (such as year or release stage) as well as information
 about the match status and provenance.

@@ -196,16 +197,16 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c

 \begin{table}[]
     \begin{center}
-        \begin{tabular}{ll}
+        \begin{tabular}{lll}
             \toprule
-            \bf{Set} & \bf{Count} \\
+            \bf{Set} & & \bf{Count} \\
             \midrule
-            COCIv11 (C) & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
-            \emph{refcat-doi} (R) & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
-            C $\cap$ R & 1,046,438,515 \\
-            C $\setminus$ R & 140,520,382 \\ % 86,854,309 \\
-            R $\setminus$ C & 256,985,697 \\ % xxx 295,884,246
+            COCIv11 (C) & & 1,186,958,897 \\ % zstdcat -T0 6741422v11.csv.zst | pv -l | wc -l
+            \emph{refcat-doi} (R) & & 1,303,424,212 \\ % zstdcat -T0 /magna/refcat/2021-07-28/BrefDOITable/date-2021-07-28.tsv.zst | pv -l | LC_ALL=C sort -T /sandcrawler-db/tmp-refcat/ -S70% -k3,4 -u | zstd -c -T0 > uniq_34.tsv.zst # LC_ALL=C wc -l uniq_34_doi_lower_sorted.csv
+            C $\cap$ R & overlap & 1,046,438,515 \\
+            C $\setminus$ R & COCIv11 only & 140,520,382 \\ % 86,854,309 \\
+            R $\setminus$ C & refcat-doi only & 256,985,697 \\ % xxx 295,884,246
         \end{tabular}
         \vspace*{2mm}
         \caption{Comparison between Open Citations COCI corpus (v11,
@@ -251,7 +252,7 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 \end{table}

 We started to include non-traditional citations into the graph, such as links
-to books as recorded by the Open Library project and links from the English
+to books included in Open Library and links from the English
 Wikipedia to scholarly works. For links between Open Library we employ both
 identifier based and fuzzy matching; for Wikipedia references we used a published
 dataset~\citep{harshdeep_singh_2020_3940692} and we are contributing to upstream
 projects related to wikipedia citation extraction, such as
@@ -329,7 +330,7 @@ Reference data comes from two main sources: explicit bibliographic metadata and
 PDF extraction. The bibliographic metadata is taken from fatcat, which itself
 harvests and imports web accessible sources such as Crossref, Pubmed, Arxiv,
 Datacite, DOAJ, dblp and others into its catalog (as the source permits, data
-is processed continously or in batches). Reference data from PDF documents has
+is processed continuously or in batches). Reference data from PDF documents has
 been extracted with GROBID\footnote{GROBID
 \href{https://github.com/kermitt2/grobid/releases/tag/0.5.5}{v0.5.5}}, with the TEI-XML
 results being cached locally in a key-value store accessible with an S3
--
cgit v1.2.3