-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | bin | 140144 -> 140057 bytes
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 83
2 files changed, 45 insertions, 38 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
index 97cfc56..f4a9e26 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index b020a47..c950f61 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -39,19 +39,25 @@
 \begin{abstract}
- As part of its scholarly data efforts, the Internet Archive releases a first version of a citation
- graph dataset, named \emph{refcat}, derived from scholarly publications and
- additional data sources. It is composed of data gathered by the fatcat
- cataloging project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}}, related web-scale
- crawls targeting primary and secondary scholarly outputs, as well as metadata
- from the Open Library\footnote{\href{https://openlibrary.org}{https://openlibrary.org}} project and
- Wikipedia\footnote{\href{https://wikipedia.org}{https://wikipedia.org}}. This first version of the
- graph consists of over 1.3B citations. We release this dataset under a CC0
- Public Domain Dedication, accessible through an archive
+ As part of its scholarly data efforts, the Internet Archive releases a
+ first version of a citation graph dataset, named \emph{refcat}, derived
+ from scholarly publications and additional data sources. It is composed of
+ data gathered by the fatcat cataloging
+ project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}}, related
+ web-scale crawls targeting primary and secondary scholarly outputs, as well
+ as metadata from the Open
+ Library\footnote{\href{https://openlibrary.org}{https://openlibrary.org}}
+ project and
+ Wikipedia\footnote{\href{https://wikipedia.org}{https://wikipedia.org}}.
+ This first version of the graph consists of over 1.3B citations. We release
+ this dataset under a CC0 Public Domain Dedication, accessible through an
+ archive
  item\footnote{\href{https://archive.org/details/refcat\_2021-07-28}{https://archive.org/details/refcat\_2021-07-28}}.
  The source code used for the derivation process, including exact and fuzzy
  citation matching, is released under an MIT
  license\footnote{\href{https://gitlab.com/internetarchive/refcat}{https://gitlab.com/internetarchive/refcat}}.
+ The goal of this report is to describe briefly the current contents and the
+ derivation of the dataset.
 \end{abstract}

 \keywords{Citation Graph, Web Archiving}
@@ -59,13 +65,11 @@
 \section{Introduction}

 The Internet Archive releases a first version of a citation graph dataset
-derived from a raw corpus of about 2.5B references gathered from metadata and
-data obtained by PDF extraction and annotation tools such as
+derived from a raw corpus of about 2.5B raw references gathered from metadata
+and data obtained by PDF extraction and annotation tools such as
 GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with
-metadata from Open Library and Wikipedia.
-The goal of this report is to describe briefly the current contents and the
-derivation of the dataset. We expect
-this dataset to be iterated upon, with changes both in content and processing.
+metadata from Open Library and Wikipedia. We expect this dataset to be
+iterated upon, with changes both in content and processing.

 According to~\citep{jinha_2010} over 50M scholarly articles have been
 published (from 1726) up to 2009, with the rate of publications on the
@@ -86,14 +90,14 @@ Initiative for Open Citations\footnote{\href{https://i4oc.org}{https://i4oc.org}
 In 2021, over one billion citations are publicly available, marking a
 ``tipping point'' for this category of data~\citep{hutchins2021tipping}.

-While a paper will often cite other papers, more citable entities exist such
-as books or web links and within links a variety of targets, such as web
-pages, reference entries, protocols or datasets. References can be extracted
-manually or through more automated methods, by accessing relevant metadata or
-structured data extraction from full text documents. Automated methods offer the
-benefits of scalability. The completeness of bibliographic metadata ranges from
-documents with one or more persistant identifiers to raw, potentially unclean
-strings partially describing a scholarly artifact.
+While a paper will often cite other papers, more citable entities exist such as
+books or web links and within links a variety of targets, such as web pages,
+reference entries, protocols or datasets. References can be extracted manually
+or through more automated methods, by accessing relevant metadata or structured
+data extraction from full text documents. Automated methods offer the benefits
+of scalability. The completeness of bibliographic metadata in references ranges
+from documents with one or more persistant identifiers to raw, potentially
+unclean strings partially describing a scholarly artifact.

 \section{Related Work}

@@ -163,7 +167,7 @@ with \emph{PaperReferences} being one relation among many others.

 \section{Dataset}

-We release the first version of the \emph{refcat} dataset in an format used
+We release the first version of the \emph{refcat} dataset in a format used
 internally for storage and to serve queries (and which we call \emph{biblioref}
 or \emph{bref} for short). The dataset includes metadata from fatcat, the Open
 Library project and inbound links from the English Wikipedia. The dataset is
@@ -257,13 +261,16 @@ web links\footnote{The cleaning process is necessary because OCR artifacts and
 reference corpus, of which 4,827,688 have been preserved as of August 2021
 with an HTTP 200 status code in the Wayback
 Machine\footnote{\href{https://archive.org/web/}{https://archive.org/web/}} of
-the Internet Archive. From a sample\footnote{In a sample of 8000 links we find
- only 6138 responding with a HTTP 200, whereas the rest of the links yields a
- variety of http status codes, like 404, 403, 500 and others.} we observe, that
-about 23\% of the links in the reference corpus preserved at the Internet
-Archive are not accessible on the world wide web currently\footnote{We used the \href{https://github.com/miku/clinker}{https://github.com/miku/clinker} command line link checking tool.} - making targeted
-web crawling and preservation of scholarly references a key activity for
-maintaining citation integrity.
+the Internet Archive.
+
+In a random sample of 8000 links we find only 6138 responding
+with an HTTP 200 OK, whereas the rest of the links yield a variety of HTTP status
+codes, like 404, 403, 500 and others - resulting in 23\% of the links
+in the reference corpus preserved at the Internet Archive being currently inaccessible on
+the web\footnote{We used the
+ \href{https://github.com/miku/clinker}{https://github.com/miku/clinker} command
+ line link checking tool.} - making targeted web crawling and preservation of
+scholarly references a key activity for maintaining citation integrity.

 % unpigz -c fatcat-refs-urllist-2021-06-17_lookup-20210714045637.tsv.gz| LC_ALL=C grep -F ')/' | grep -c -E "\W200\W"

@@ -331,7 +338,7 @@ Heritrix\footnote{\href{https://github.com/internetarchive/heritrix3}{https://gi
 homepages, repositories, dataset providers, aggregators, web archives and other
 venues. A processing pipeline merges catalog data from the primary database and
 cached data from the key-value store and generates the set of about 2.5B
-references documents, which currently serve as an input for the citation graph
+references records, which currently serve as an input for the citation graph
 derivation pipeline.

 \subsection{Methodology}
@@ -340,7 +347,7 @@ Overall, a map-reduce style~\citep{dean2010mapreduce} approach is
 followed\footnote{While the operations are similar, the processing is not
 distributed but runs on a single machine. For space efficiency, zstd~\citep{collet2018zstandard} is used
 to compress raw data and derivations.}, which allows for some
-uniformity in the overall processing. We extract (key, document) tuples (as
+uniformity in the processing. We extract \emph{(key, document)} tuples (as
 TSV) from the raw JSON data and sort by key. We then group documents with the
 same key and apply a function on each group in order to generate
 our target schema or perform
@@ -357,8 +364,8 @@ are similar by various metrics calculated over title and author fields. The fuzz
 approach is applied on all reference documents without identifier (a title is
 currently required).

-We currently implement performance sensitive parts in
-Go\footnote{\href{https://golang.org/}{https://golang.org/}}, with various processing stages (e.g.
+We currently implement performance sensitive parts in the
+Go programming language\footnote{\href{https://golang.org/}{https://golang.org/}}, with various processing stages (e.g.
 conversion, map, reduce, ...) represented by separate command line tools. A
 thin task orchestration layer using the luigi
 framework\footnote{\href{https://github.com/spotify/luigi}{https://github.com/spotify/luigi}~\citep{bernhardsson2018rouhani},
@@ -367,7 +374,7 @@ framework\footnote{\href{https://github.com/spotify/luigi}{https://github.com/sp
 and others.} allows for experimentation in the pipeline and for single command
 derivations, as data dependencies are encoded with the help of the orchestrator.
 Within the tasks, we also utilize classic platform tools such as
-\emph{sort}~\citep{mcilroy1971research}.
+GNU \emph{sort}~\citep{mcilroy1971research}.

 During a last processing step, we fuse reference matches and unmatched items
 into a single, indexable file. This step includes deduplication of different
@@ -376,7 +383,7 @@ indexed into an search index and serves both matched and unmatched references
 for the web application, allowing for further collection of feedback on match
 quality and possible improvements.

-With a few schema conversions, fuzzy matching can be applied to Wikipedia
+With a few schema conversions, fuzzy matching has been be applied to Wikipedia
 articles and Open Library (edition) records as well. The aspect of precision
 and recall are represented by the two stages: we are generous in the match
 candidate generation phase in order to improve recall, but we are strict during
@@ -419,7 +426,7 @@ As other dataset in this field we expect this dataset to be iterated upon.

 \section{Acknowledgements}

-This work is partially supported by a grant from the \emph{Andrew W. Mellon
+This work is partially supported by a grant (1910-07256) from the \emph{Andrew W. Mellon
 Foundation}.
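
The Methodology text touched by this diff describes extracting (key, document) tuples from raw JSON, sorting by key, grouping documents that share a key, and applying a function to each group. A minimal Python sketch of that pattern follows. It is not the authors' implementation (the report states that the performance sensitive parts are written in Go and run as separate command line tools), and the field names `kind`, `id` and `doi`, the sample records, and the function names are all assumptions made up for illustration.

```python
import itertools
import json


def extract(lines, key_field):
    """Map step: yield (key, document) tuples for JSON lines that carry the key."""
    for line in lines:
        doc = json.loads(line)
        key = doc.get(key_field)
        if key:
            yield (key, doc)


def group_by_key(pairs):
    """Sort tuples by key, then yield (key, [documents]) for each run of equal keys."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    for key, group in itertools.groupby(ordered, key=lambda pair: pair[0]):
        yield key, [doc for _, doc in group]


def reduce_group(key, docs):
    """Reduce step (hypothetical): pair every reference with every release sharing the key."""
    releases = [d for d in docs if d["kind"] == "release"]
    refs = [d for d in docs if d["kind"] == "ref"]
    return [(ref["id"], rel["id"], key) for ref in refs for rel in releases]


# Hypothetical input: one catalog release and three raw references.
raw = [
    '{"kind": "release", "id": "rel-1", "doi": "10.1/a"}',
    '{"kind": "ref", "id": "ref-7", "doi": "10.1/a"}',
    '{"kind": "ref", "id": "ref-9", "doi": "10.1/b"}',
    '{"kind": "ref", "id": "ref-3"}',
]

matches = []
for key, docs in group_by_key(extract(raw, "doi")):
    matches.extend(reduce_group(key, docs))
```

Only `ref-7` shares an identifier with a release, so it is the only exact match here; `ref-9` has no release under its key, and `ref-3` has no identifier at all, which is the kind of record the report hands to the fuzzy matching stage. Sorting before grouping matters because `itertools.groupby` only merges adjacent items, which mirrors the sort-then-group order described in the text.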