From 5e351469cead9337293b449b946db9d3c2c49925 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Fri, 10 Sep 2021 19:30:30 +0200
Subject: docs: draft version
---
 docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | Bin 105989 -> 93825 bytes
 docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 67 +++++++++++++-------------
 2 files changed, 33 insertions(+), 34 deletions(-)
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index a827f61..f4273e4 100644
Binary files a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf and b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf differ
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index dc500dc..b278149 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -42,16 +42,16 @@
  As part of its scholarly data efforts, the Internet Archive releases a first version of a citation graph dataset, named \emph{refcat}, derived from scholarly publications and additional data sources. It is composed of data gathered by the fatcat
- cataloging project\footnote{\url{https://fatcat.wiki}}, related web-scale
+ cataloging project\footnote{\href{https://fatcat.wiki}{https://fatcat.wiki}}, related web-scale
  crawls targeting primary and secondary scholarly outputs, as well as metadata
- from the Open Library\footnote{\url{https://openlibrary.org}} project and
- Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
+ from the Open Library\footnote{\href{https://openlibrary.org}{https://openlibrary.org}} project and
+ Wikipedia\footnote{\href{https://wikipedia.org}{https://wikipedia.org}}. This first version of the
  graph consists of over 1.3B citations. We release this dataset under a CC0 Public Domain Dedication, accessible through an archive
- item\footnote{\url{https://archive.org/details/refcat_2021-07-28}}.
+ item\footnote{\href{https://archive.org/details/refcat\_2021-07-28}{https://archive.org/details/refcat\_2021-07-28}}.
  The source code used for the derivation process, including exact and fuzzy citation matching, is released under an MIT
- license\footnote{\url{https://gitlab.com/internetarchive/refcat}}.
+ license\footnote{\href{https://gitlab.com/internetarchive/refcat}{https://gitlab.com/internetarchive/refcat}}.
 \end{abstract} \keywords{Citation Graph, Web Archiving}
@@ -79,26 +79,25 @@ were first devised, living on in existing commercial knowledge bases today. Open alternatives were started such as the Open Citations Corpus (OCC) in 2010 - the first version of which contained 6,325,178 individual references~\citep{shotton2013publishing}. Other notable projects
-include CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
+include CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\href{https://citec.repec.org}{https://citec.repec.org}}. The last
 decade has seen the emergence of more openly available, large scale citation projects like Microsoft Academic~\citep{sinha2015overview} and the
-Initiative for Open Citations\footnote{\url{https://i4oc.org}}~\citep{shotton2018funders}.
+Initiative for Open Citations\footnote{\href{https://i4oc.org}{https://i4oc.org}}~\citep{shotton2018funders}.
 In 2021, over one billion citations are publicly available, marking a ``tipping point'' for this category of data~\citep{hutchins2021tipping}. While a paper will often cite other papers, more citable entities exist such as books or web links and within links a variety of targets, such as web pages, reference entries, protocols or datasets.
 References can be extracted
-manually or through more automated methods, such as metadata access and
-structured data extraction from full text documents; the latter offering the
+manually or through more automated methods, by accessing relevant metadata or
+structured data extraction from full text documents. Automated methods offer the
 benefits of scalability. The completeness of bibliographic metadata ranges from documents with one or more persistant identifiers to raw, potentially unclean strings partially describing a scholarly artifact. \section{Related Work}
-Two typical problems which arise in the process of compiling a citation graph
-dataset are related to data aquisition and citation matching. Data acquisition
+Two typical problems in citation graph development are related to data aquisition and citation matching. Data acquisition
 itself can take different forms: bibliographic metadata can contain explicit reference data as provided by publishers and aggregators; this data can be relatively consistent when looked at per source, but may vary in style and
@@ -125,12 +124,12 @@ by~\citep{mathiak2015challenges}. Projects centered around citations or containing citation data as a core component are COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI citations'', which was first released
-2018-07-29\footnote{\url{https://opencitations.net/download}} and has been
-regularly updated since~\citep{peroni2020opencitations}. The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+2018-07-29\footnote{\href{https://opencitations.net/download}{https://opencitations.net/download}} and has been
+regularly updated since~\citep{peroni2020opencitations}. The WikiCite\footnote{\href{https://meta.wikimedia.org/wiki/WikiCite}{https://meta.wikimedia.org/wiki/WikiCite}} project,
 ``a Wikimedia initiative to develop open citations and linked bibliographic data to serve free knowledge'' continously adds citations to its
-database\footnote{\url{http://wikicite.org/statistics.html}}. Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
-entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
+database\footnote{\href{http://wikicite.org/statistics.html}{http://wikicite.org/statistics.html}}. Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
+entities\footnote{\href{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
 with \emph{PaperReferences} being one relation among many others.
@@ -170,7 +169,7 @@ or \emph{bref} for short). The dataset includes metadata from fatcat, the Open Library project and inbound links from the English Wikipedia. The dataset is integrated into the \href{https://fatcat.wiki}{fatcat.wiki website} and allows users to explore inbound and outbound
-references\footnote{\url{https://guide.fatcat.wiki/reference_graph.html}}.
+references\footnote{\href{https://guide.fatcat.wiki/reference\_graph.html}{https://guide.fatcat.wiki/reference\_graph.html}}.
 The format records source and target (fatcat release and work) identifiers, a few metadata attributes (such as year or release stage) as well as
@@ -182,7 +181,7 @@ identifiers; for 1,303,424,212 - or 98.49\% of all citations - we do have a DOI for both source and target). The majority of matches - 1,250,523,321 - is established through identifier based matching (DOI, PMIC, PMCID, ARXIV, ISBN). 72,900,351 citations are established through fuzzy matching techniques.
-Citations from the Open Citations COCI corpus\footnote{Reference dataset COCI
+Citations from the Open Citations' COCI corpus\footnote{Reference dataset COCI
 v11, released 2021-09-04, \href{http://opencitations.net/index/coci}{http://opencitations.net/index/coci}} and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:cocicmp}.
@@ -236,7 +235,9 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
  source-wikipedia & 1,386,941 \\ \end{tabular} \vspace*{2mm}
- \caption{Output structure, e.g. edges between documents that both have a doi (doi-doi).}
+ \caption{Counts of classic DOI to DOI references as well as outbound
+ references matched against Open Library as well as inbound references
+ from the English Wikipedia.}
  \label{table:structure} \end{center} \end{table}
@@ -244,25 +245,24 @@ and \emph{refcat} overlap to the most part, as can be seen in~Table~\ref{table:c
 We started to include non-traditional citations into the graph, such as links to books as recorded by the Open Library project and links from the English Wikipedia to scholarly works. For links between Open Library we employ both
-identifier based and fuzzy matching; for Wikipedia references we used an
-existing dataset~\citep{harshdeep_singh_2020_3940692} and we are contributing
+identifier based and fuzzy matching; for Wikipedia references we used a published dataset~\citep{harshdeep_singh_2020_3940692} and we are contributing
 to upstream projects related to wikipedia citation extraction, such as \emph{wikiciteparser}\footnote{\href{https://github.com/dissemin/wikiciteparser}{https://github.com/dissemin/wikiciteparser}}
-to generate updates to the dataset. Table~\ref{table:structure} lists the
+to generate updates from recent Wikipedia dumps\footnote{Wikipedia dumps are available on a monthly basis from \href{https://dumps.wikimedia.org/}{https://dumps.wikimedia.org/}.}. Table~\ref{table:structure} lists the
 counts for these links.
 Additionally, we are examining web links appearing in references: after an initial cleaning procedure we currently find 25,405,592 web links\footnote{The cleaning process is necessary because OCR artifacts and other metadata issues exist in the data. Unfortunately, even after cleaning not all links will be in the form as originally intended by the authors.} in the
-reference corpus, of which 4,827,688 have been preserved with an HTTP 200
+reference corpus, of which 4,827,688 have been preserved as of August 2021 with an HTTP 200
 status code in the Wayback Machine\footnote{\href{https://archive.org/web/}{https://archive.org/web/}} of the Internet Archive. From a sample\footnote{In a sample of 8000 links we find only 6138 responding with a HTTP 200, whereas the rest of the links yields a variety of http status codes, like 404, 403, 500 and others.} we observe, that
-about 23\% of the links in the reference corpus links preserved at the Internet
+about 23\% of the links in the reference corpus preserved at the Internet
 Archive are not accessible on the world wide web currently\footnote{We used the \href{https://github.com/miku/clinker}{https://github.com/miku/clinker} command line link checking tool.} - making targeted
-web crawling and preservation of scholarly references an activity for
+web crawling and preservation of scholarly references a key activity for
 maintaining citation integrity.
 % unpigz -c fatcat-refs-urllist-2021-06-17_lookup-20210714045637.tsv.gz| LC_ALL=C grep -F ')/' | grep -c -E "\W200\W"
@@ -324,7 +324,7 @@ been extracted with GROBID\footnote{GROBID
 TEI-XML results being cached locally in a key-value store accessible with an S3 API.
 Archived PDF documents result from dedicated web-scale crawls of scholarly domains conducted with
-Heritrix\footnote{\url{https://github.com/internetarchive/heritrix3}} (and
+Heritrix\footnote{\href{https://github.com/internetarchive/heritrix3}{https://github.com/internetarchive/heritrix3}} (and
 other crawl technologies) and a variety of seed lists targeting journal homepages, repositories, dataset providers, aggregators, web archives and other venues. A processing pipeline merges catalog data from the primary database and
@@ -348,7 +348,7 @@ The key derivation can be exact (via an identifier like DOI, PMID, etc) or based on a value normalization, like ``slugifying'' a title string. For identifier based matches we can generate the target schema directly. For fuzzy matching candidates, we pass possible match pairs through a verification procedure,
-which is implemented for \emph{release entity}\footnote{\url{https://guide.fatcat.wiki/entity_release.html}.} pairs. This procedure is a
+which is implemented for \emph{release entity}\footnote{\href{https://guide.fatcat.wiki/entity\_release.html}{https://guide.fatcat.wiki/entity\_release.html}.} pairs. This procedure is a
 domain dependent rule based verification, able to identify different versions of a publication, preprint-published pairs and documents, which are are similar by various metrics calculated over title and author fields. The fuzzy matching
@@ -356,10 +356,10 @@ approach is applied on all reference documents without identifier (a title is
 currently required). We currently implement performance sensitive parts in
-Go\footnote{\url{https://golang.org/}}, with various processing stages (e.g.
+Go\footnote{\href{https://golang.org/}{https://golang.org/}}, with various processing stages (e.g.
 conversion, map, reduce, ...) represented by separate command line tools.
 A thin task orchestration layer using the luigi
-framework\footnote{\url{https://github.com/spotify/luigi}~\citep{bernhardsson2018rouhani},
+framework\footnote{\href{https://github.com/spotify/luigi}{https://github.com/spotify/luigi}~\citep{bernhardsson2018rouhani},
 which has been used in various scientific pipeline application, like~\citep{schulz2016use},~\citep{erdmann2017design},~\citep{lampa2019scipipe},~\citep{czygan2014design} and others.} allows for experimentation in the pipeline and for single command
@@ -374,7 +374,7 @@ candidate generation phase in order to improve recall, but we are strict during verification, in order to control precision. Quality assurance for verification is implemented through a growing list of test cases of real examples from the catalog and their expected or desired match status\footnote{The list can be found under:
- \url{https://gitlab.com/internetarchive/refcat/-/blob/master/skate/testdata/verify.csv}.
+ \href{https://gitlab.com/internetarchive/refcat/-/blob/master/skate/testdata/verify.csv}{https://gitlab.com/internetarchive/refcat/-/blob/master/skate/testdata/verify.csv}.
  It is helpful to keep this test suite independent of any specific programming language.}.
@@ -386,8 +386,8 @@ As other dataset in this field we expect this dataset to be iterated upon.
 \begin{itemize} \item The fatcat catalog updates its metadata continously\footnote{A changelog can currenly be followed here:
- \url{https://fatcat.wiki/changelog}.} and web crawls are conducted
- regularly. Current processing pipelines cover raw reference snapshot
+ \href{https://fatcat.wiki/changelog}{https://fatcat.wiki/changelog}.} and web crawls are conducted
+ regularly. Current processing pipelines cover raw reference snapshot
  creation and derivation of the graph structure, which allows to rerun processing based on updated data as it becomes available.
@@ -399,11 +399,10 @@ As other dataset in this field we expect this dataset to be iterated upon.
  \item As of this version, a number of raw reference docs remain unmatched, which means that neither exact nor fuzzy matching
- has detected a link to a known entity. On the one
- hand, this can hint at missing metadata. However, parts of the data
+ has detected a link to a known entity. Metadata might be missing. However, parts of the data
  will contain a reference to a catalogued entity, but in a specific, dense and harder to recover form.
- This also include improvements to the fuzzy matching approach.
+
  \item The reference dataset contains millions of URLs and their integration into the graph has been implemented as a prototype. A full implementation requires a few data cleanup and normalization steps.