From 517c19160d5f01a326da88174e55f700b83ceb87 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Wed, 8 Sep 2021 00:25:18 +0200
Subject: doc: tweaks

---
 docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 59 ++++++++++++---------------
 1 file changed, 26 insertions(+), 33 deletions(-)

diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index e99ddc3..e0fbc69 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -11,7 +11,6 @@
 \usepackage{amsfonts}       % blackboard math symbols
 \usepackage{nicefrac}       % compact symbols for 1/2, etc.
 \usepackage{caption}
-\usepackage{datetime}

 \providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
 \setlength{\parindent}{0pt}
@@ -105,7 +104,7 @@ reference data as provided by publishers and aggregators; this data can be
 relatively consistent when looked at per source, but may vary in style and
 comprehensiveness when looked at as a whole. Another way of acquiring
 bibliographic metadata is to analyze a source document, such as a PDF (or its
-text), directly. Tools in this category are often based on conditial random
+text), directly. Tools in this category are often based on conditional random
 fields~\citep{lafferty2001conditional} and have been implemented in projects
 such as ParsCit~\citep{councill2008parscit},
 Cermine~\citep{tkaczyk2014cermine}, EXCITE~\citep{hosseini2019excite} or
@@ -127,7 +126,7 @@ Projects centered around citations or containing citation data as a core
 component are COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI
 citations'', which was first released
 2018-07-29\footnote{\url{https://opencitations.net/download}} and has been
-regularly updated~\citep{peroni2020opencitations}. The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+regularly updated since~\citep{peroni2020opencitations}. The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
 ``a Wikimedia initiative to develop open citations and linked bibliographic
 data to serve free knowledge'' continuously adds citations to its database\footnote{\url{http://wikicite.org/statistics.html}}. Microsoft
 Academic Graph~\citep{sinha2015overview} is comprised of a number of
@@ -167,17 +166,11 @@ with \emph{PaperReferences} being one relation among many others.

 We release the first version of the \emph{refcat} dataset in a format used
 internally for storage and to serve queries (and which we call \emph{biblioref}
-or \emph{bref} for short). The dataset includes metadata from fatcat, the
-Open Library project and inbound links from the English Wikipedia. The fatcat
-project itself aggregates data from variety of open data sources, such as
-Crossref\footnote{\url{https://crossref.org}}, PubMed\footnote{\url{https://pubmed.ncbi.nlm.nih.gov/}},
-DataCite\footnote{\url{https://datacite.org}}, Directory of Open Access Jourals (DOAJ)\footnote{\url{https://doaj.org}}, dblp~\citep{ley2002dblp} and others,
-as well as metadata generated from analysis of data preserved at the Internet
-Archive and active crawls of publication sites on the web.
-
-The dataset is
+or \emph{bref} for short). The dataset includes metadata from fatcat, the Open
+Library project and inbound links from the English Wikipedia. The dataset is
 integrated into the \href{https://fatcat.wiki}{fatcat website} and allows users
-to explore inbound and outbound references\footnote{\url{https://guide.fatcat.wiki/reference_graph.html}}.
+to explore inbound and outbound
+references\footnote{\url{https://guide.fatcat.wiki/reference_graph.html}}.

 The format records source and target (fatcat release and work) identifiers, a
 few attributes from the metadata (such as year or release stage) as well as
 information about the match status and provenance.
@@ -186,13 +179,11 @@ The dataset currently contains 1,323,423,672 citations across 76,327,662
 entities (55,123,635 unique source and 60,244,206 unique target work
 identifiers; for 1,303,424,212 - or 98.49\% of all citations - we do have a DOI
-for both source and target).
-The majority of matches - 1,250,523,321 - are established through identifier
-based matching (DOI, PMIC, PMCID, ARXIV, ISBN). 72,900,351 citations are
-established through fuzzy matching techniques.
-
-The majority of citations between \emph{refcat} and COCI overlap, as can be
-seen in~Table~\ref{table:cocicmp}.
+for both source and target). The majority of matches - 1,250,523,321 - are
+established through identifier-based matching (DOI, PMID, PMCID, ARXIV, ISBN).
+72,900,351 citations are established through fuzzy matching techniques. The
+majority of citations between COCI and \emph{refcat} overlap, as can be seen
+in~Table~\ref{table:cocicmp}.

 \begin{table}[]
 \begin{center}
@@ -228,11 +219,11 @@ seen in~Table~\ref{table:cocicmp}.

 \subsection{Constraints}

-The constraints for the systems design are informed by the volume and the
+The constraints for the system design are informed by the volume and the
 variety of the data. The capability to run the whole graph derivation on a
 single machine was a minor goal as well. In total, the raw inputs amount to a
 few terabytes of textual content, mostly newline-delimited JSON. More
-importantly, while the number of data fields is low, certain schemas are very
+importantly, while the number of data fields is low, certain documents are very
 partial, with hundreds of different combinations of available field values found
 in the raw reference data. This is most likely caused by aggregators passing on
 reference data coming from hundreds of sources, each of which not necessarily
@@ -242,7 +233,7 @@ machine learning based structured data extraction tools.
 Each combination of fields may require a slightly different processing path.
 For example, references with an Arxiv identifier can be processed differently
 from references with only a title. Over 50\% of the raw reference data comes
-from a set of eight field set manifestations, as listed in
+from a set of eight field combinations, as listed in
 Table~\ref{table:fields}.

 \begin{table}[]
@@ -276,16 +267,18 @@ PDF extraction. The bibliographic metadata is taken from fatcat, which itself
 harvests and imports web-accessible sources such as Crossref, Pubmed, Arxiv,
 Datacite, DOAJ, dblp and others into its catalog (as the source permits, data
 is processed continuously or in batches). Reference data from PDF documents has
-been extracted with GROBID\footnote{GROBID \href{https://github.com/kermitt2/grobid/releases/tag/0.5.5}{v0.5.5}}, with the TEI-XML results
-being cached locally in a key-value store accessible with an S3 API. Archived
-PDF documents result from dedicated web-scale crawls of scholarly domains
-conducted with
-Heritrix\footnote{\url{https://github.com/internetarchive/heritrix3}} and a
-variety of seed lists targeting journal homepages, repositories, dataset
-providers, aggregators, web archives and other venues. A processing pipeline
-merges catalog data from the primary database and cached values in key-value
-stores and generates the set of about 2.5B references documents, which
-currently serve as an input for the citation graph derivation pipeline.
+been extracted with GROBID\footnote{GROBID
+  \href{https://github.com/kermitt2/grobid/releases/tag/0.5.5}{v0.5.5}}, with the
+TEI-XML results being cached locally in a key-value store accessible with an S3
+API. Archived PDF documents result from dedicated web-scale crawls of scholarly
+domains conducted with
+Heritrix\footnote{\url{https://github.com/internetarchive/heritrix3}} (and
+other crawl technologies) and a variety of seed lists targeting journal
+homepages, repositories, dataset providers, aggregators, web archives and other
+venues. A processing pipeline merges catalog data from the primary database and
+cached data from the key-value store and generates the set of about 2.5B
+reference documents, which currently serve as input for the citation graph
+derivation pipeline.

 \subsection{Methodology}
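The biblioref (bref) format touched by this patch records source and target
identifiers, a few metadata attributes (such as year or release stage), and
match status and provenance. A minimal sketch of one such record in Python;
all field names below are illustrative assumptions, not the actual refcat
schema:

    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class BiblioRef:
        """One citation edge; field names are hypothetical."""
        source_release_ident: str  # citing fatcat release
        target_release_ident: str  # cited fatcat release
        source_year: int           # publication year of the citing work
        release_stage: str         # e.g. "published"
        match_status: str          # e.g. "exact" or "fuzzy"
        match_provenance: str      # e.g. "crossref", "grobid"

    # One edge per line, serialized as newline-delimited JSON.
    edge = BiblioRef("src-ident", "tgt-ident", 2019, "published",
                     "exact", "crossref")
    print(json.dumps(asdict(edge)))
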
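The patch also distinguishes identifier-based matches (DOI, PMID, PMCID,
Arxiv, ISBN) from fuzzy matches. A toy two-stage matcher illustrating the
idea, an exact identifier join first with a fuzzy title fallback; the
normalization and the 0.9 threshold are assumptions, not refcat's actual
rules:

    from difflib import SequenceMatcher

    def norm_doi(doi):
        return doi.strip().lower() if doi else None

    def norm_title(title):
        # Lowercase and strip punctuation before fuzzy comparison.
        kept = "".join(c for c in title.lower() if c.isalnum() or c.isspace())
        return " ".join(kept.split())

    def match(ref, by_doi, by_title):
        """Return (target_ident, status) for one raw reference dict."""
        doi = norm_doi(ref.get("doi"))
        if doi and doi in by_doi:
            return by_doi[doi], "exact"   # identifier-based match
        title = norm_title(ref.get("title", ""))
        for cand, target in by_title.items():
            if title and SequenceMatcher(None, title, cand).ratio() > 0.9:
                return target, "fuzzy"    # fuzzy title match
        return None, "unmatched"

A real derivation over a billion references would replace the linear scan
with sort- or cluster-based grouping so that only plausible candidate pairs
are ever compared.
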
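The constraints paragraph notes that the raw inputs are mostly
newline-delimited JSON with hundreds of field combinations, eight of which
cover over 50 percent of the data. A short sketch of how such field-set
counts can be tallied; the input filename is an assumption:

    import json
    from collections import Counter

    counts = Counter()
    with open("refs.jsonl") as f:  # hypothetical raw reference dump
        for line in f:
            doc = json.loads(line)
            # The set of non-empty fields determines the processing path.
            fields = frozenset(k for k, v in doc.items()
                               if v not in (None, "", []))
            counts[fields] += 1

    for fields, n in counts.most_common(8):
        print(n, sorted(fields))
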
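Finally, GROBID TEI-XML results are cached in a key-value store accessible
through an S3 API. A sketch of retrieving one cached document with boto3
against an S3-compatible endpoint; the endpoint URL, bucket name, and the
SHA-1 key scheme are assumptions for illustration only:

    import boto3

    # S3-compatible key-value store; endpoint, bucket and key layout
    # are hypothetical.
    s3 = boto3.client("s3", endpoint_url="http://localhost:9000")

    def fetch_tei_xml(sha1hex: str, bucket: str = "grobid") -> bytes:
        """Fetch cached GROBID TEI-XML for a PDF, keyed by the PDF's SHA-1."""
        obj = s3.get_object(Bucket=bucket, Key=sha1hex + ".tei.xml")
        return obj["Body"].read()

    tei = fetch_tei_xml("0123456789abcdef0123456789abcdef01234567")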