From a593442e8b38f27da039ae20ae5e7b49e5dafdd1 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Sat, 2 Oct 2021 00:41:41 +0200
Subject: docs: address feedback on report

---
 docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | Bin 140144 -> 140827 bytes
 docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 48 +++++++++++++++-----------
 2 files changed, 28 insertions(+), 20 deletions(-)

diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index be9bda0..0ec16c7 100644
Binary files a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf and b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf differ
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index 7ac8e46..e1f985c 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -65,11 +65,12 @@
 \section{Introduction}
 
 The Internet Archive released a first version of a citation graph dataset
-derived from a corpus of about 2.5B raw references gathered from metadata
-and data obtained by PDF extraction and annotation tools such as
-GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with
-metadata from Open Library and Wikipedia. We expect this dataset to be
-iterated upon, with changes both in content and processing.
+derived from a corpus of about 2.5B raw references\footnote{Number of raw
+  references: 2,507,793,772} gathered from 63,296,308 metadata records (which are
+collected from various sources or derived from data obtained by PDF extraction
+and annotation tools such as GROBID~\citep{lopez2009grobid}). Additionally, we
+consider integration with metadata from Open Library and Wikipedia. We expect
+this dataset to be iterated upon, with changes both in content and processing.
 
 According to~\citep{jinha_2010} over 50M scholarly articles have been
 published (from 1726) up to 2009, with the rate of publications on the
@@ -181,7 +182,7 @@
 inbound and outbound
 references\footnote{\href{https://guide.fatcat.wiki/reference\_graph.html}{https://guide.fatcat.wiki/reference\_graph.html}}.
 The format records source and target identifiers, a
-few metadata attributes (such as year or release stage) as well as
+few metadata attributes (such as year or release stage, i.e., preprint, version of record, etc.) as well as
 information about the match status and provenance.
 
 The dataset currently contains 1,323,423,672 citations across 76,327,662
@@ -189,7 +190,10 @@ entities (55,123,635 unique source and 60,244,206 unique target work
 identifiers; for 1,303,424,212 - or 98.49\% of all citations - we do have a DOI
 for both source and target). The majority of matches - 1,250,523,321 - is
 established through identifier based matching (DOI, PMID, PMCID, ARXIV, ISBN).
-72,900,351 citations are established through fuzzy matching techniques.
+72,900,351 citations are established through fuzzy matching techniques, where
+references did not contain identifiers\footnote{This does not necessarily mean
+  that the records in question have no identifier; however, if an identifier
+  existed, it was not part of the raw reference.}.
 Citations from the Open Citations' COCI corpus\footnote{Reference dataset COCI
 v11, released 2021-09-04,
 \href{http://opencitations.net/index/coci}{http://opencitations.net/index/coci}}
@@ -285,14 +289,17 @@ scholarly references a key activity for maintaining citation integrity.
 
 The constraints for the system design are informed by the volume and the
 variety of the data. The capability to run the whole graph derivation on a
-single machine was a minor goal as well. In total, the raw inputs amount to a
-few terabytes of textual content, mostly newline delimited JSON. More
-importantly, while the number of data fields is low, certain documents are very
-partial with hundreds of different combinations of available field values found
-in the raw reference data. This is most likely caused by aggregators passing on
-reference data coming from hundreds of sources, each of which not necessarily
-agreeing on a common granularity for citation data and from artifacts of
-machine learning based structured data extraction tools.
+single machine\footnote{We used a shared virtual server with 24 cores and 48G
+  of main memory. The most memory-intensive part of the processing currently is
+  the buffer space set aside for \emph{GNU sort}.} was a minor goal as well. In
+total, the raw inputs amount to a few terabytes of textual content, mostly
+newline delimited JSON. More importantly, while the number of data fields is
+low, certain documents are only sparsely populated, with hundreds of different
+combinations of available field values found in the raw reference data. This is
+most likely caused by aggregators passing on reference data from hundreds of
+sources, which do not necessarily agree on a common granularity for citation
+data, and by artifacts of machine learning based structured data extraction
+tools.
 
 Each combination of fields may require a slightly different processing path.
 For example, references with an Arxiv identifier can be processed differently
@@ -338,8 +345,8 @@ API\footnote{Currently,
 \href{https://github.com/chrislusf/seaweedfs}{https://github.com/chrislusf/seaweedfs}
 is used}. Archived PDF documents result from dedicated web-scale crawls of
 scholarly domains conducted with
-Heritrix\footnote{\href{https://github.com/internetarchive/heritrix3}{https://github.com/internetarchive/heritrix3}}
-(and other crawl technologies) and a variety of seed lists targeting journal
+multiple open-source crawler technologies created by the Internet Archive
+and a variety of seed lists targeting journal
 homepages, repositories, dataset providers, aggregators, web archives and
 other venues. A processing pipeline merges catalog data from the primary
 database and cached data from the key-value store and generates the set of about 2.5B
@@ -402,7 +409,7 @@ their expected or desired match status\footnote{The list can be found under:
 
 \section{Limitations and Future Work}
 
-As other dataset in this field we expect this dataset to be iterated upon.
+As with other datasets in this field, we expect this dataset to be iterated upon.
 
 \begin{itemize}
 \item The fatcat catalog updates its metadata
@@ -431,8 +438,9 @@ As other dataset in this field we expect this dataset to be iterated upon.
 
 \section{Acknowledgements}
 
-This work is partially supported by a grant (1910-07256) from the \emph{Andrew W. Mellon
-  Foundation}.
+This work is partially supported by grants from the \emph{Andrew W. Mellon
+  Foundation}, especially ``Ensuring the Persistent Access of Open Access Journal
+Literature: Phase II'' (1910-07256, Jefferson Bailey, Principal Investigator).
 
 \section{Appendix A}
-- 
cgit v1.2.3
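The report text in the hunks above distinguishes identifier-based matches (DOI, PMID, PMCID, arXiv, ISBN) from fuzzy matches, which are used when the raw reference carries no identifier. The Python sketch below illustrates that two-tier decision under stated assumptions; it is not refcat's actual code, and every name in it (normalize_title, match_reference, by_doi, by_title, ident) is illustrative.

    import re
    import unicodedata

    def normalize_title(title: str) -> str:
        """Reduce a title to lowercase alphanumerics so that accents,
        punctuation and spacing differences do not prevent a match."""
        decomposed = unicodedata.normalize("NFKD", title)
        ascii_only = decomposed.encode("ascii", "ignore").decode()
        return re.sub(r"[^a-z0-9]+", "", ascii_only.lower())

    def match_reference(ref: dict, by_doi: dict, by_title: dict):
        """Return (target identifier, match status) for one raw reference.

        Identifier-based matching takes precedence (shown for DOI only;
        PMID, PMCID, arXiv and ISBN lookups would work alike); fuzzy
        matching is the fallback when no identifier is present."""
        doi = (ref.get("doi") or "").strip().lower()
        if doi:
            target = by_doi.get(doi)
            return (target, "exact") if target else (None, "unmatched")
        candidate = by_title.get(normalize_title(ref.get("title") or ""))
        # Accept a title-based candidate only with a corroborating
        # signal, e.g. an identical publication year.
        if candidate and candidate.get("year") == ref.get("year"):
            return candidate["ident"], "fuzzy"
        return None, "unmatched"

Requiring a corroborating signal beyond the normalized title keeps purely fuzzy matches conservative, which matters at the scale of 2.5B raw references.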
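The footnote naming GNU sort buffers as the memory-intensive part suggests a sort/merge style of single-machine processing: sort both the raw references and the catalog records by a join key, then merge them in one sequential pass so memory use stays flat regardless of input size. A sketch of such a sort-merge join over newline-delimited, tab-separated files follows; it assumes both inputs were pre-sorted externally (e.g. with LC_ALL=C sort -S 4G -k1,1), and the file layout and key choice are hypothetical, not a description of the refcat pipeline.

    import itertools

    def keyed(path):
        """Yield (key, payload) pairs from a tab-separated file whose
        first column is the join key; the file must already be sorted
        by that key."""
        with open(path) as fh:
            for line in fh:
                key, _, rest = line.rstrip("\n").partition("\t")
                yield key, rest

    def sort_merge_join(refs_path, catalog_path):
        """One sequential pass over two key-sorted files; memory is
        bounded by the largest single key group, not by input size."""
        refs = itertools.groupby(keyed(refs_path), key=lambda kv: kv[0])
        cat = itertools.groupby(keyed(catalog_path), key=lambda kv: kv[0])
        r, c = next(refs, None), next(cat, None)
        while r is not None and c is not None:
            if r[0] < c[0]:
                r = next(refs, None)
            elif r[0] > c[0]:
                c = next(cat, None)
            else:
                # Materialize both groups before advancing: groupby
                # invalidates a group once the outer iterator moves on.
                for (_, ref), (_, rec) in itertools.product(list(r[1]), list(c[1])):
                    yield r[0], ref, rec
                r, c = next(refs, None), next(cat, None)

The trade-off is the classic one: disk-backed external sorting (whose in-memory buffer the sort -S flag caps) buys the ability to join inputs far larger than main memory on one shared machine.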