Diffstat (limited to 'docs/TR-20210808100000-IA-WDS-REFCAT/main.tex')
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 48 |
1 files changed, 31 insertions, 17 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index a5536d8..ab72699 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -47,7 +47,7 @@ crawls targeting primary and secondary scholarly outputs, as well as metadata from the Open Library\footnote{\url{https://openlibrary.org}} project and Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
- graph consists of 1,323,423,672 citations. We release this dataset under a CC0
+ graph consists of over 1.3B citations. We release this dataset under a CC0
 Public Domain Dedication, accessible through an archive item\footnote{\url{https://archive.org/details/refcat_2021-07-28}}. The source code used for the derivation process, including exact and fuzzy
@@ -59,28 +59,42 @@ \section{Introduction}
-
 The Internet Archive releases a first version of a citation graph dataset derived from a raw corpus of about 2.5B references gathered from metadata and
-data obtained by PDF extraction tools such as
-GROBID\cite{lopez2009grobid}. Additionally, we consider integration with
+data obtained by PDF extraction and annotation tools such as
+GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with
 metadata from Open Library and Wikipedia. The goal of this report is to describe briefly the current contents and the derivation of the dataset. We expect this dataset to be iterated upon, with changes both in content and processing.
+According to~\citep{jinha_2010} over 50M scholarly articles have been published
+(from 1726) up to 2009, with the rate of publications on the
+rise~\citep{landhuis_2016}. In 2014, a study based on academic search engines
+estimated that at least 114M English-language scholarly documents are
+accessible on the web~\citep{khabsa_giles_2014}.
+
 Modern citation indexes can be traced back to the early computing age, when
-projects like the Science Citation Index (1955)\citep{garfield2007evolution}
+projects like the Science Citation Index (1955)~\citep{garfield2007evolution}
 were first devised, living on in existing commercial knowledge bases today. Open alternatives were started such as the Open Citations Corpus (OCC) in 2010 - the first version of which contained 6,325,178 individual
-references\citep{shotton2013publishing}. Other notable early projects
-include CiteSeerX\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
+references~\citep{shotton2013publishing}. Other notable early projects
+include CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
 decade has seen the emergence of more openly available, large scale
-citation projects like Microsoft Academic\citep{sinha2015overview} or the
-Initiative for Open Citations\footnote{\url{https://i4oc.org}}\citep{shotton2018funders}.
+citation projects like Microsoft Academic~\citep{sinha2015overview} or the
+Initiative for Open Citations\footnote{\url{https://i4oc.org}}~\citep{shotton2018funders}.
 In 2021, over one billion citations are publicly available, marking a ``tipping point''
-for this category of data\citep{hutchins2021tipping}.
+for this category of data~\citep{hutchins2021tipping}.
+
+While a paper will mainly cite other papers, more citable entities exist such
+as books and web links and within links a variety of targets, such as web
+sites, reference entries, protocols or datasets. References can be extracted
+manually or through more automated methods, such as metadata access and
+structured data extraction from full text documents; the latter offering the
+benefits of scalability. The completeness of bibliographic metadata ranges from
+documents with one or more persistant identifiers to raw, potentially unclean
+strings partially describing a publication.
 \section{Related Work}
@@ -89,7 +103,7 @@ There are a few large scale citation dataset available today. COCI, the released 2018-07-29. As of its most recent release\footnote{\url{https://opencitations.net/download}}, on 2021-07-29, it contains 1,094,394,688 citations across 65,835,422 bibliographic
-resources\citep{peroni2020opencitations}.
+resources~\citep{peroni2020opencitations}.
 The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project, ``a Wikimedia initiative to develop open citations and linked bibliographic
@@ -97,7 +111,7 @@ data to serve free knowledge'' continously adds citations to its database and as of 2021-06-28 tracks 253,719,394 citations across 39,994,937 publications\footnote{\url{http://wikicite.org/statistics.html}}.
-Microsoft Academic Graph\citep{sinha2015overview} is comprised of a number of
+Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
 entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}} with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at \url{https://archive.org/details/mag-2021-06-07}} the
@@ -106,9 +120,9 @@ bibliographic entities. Numerous other projects have been or are concerned with various aspects of citation discovery and curation as part their feature set, among them Semantic
-Scholar\citep{fricke2018semantic}, CiteSeerX\citep{li2006citeseerx} or Aminer\citep{tang2016aminer}.
+Scholar~\citep{fricke2018semantic}, CiteSeerX~\citep{li2006citeseerx} or Aminer~\citep{tang2016aminer}.
-As mentioned in \citep{hutchins2021tipping}, the number of openly available
+As mentioned in~\citep{hutchins2021tipping}, the number of openly available
 citations is not expected to shrink in the future.
@@ -120,7 +134,7 @@ or \emph{bref} for short). The dataset includes metadata from fatcat, the Open Library project and inbound links from the English Wikipedia.
 The fatcat project itself aggregates data from variety of open data sources, such as Crossref\footnote{\url{https://crossref.org}}, PubMed\footnote{\url{https://pubmed.ncbi.nlm.nih.gov/}},
-DataCite\footnote{\url{https://datacite.org}}, Directory of Open Access Jourals (DOAJ)\footnote{\url{https://doaj.org}}, dblp\citep{ley2002dblp} and others,
+DataCite\footnote{\url{https://datacite.org}}, Directory of Open Access Jourals (DOAJ)\footnote{\url{https://doaj.org}}, dblp~\citep{ley2002dblp} and others,
 as well as metadata generated from analysis of data preserved at the Internet Archive and active crawls of publication sites on the web.
@@ -214,9 +228,9 @@ Table~\ref{table:fields}. \end{center} \end{table}
-Overall, a map-reduce style\citep{dean2010mapreduce} approach is
+Overall, a map-reduce style~\citep{dean2010mapreduce} approach is
 followed\footnote{While the operations are similar, the processing is not
- distributed but runs on a single machine. For space efficiency, zstd\citep{collet2018zstandard} is used to compress raw data and derivations.}, which allows
+ distributed but runs on a single machine. For space efficiency, zstd~\citep{collet2018zstandard} is used to compress raw data and derivations.}, which allows
 for some uniformity in the overall processing. We extract (key, document) tuples (as TSV) from the raw JSON data and sort by key. We then group documents with the |