author Martin Czygan <martin.czygan@gmail.com> 2021-09-06 14:18:37 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-09-06 14:18:37 +0200
commit 3dede8e214168ec628180f31542ac3132b8e2338
tree 2680013887f8ffd11de31e23a617f80be8a607ef /docs
parent f33e586d11f5f575f71ad209608ac9ba74fad2e5
docs: start addressing feedback from MR#4
Diffstat (limited to 'docs')
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | bin 95948 -> 96426 bytes
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 48
-rw-r--r-- docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib | 31
3 files changed, 62 insertions, 17 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
index 75a5449..9fafcc0 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index a5536d8..ab72699 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -47,7 +47,7 @@
crawls targeting primary and secondary scholarly outputs, as well as metadata
from the Open Library\footnote{\url{https://openlibrary.org}} project and
Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
- graph consists of 1,323,423,672 citations. We release this dataset under a CC0
+ graph consists of over 1.3B citations. We release this dataset under a CC0
Public Domain Dedication, accessible through an archive
item\footnote{\url{https://archive.org/details/refcat_2021-07-28}}.
The source code used for the derivation process, including exact and fuzzy
@@ -59,28 +59,42 @@
\section{Introduction}
-
The Internet Archive releases a first version of a citation graph dataset
derived from a raw corpus of about 2.5B references gathered from metadata and
-data obtained by PDF extraction tools such as
-GROBID\cite{lopez2009grobid}. Additionally, we consider integration with
+data obtained by PDF extraction and annotation tools such as
+GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with
metadata from Open Library and Wikipedia.
The goal of this report is to describe briefly the current contents and the
derivation of the dataset. We expect
this dataset to be iterated upon, with changes both in content and processing.
+According to~\citep{jinha_2010} over 50M scholarly articles have been published
+(from 1726) up to 2009, with the rate of publications on the
+rise~\citep{landhuis_2016}. In 2014, a study based on academic search engines
+estimated that at least 114M English-language scholarly documents are
+accessible on the web~\citep{khabsa_giles_2014}.
+
Modern citation indexes can be traced back to the early computing age, when
-projects like the Science Citation Index (1955)\citep{garfield2007evolution}
+projects like the Science Citation Index (1955)~\citep{garfield2007evolution}
were first devised, living on in existing commercial knowledge bases today.
Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
- the first version of which contained 6,325,178 individual
-references\citep{shotton2013publishing}. Other notable early projects
-include CiteSeerX\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
+references~\citep{shotton2013publishing}. Other notable early projects
+include CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\url{https://citec.repec.org}}. The last
decade has seen the emergence of more openly available, large scale
-citation projects like Microsoft Academic\citep{sinha2015overview} or the
-Initiative for Open Citations\footnote{\url{https://i4oc.org}}\citep{shotton2018funders}.
+citation projects like Microsoft Academic~\citep{sinha2015overview} or the
+Initiative for Open Citations\footnote{\url{https://i4oc.org}}~\citep{shotton2018funders}.
In 2021, over one billion citations are publicly available, marking a ``tipping point''
-for this category of data\citep{hutchins2021tipping}.
+for this category of data~\citep{hutchins2021tipping}.
+
+While a paper will mainly cite other papers, more citable entities exist such
+as books and web links and within links a variety of targets, such as web
+sites, reference entries, protocols or datasets. References can be extracted
+manually or through more automated methods, such as metadata access and
+structured data extraction from full text documents; the latter offering the
+benefits of scalability. The completeness of bibliographic metadata ranges from
+documents with one or more persistant identifiers to raw, potentially unclean
+strings partially describing a publication.
\section{Related Work}
@@ -89,7 +103,7 @@ There are a few large scale citation dataset available today. COCI, the
released 2018-07-29. As of its most recent release\footnote{\url{https://opencitations.net/download}}, on
2021-07-29, it contains
1,094,394,688 citations across 65,835,422 bibliographic
-resources\citep{peroni2020opencitations}.
+resources~\citep{peroni2020opencitations}.
The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
``a Wikimedia initiative to develop open citations and linked bibliographic
@@ -97,7 +111,7 @@ data to serve free knowledge'' continously adds citations to its database and
as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
publications\footnote{\url{http://wikicite.org/statistics.html}}.
-Microsoft Academic Graph\citep{sinha2015overview} is comprised of a number of
+Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at
\url{https://archive.org/details/mag-2021-06-07}} the
@@ -106,9 +120,9 @@ bibliographic entities.
Numerous other projects have been or are concerned with various aspects of
citation discovery and curation as part their feature set, among them Semantic
-Scholar\citep{fricke2018semantic}, CiteSeerX\citep{li2006citeseerx} or Aminer\citep{tang2016aminer}.
+Scholar~\citep{fricke2018semantic}, CiteSeerX~\citep{li2006citeseerx} or Aminer~\citep{tang2016aminer}.
-As mentioned in \citep{hutchins2021tipping}, the number of openly available
+As mentioned in~\citep{hutchins2021tipping}, the number of openly available
citations is not expected to shrink in the future.
@@ -120,7 +134,7 @@ or \emph{bref} for short). The dataset includes metadata from fatcat, the
Open Library project and inbound links from the English Wikipedia. The fatcat
project itself aggregates data from variety of open data sources, such as
Crossref\footnote{\url{https://crossref.org}}, PubMed\footnote{\url{https://pubmed.ncbi.nlm.nih.gov/}},
-DataCite\footnote{\url{https://datacite.org}}, Directory of Open Access Jourals (DOAJ)\footnote{\url{https://doaj.org}}, dblp\citep{ley2002dblp} and others,
+DataCite\footnote{\url{https://datacite.org}}, Directory of Open Access Jourals (DOAJ)\footnote{\url{https://doaj.org}}, dblp~\citep{ley2002dblp} and others,
as well as metadata generated from analysis of data preserved at the Internet
Archive and active crawls of publication sites on the web.
@@ -214,9 +228,9 @@ Table~\ref{table:fields}.
\end{center}
\end{table}
-Overall, a map-reduce style\citep{dean2010mapreduce} approach is
+Overall, a map-reduce style~\citep{dean2010mapreduce} approach is
followed\footnote{While the operations are similar, the processing is not
- distributed but runs on a single machine. For space efficiency, zstd\citep{collet2018zstandard} is used to compress raw data and derivations.}, which allows
+ distributed but runs on a single machine. For space efficiency, zstd~\citep{collet2018zstandard} is used to compress raw data and derivations.}, which allows
for some
uniformity in the overall processing. We extract (key, document) tuples (as
TSV) from the raw JSON data and sort by key. We then group documents with the
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib b/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib
index 4dd7f6a..f927ea4 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/refs.bib
@@ -240,3 +240,34 @@
pages={99--108},
year={2018}
}
+
+@article{jinha_2010,
+ title={Article 50 million: an estimate of the number of scholarly articles in existence},
+ volume={23},
+ DOI={10.1087/20100308},
+ publisher={Wiley},
+ author={Jinha},
+ year={2010},
+ month={Jul}
+}
+
+@article{landhuis_2016,
+ title={Scientific literature: Information overload},
+ volume={535},
+ DOI={10.1038/nj7612-457a},
+ number={7612},
+ publisher={Springer Nature},
+ author={Landhuis},
+ year={2016},
+ month={Jul}
+}
+
+@article{khabsa_giles_2014,
+ title={The Number of Scholarly Documents on the Public Web},
+ DOI={10.1371/journal.pone.0093949},
+ publisher={Public Library of Science (PLoS)},
+ author={Khabsa and Giles},
+ editor={Zhang},
+ year={2014},
+ month={May}
+}