Diffstat (limited to 'docs/Simple')
-rw-r--r-- docs/Simple/main.pdf | bin 88783 -> 90674 bytes
-rw-r--r-- docs/Simple/main.tex | 56
-rw-r--r-- docs/Simple/refs.bib | 39
3 files changed, 68 insertions, 27 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index a4a5841..90bc6bd 100644
--- a/docs/Simple/main.pdf
+++ b/docs/Simple/main.pdf
Binary files differ
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index 57b8cde..ea52b54 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -1,4 +1,4 @@
-\documentclass[10pt,twocolumn]{article}
+\documentclass[hidelinks,10pt,twocolumn]{article}
\usepackage{simpleConference}
\usepackage[utf8]{inputenc}
\usepackage{times}
@@ -18,7 +18,7 @@
\begin{document}
-\title{Archive Scholar Reference Dataset}
+\title{Fatcat Reference Dataset}
\author{Martin Czygan \\
\\
@@ -41,17 +41,16 @@ bnewbold@archive.org \\
\begin{abstract}
As part of its scholarly data efforts, the Internet Archive releases a citation
-graph dataset (ASREF) derived from scholarly publications and additional data
-sources. It is composed of data gathered by the fatcat cataloging
-project\footnote{\url{https://fatcat.wiki}}, related web-scale crawls targeting
-primary and secondary scholarly outputs, as well as metadata from the Open
-Library\footnote{\url{https://openlibrary.org}} project, information about
-archived web pages found in the Wayback
-Machine\footnote{\url{https://web.archive.org}} and
+graph dataset, named \emph{refcat}, derived from scholarly publications and
+additional data sources. It is composed of data gathered by the fatcat
+cataloging project\footnote{\url{https://fatcat.wiki}}, related web-scale
+crawls targeting primary and secondary scholarly outputs, as well as metadata
+from the Open Library\footnote{\url{https://openlibrary.org}} project and
Wikipedia\footnote{\url{https://wikipedia.org}}. This first version of the
-graph consists of 1,323,423,672 citations. We release this dataset under a CC0 Public Domain Dedication, accessible through an
-archive collection\footnote{\url{https://archive.org/details/fatcat-asref-todo}}. All code
-used in the derivation process is releases under an MIT
+graph consists of 1,323,423,672 citations. We release this dataset under a CC0
+Public Domain Dedication, accessible through an archive
+collection\footnote{\url{https://archive.org/details/fatcat-asref-todo}}. All
+code used in the derivation process is released under an MIT
license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
\end{abstract}
@@ -62,11 +61,11 @@ license\footnote{\url{https://gitlab.com/internetarchive/cgraph}}.
The Internet Archive releases a first version of a citation graph dataset
derived from a raw corpus of about 2.5B references gathered from metadata and
-from data obtained by PDF extraction tools such as
+data obtained by PDF extraction tools such as
GROBID\cite{lopez2009grobid}. Additionally, we consider integration with
-metadata from Open Library, the Wayback Machine and Wikipedia.
+metadata from Open Library and Wikipedia.
The goal of this report is to describe briefly the current contents and the
-derivation of the Archive Scholar Reference Dataset (ASREF). We expect
+derivation of the dataset (refcat). We expect
this dataset to be iterated upon, with changes both in content and processing.
Modern citation indexes can be traced back to the early computing age, when
@@ -74,37 +73,39 @@ projects like the Science Citation Index (1955)\citep{garfield2007evolution}
were first devised, living on in existing commercial knowledge bases today.
Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
- the first version of which contained 6,325,178 individual
-references\citep{shotton2013publishing}. Other notable sources from that time
+references\citep{shotton2013publishing}. Other notable early projects
include CiteSeerX\citep{wu2019citeseerx} and CitEc\citep{CitEc}. The last
-decade has seen an increase of more openly available reference dataset and
-citation projects, like Microsoft Academic\citep{sinha2015overview} and
+decade has seen the emergence of more openly available, large-scale
+citation projects, like Microsoft Academic\citep{sinha2015overview} or the
Initiative for Open Citations\citep{i4oc}\citep{shotton2018funders}. In 2021,
according to \citep{hutchins2021tipping} over 1B citations are publicly
-available, marking a tipping point for open citations.
+available, marking a tipping point for this category of data.
\section{Related Work}
There are a few large-scale citation datasets available today. COCI, the
``OpenCitations Index of Crossref open DOI-to-DOI citations'' was first
-released 2018-07-29. As of its most recent release on 2021-07-29, it contains
-1,094,394,688 citations across 65,835,422 bibliographic resources.
+released 2018-07-29. As of its most recent release\footnote{\url{https://opencitations.net/download}}, on
+2021-07-29, it contains
+1,094,394,688 citations across 65,835,422 bibliographic
+resources\citep{peroni2020opencitations}.
The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
``a Wikimedia initiative to develop open citations and linked bibliographic
-data to serve free knowledge'' continously adds citations to its data base and
+data to serve free knowledge'' continuously adds citations to its database and
as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
publications\footnote{\url{http://wikicite.org/statistics.html}}.
-Microsoft Academic Graph\footnote{A recent copy has been preserved at
-\url{https://archive.org/details/mag-2021-06-07}} is comprised of a number of
+Microsoft Academic Graph\citep{sinha2015overview} is comprised of a number of
entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
-with \emph{PaperReferences} being one relation among many others. As of 2021-06-07 the
+with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at
+\url{https://archive.org/details/mag-2021-06-07}} the
\emph{PaperReferences} relation contains 1,832,226,781 edges across 123,923,466
bibliographic entities.
Numerous other projects have been or are concerned with various aspects of
-citation discovery and curation, among them Semantic Scholar, CiteSeerX or
-Aminer.
+citation discovery and curation as part of their feature set, among them Semantic
+Scholar\citep{fricke2018semantic}, CiteSeerX\citep{li2006citeseerx} or Aminer\citep{tang2016aminer}.
As mentioned in \citep{hutchins2021tipping}, the number of openly available
citations is not expected to shrink in the future.
@@ -339,5 +340,6 @@ include the details of each rule in this report.}
\end{table}
\bibliographystyle{abbrv}
+% \bibliographystyle{plainnat}
\bibliography{refs}
\end{document}
diff --git a/docs/Simple/refs.bib b/docs/Simple/refs.bib
index bcb8a16..599a386 100644
--- a/docs/Simple/refs.bib
+++ b/docs/Simple/refs.bib
@@ -78,6 +78,15 @@
year={2019}
}
+@inproceedings{li2006citeseerx,
+ title={CiteSeerx: an architecture and web service design for an academic document search engine},
+ author={Li, Huajing and Councill, Isaac and Lee, Wang-Chien and Giles, C Lee},
+ booktitle={Proceedings of the 15th international conference on World Wide Web},
+ pages={883--884},
+ year={2006}
+}
+
+
@inproceedings{sinha2015overview,
title={An overview of microsoft academic service (mas) and applications},
author={Sinha, Arnab and Shen, Zhihong and Song, Yang and Ma, Hao and Eide, Darrin and Hsu, Bo-June and Wang, Kuansan},
@@ -121,3 +130,33 @@ note = {Accessed: 2021-07-30}
publisher={HeinOnline}
}
+@article{peroni2020opencitations,
+ title={OpenCitations, an infrastructure organization for open scholarship},
+ author={Peroni, Silvio and Shotton, David},
+ journal={Quantitative Science Studies},
+ volume={1},
+ number={1},
+ pages={428--444},
+ year={2020},
+ publisher={MIT Press}
+}
+
+@article{fricke2018semantic,
+ title={Semantic scholar},
+ author={Fricke, Suzanne},
+ journal={Journal of the Medical Library Association: JMLA},
+ volume={106},
+ number={1},
+ pages={145},
+ year={2018},
+ publisher={Medical Library Association}
+}
+
+@inproceedings{tang2016aminer,
+ title={AMiner: Toward understanding big scholar data},
+ author={Tang, Jie},
+ booktitle={Proceedings of the ninth ACM international conference on web search and data mining},
+ pages={467--467},
+ year={2016}
+}
+