Diffstat (limited to 'docs/TR-20210808100000-IA-WDS-REFCAT')
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf | bin | 140064 -> 140069 bytes
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 62
2 files changed, 33 insertions, 29 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
Binary files differ
index 1ef9640..076b8f3 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.pdf
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index 1b834f9..35d73b1 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -65,7 +65,7 @@
 \section{Introduction}
 
 The Internet Archive releases a first version of a citation graph dataset
-derived from a raw corpus of about 2.5B raw references gathered from metadata
+derived from a corpus of about 2.5B raw references gathered from metadata
 and data obtained by PDF extraction and annotation tools such as
 GROBID~\cite{lopez2009grobid}. Additionally, we consider integration with
 metadata from Open Library and Wikipedia. We expect this dataset to be
@@ -79,16 +79,18 @@ accessible on the web~\citep{khabsa_giles_2014}.
 
 Modern citation indexes can be traced back to the early computing age, when
 projects like the Science Citation Index (1955)~\citep{garfield2007evolution}
-were first devised, living on in commercial knowledge bases today.
-Open alternatives were started such as the Open Citations Corpus (OCC) in 2010
-- the first version of which contained 6,325,178 individual
-references~\citep{shotton2013publishing}. Other notable projects
-include CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and CitEc\footnote{\href{https://citec.repec.org}{https://citec.repec.org}}. The last
-decade has seen the emergence of more openly available, large scale
+were first devised, living on in commercial knowledge bases today. Open
+alternatives were started such as the Open Citations Corpus (OCC) in 2010 - the
+first version of which contained 6,325,178 individual
+references~\citep{shotton2013publishing}. Other notable projects include
+CiteSeer~\citep{giles1998citeseer}, CiteSeerX~\citep{wu2019citeseerx} and
+CitEc\footnote{\href{https://citec.repec.org}{https://citec.repec.org}}. The
+last decade has seen the emergence of more openly available, large scale
 citation projects like Microsoft Academic~\citep{sinha2015overview} and the
-Initiative for Open Citations\footnote{\href{https://i4oc.org}{https://i4oc.org}}~\citep{shotton2018funders}.
-In 2021, over one billion citations are publicly available, marking a ``tipping point''
-for this category of data~\citep{hutchins2021tipping}.
+Initiative for Open
+Citations\footnote{\href{https://i4oc.org}{https://i4oc.org}}~\citep{shotton2018funders}.
+In 2021, over one billion citations are publicly available, marking a ``tipping
+point'' for this category of data~\citep{hutchins2021tipping}.
 
 While a paper will often cite other papers, more citable entities exist such as
 books or web links and within links a variety of targets, such as web pages,
@@ -101,17 +103,17 @@ unclean strings partially describing a scholarly artifact.
 
 \section{Related Work}
 
-Two typical problems in citation graph development are related to data aquisition and citation matching. Data acquisition
-itself can take different forms: bibliographic metadata can contain explicit
-reference data as provided by publishers and aggregators; this data can be
-relatively consistent when looked at per source, but may vary in style and
-comprehensiveness when looked at as a whole. Another way of acquiring
-bibliographic metadata is to analyze a source document, such as a PDF (or its
-text), directly. Tools in this category are often based on conditional random
-fields~\citep{lafferty2001conditional} and have been implemented in projects
-such as ParsCit~\citep{councill2008parscit},
-Cermine~\citep{tkaczyk2014cermine}, EXCITE~\citep{hosseini2019excite} or
-GROBID~\citep{lopez2009grobid}.
+Two typical problems in citation graph development are related to data
+aquisition and citation matching. Data acquisition itself can take different
+forms: bibliographic metadata can contain explicit reference data as provided
+by publishers and aggregators; this data can be relatively consistent when
+looked at per source, but may vary in style and comprehensiveness when looked
+at as a whole. Another way of acquiring bibliographic metadata is to analyze a
+source document, such as a PDF (or its text), directly. Tools in this category
+are often based on conditional random fields~\citep{lafferty2001conditional}
+and have been implemented in projects such as
+ParsCit~\citep{councill2008parscit}, Cermine~\citep{tkaczyk2014cermine},
+EXCITE~\citep{hosseini2019excite} or GROBID~\citep{lopez2009grobid}.
 
 The problem of citation matching is relatively simple when common, persistent
 identifiers are present in the data. Complications mount, when there is
@@ -125,14 +127,16 @@ citation matching process is done at scale~\citep{fedoryszak2013large}.
 The problem of heterogenity has been discussed in the context of datasets
 by~\citep{mathiak2015challenges}.
 
-Projects centered around citations or containing citation data as a core
-component are COCI, the ``OpenCitations Index of Crossref open DOI-to-DOI
-citations'', which was first released
-2018-07-29\footnote{\href{https://opencitations.net/download}{https://opencitations.net/download}} and has been
-regularly updated since~\citep{peroni2020opencitations}. The WikiCite\footnote{\href{https://meta.wikimedia.org/wiki/WikiCite}{https://meta.wikimedia.org/wiki/WikiCite}} project,
-``a Wikimedia initiative to develop open citations and linked bibliographic
-data to serve free knowledge'' continously adds citations to its
-database\footnote{\href{http://wikicite.org/statistics.html}{http://wikicite.org/statistics.html}}. Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
+Projects and datasets centered around citations or containing citation data as
+a core component are COCI, the ``OpenCitations Index of Crossref open
+DOI-to-DOI citations'', which was first released
+2018-07-29\footnote{\href{https://opencitations.net/download}{https://opencitations.net/download}}
+and has been regularly updated since~\citep{peroni2020opencitations}. The
+WikiCite\footnote{\href{https://meta.wikimedia.org/wiki/WikiCite}{https://meta.wikimedia.org/wiki/WikiCite}}
+project, ``a Wikimedia initiative to develop open citations and linked
+bibliographic data to serve free knowledge'' continously adds citations to its
+database\footnote{\href{http://wikicite.org/statistics.html}{http://wikicite.org/statistics.html}}.
+Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
 entities\footnote{\href{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
 with \emph{PaperReferences} being one relation among many others.