Diffstat (limited to 'docs/TR-20210808100000-IA-WDS-REFCAT/main.tex')
-rw-r--r-- | docs/TR-20210808100000-IA-WDS-REFCAT/main.tex | 86 |
1 files changed, 57 insertions, 29 deletions
diff --git a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
index ab72699..2a60a77 100644
--- a/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
+++ b/docs/TR-20210808100000-IA-WDS-REFCAT/main.tex
@@ -87,9 +87,9 @@ Initiative for Open Citations\footnote{\url{https://i4oc.org}}~\citep{shotton201
 In 2021, over one billion citations are publicly available, marking a ``tipping
 point'' for this category of data~\citep{hutchins2021tipping}.
 
-While a paper will mainly cite other papers, more citable entities exist such
-as books and web links and within links a variety of targets, such as web
-sites, reference entries, protocols or datasets. References can be extracted
+While a paper will often cite other papers, more citable entities exist such
+as books or web links and within links a variety of targets, such as web
+pages, reference entries, protocols or datasets. References can be extracted
 manually or through more automated methods, such as metadata access and
 structured data extraction from full text documents; the latter offering the
 benefits of scalability. The completeness of bibliographic metadata ranges from
@@ -98,32 +98,60 @@ strings partially describing a publication.
 
 \section{Related Work}
 
-There are a few large scale citation dataset available today. COCI, the
-``OpenCitations Index of Crossref open DOI-to-DOI citations'' was first
-released 2018-07-29. As of its most recent release\footnote{\url{https://opencitations.net/download}}, on
-2021-07-29, it contains
-1,094,394,688 citations across 65,835,422 bibliographic
-resources~\citep{peroni2020opencitations}.
-
-The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
-``a Wikimedia initiative to develop open citations and linked bibliographic
-data to serve free knowledge'' continously adds citations to its database and
-as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
-publications\footnote{\url{http://wikicite.org/statistics.html}}.
-
-Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
-entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
-with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at
- \url{https://archive.org/details/mag-2021-06-07}} the
-\emph{PaperReferences} relation contains 1,832,226,781 rows (edges) across 123,923,466
-bibliographic entities.
-
-Numerous other projects have been or are concerned with various aspects of
-citation discovery and curation as part their feature set, among them Semantic
-Scholar~\citep{fricke2018semantic}, CiteSeerX~\citep{li2006citeseerx} or Aminer~\citep{tang2016aminer}.
-
-As mentioned in~\citep{hutchins2021tipping}, the number of openly available
-citations is not expected to shrink in the future.
+Typical problems arising in the process of compiling a citation graph dataset
+are data aquisition and citation matching. Data acquisition itself can take
+different forms: bibliographic metadata can contain explicit reference data as
+provided by publishers and aggregators; this data can be relatively consistent
+when looked at per source, but may vary in style and comprehensiveness when
+looked at as a whole. Another way of acquiring bibliographic metadata is to
+analyze a source document, such as a PDF (or its text), directly. Tools in this
+category are often based on conditial random
+fields~\citep{lafferty2001conditional} and have been implemented in projects
+such as ParsCit~\citep{councill2008parscit},
+Cermine~\citep{tkaczyk2014cermine}, EXCITE~\citep{hosseini2019excite}
+or GROBID~\citep{lopez2009grobid}.
+
+The problem of citation matching is relatively simple when common, persistent
+identifiers are present in the data. Complications mount, when there is
+\emph{Identity Uncertainty}, that is ``objects are not labeled with unique
+identifiers or when those identifiers may not be perceived
+perfectly''~\citep{pasula2003identity}. CiteSeer has been an early project
+concerned with citation matching~\citep{giles1998citeseer}. A taxonomy of
+potential issues common in the matching process has been compiled
+by~\citep{olensky2016evaluation}. Additional care is required, when the
+citation matching process is done at scale~\citep{fedoryszak2013large}. The
+problem of heterogenity has been discussed in the context of datasets
+by~\citep{mathiak2015challenges}.
+
+
+
+
+% There are a few large scale citation dataset available today. COCI, the
+% ``OpenCitations Index of Crossref open DOI-to-DOI citations'' was first
+% released 2018-07-29. As of its most recent release\footnote{\url{https://opencitations.net/download}}, on
+% 2021-07-29, it contains
+% 1,094,394,688 citations across 65,835,422 bibliographic
+% resources~\citep{peroni2020opencitations}.
+%
+% The WikiCite\footnote{\url{https://meta.wikimedia.org/wiki/WikiCite}} project,
+% ``a Wikimedia initiative to develop open citations and linked bibliographic
+% data to serve free knowledge'' continously adds citations to its database and
+% as of 2021-06-28 tracks 253,719,394 citations across 39,994,937
+% publications\footnote{\url{http://wikicite.org/statistics.html}}.
+%
+% Microsoft Academic Graph~\citep{sinha2015overview} is comprised of a number of
+% entities\footnote{\url{https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema}}
+% with \emph{PaperReferences} being one relation among many others. As of 2021-06-07\footnote{A recent copy has been preserved at
+% \url{https://archive.org/details/mag-2021-06-07}} the
+% \emph{PaperReferences} relation contains 1,832,226,781 rows (edges) across 123,923,466
+% bibliographic entities.
+%
+% Numerous other projects have been or are concerned with various aspects of
+% citation discovery and curation as part their feature set, among them Semantic
+% Scholar~\citep{fricke2018semantic}, CiteSeerX~\citep{li2006citeseerx} or Aminer~\citep{tang2016aminer}.
+%
+% As mentioned in~\citep{hutchins2021tipping}, the number of openly available
+% citations is not expected to shrink in the future.
 
 \section{Dataset}
 