From 271e21820c8e9255e4bb31a1aac70a16b3e6f7a0 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Thu, 5 Aug 2021 15:12:49 +0200
Subject: wip: paper

---
 docs/Simple/main.pdf | Bin 95379 -> 95909 bytes
 docs/Simple/main.tex | 30 +++++++++++++++++++++++++-----
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index 399f5a2..9d8b292 100644
Binary files a/docs/Simple/main.pdf and b/docs/Simple/main.pdf differ
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index ca26e19..fd47f35 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -98,7 +98,12 @@ with PaperReferences being one relation among many others. As of 2021-06-07 the
 PaperReferences relation contains 1,832,226,781 edges across YYY bibliographic
 entities.
 
-TODO: COCI MAG Wikicite Citeseer, Parsecit, Aminer, Semantic Scholar
+Numerous other projects have been or are concerned with various aspects of
+citation discovery and curation, among them Semantic Scholar, CiteSeerX, and
+Aminer.
+
+As mentioned in \citep{hutchins2021tipping}, the number of openly available
+citations is not expected to shrink in the future.
 
 
 \section{Citation Dataset}
@@ -113,6 +118,8 @@ of use, we include DOI as well, if available.
 
 The dataset currently contains X unique bibliographic entities and Y citations.
 
+TODO: how matches are established and a short note on overlap with COCI DOI.
+
 
 \section{System Design}
 
@@ -185,14 +192,27 @@ verification, in order to control precision.
 \section{Fuzzy Matching Approach}
 \section{Quality Assurance}
 
-In general a short summarizing paragraph will do, and under no circumstances should the paragraph simply repeat material from the Abstract or Introduction. In some cases it's possible to now make the original claims more concrete, e.g., by referring to quantitative performance results.
 
 \section{Future Work}
 
-This material is important -- part of the value of a paper is showing how the work sets new research directions. I like bullet lists here. A couple of things to keep in mind:
+As with other datasets in this field, we expect this dataset to be iterated upon.
+
 \begin{description}
- \item[$\bullet$] If you're actively engaged in follow-up work, say so. E.g.: ``We are currently extending the algorithm to... blah blah, and preliminary results are encouraging." This statement serves to mark your territory.
-\item[$\bullet$] Conversely, be aware that some researchers look to Future Work sections for research topics. My opinion is that there's nothing wrong with that -- consider it a compliment.
+ \item[$\bullet$] The fatcat catalog updates its metadata
+ continuously\footnote{A changelog can currently be followed here:
+ \url{fatcat.wiki/changelog}} and web crawls are regularly conducted. Current processing pipelines cover raw reference snapshot creation and the rederivation of the graph contained within.
+
+ \item[$\bullet$] Metadata extraction from PDFs depends on machine learning
+ models, which in turn depend on training sets. With additional crawls and
+ metadata available, we hope to improve the models used for metadata
+ extraction, reducing data extraction artifacts in the process.
+
+ \item[$\bullet$] As of this version, a significant number of raw reference
+ docs remain unmatched, which means that neither exact nor fuzzy matching
+ can recover a link to a known entity. On the one
+ hand, this can hint at missing metadata. On the other hand, parts of the
+ data will contain a reference to a catalogued entity, but in a specific,
+ dense, and harder-to-recover form.
 \end{description}
 
 \section{Acknowledgements}
-- 
cgit v1.2.3