From 84c90811696d07257985295088e18a63e1d6cc21 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Thu, 5 Aug 2021 15:29:31 +0200
Subject: wip: paper, add table

---
 docs/Simple/main.pdf | Bin 95909 -> 97848 bytes
 docs/Simple/main.tex |  88 +++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 79 insertions(+), 9 deletions(-)

(limited to 'docs')

diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index 9d8b292..8fe89a9 100644
Binary files a/docs/Simple/main.pdf and b/docs/Simple/main.pdf differ
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index fd47f35..e88e6fd 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -197,33 +197,103 @@ verification, in order to control precision.

 As with other datasets in this field, we expect this dataset to be iterated upon.

-\begin{description}
-    \item[$\bullet$] The fatcat catalog updates its metadata
+\begin{itemize}
+    \item The fatcat catalog updates its metadata
         continuously\footnote{A changelog can currently be followed here:
-        \url{fatcat.wiki/changelog}} and web crawls are regularly conducted. Current processing pipelines cover raw reference snapshot creation and the rederivation the graph contained within.
+        \url{fatcat.wiki/changelog}} and web crawls are conducted regularly.
+        Current processing pipelines cover raw reference snapshot creation and
+        derivation of the graph structure.

-    \item[$\bullet$] Metadata extraction from PDFs depends on machine learning
+    \item Metadata extraction from PDFs depends on machine learning
         models, which in turn depend on training sets. With additional crawls and
         metadata available we hope to improve models used for metadata
         extraction, reducing data extraction artifacts in the process.

-    \item[$\bullet$] As of this version, a significant number of raw reference
+    \item As of this version, a significant number of raw reference
         docs remain unmatched, which means that neither exact nor fuzzy matching
         can recover a link to a known entity.
         On the one hand, this can hint at missing metadata. However, parts of
         the data will contain a reference to a catalogued entity, but in a
         specific, dense and harder to recover form.
-\end{description}
+    \end{itemize}


 \section{Acknowledgements}

-Don't forget them or you'll have people with hurt feelings. Acknowledge anyone who contributed in any way: through discussions, feedback on drafts, implementation, etc. If in doubt about whether to include someone, include them.
+This work is partially supported by a grant from the \emph{Andrew W. Mellon
+Foundation}. We would like to thank various teams at the Internet Archive for
+providing necessary infrastructure, and also data processing expertise. We are
+also indebted to various open source software tools and their maintainers as
+well as open scholarly data projects; without those this work would be much
+harder or not possible at all.

-\section{Citations}
-
 \section{Appendix A}

+\begin{table}[]
+    \footnotesize
+    \begin{center}
+\begin{tabular}{@{}rlll@{}}
+\toprule
+\textbf{Count} & \textbf{Provenance} & \textbf{Status} & \textbf{Reason} \\ \midrule
+934932865 & crossref & exact & doi \\
+151366108 & fatcat-datacite & exact & doi \\
+65345275 & fatcat-pubmed & exact & pmid \\
+48778607 & fuzzy & strong & jaccardauthors \\
+42465250 & grobid & exact & doi \\
+29197902 & fatcat-pubmed & exact & doi \\
+19996327 & fatcat-crossref & exact & doi \\
+11996694 & fuzzy & strong & slugtitleauthormatch \\
+9157498 & fuzzy & strong & tokenizedauthors \\
+3547594 & grobid & exact & arxiv \\
+2310025 & fuzzy & exact & titleauthormatch \\
+1496515 & grobid & exact & pmid \\
+680722 & crossref & strong & jaccardauthors \\
+476331 & fuzzy & strong & versioneddoi \\
+449271 & grobid & exact & isbn \\
+230645 & fatcat-crossref & strong & jaccardauthors \\
+190578 & grobid & strong & jaccardauthors \\
+156657 & crossref & exact & isbn \\
+123681 & fatcat-pubmed & strong & jaccardauthors \\
+79328 & crossref & exact & arxiv \\
+57414 & crossref & strong & tokenizedauthors \\
+53480 & fuzzy & strong & pmiddoipair \\
+52453 & fuzzy & strong & dataciterelatedid \\
+47119 & grobid & strong & slugtitleauthormatch \\
+36774 & fuzzy & strong & arxivversion \\
+35311 & fuzzy & strong & customieeearxiv \\
+33863 & grobid & exact & pmcid \\
+23504 & crossref & strong & slugtitleauthormatch \\
+22753 & fatcat-crossref & strong & tokenizedauthors \\
+17720 & grobid & exact & titleauthormatch \\
+14656 & crossref & exact & titleauthormatch \\
+14438 & grobid & strong & tokenizedauthors \\
+7682 & fatcat-crossref & exact & arxiv \\
+5972 & fatcat-crossref & exact & isbn \\
+5525 & fatcat-pubmed & exact & arxiv \\
+4290 & fatcat-pubmed & strong & tokenizedauthors \\
+2745 & fatcat-pubmed & exact & isbn \\
+2342 & fatcat-pubmed & strong & slugtitleauthormatch \\
+2273 & fatcat-crossref & strong & slugtitleauthormatch \\
+1960 & fuzzy & exact & workid \\
+1150 & fatcat-crossref & exact & titleauthormatch \\
+1041 & fatcat-pubmed & exact & titleauthormatch \\
+895 & fuzzy & strong & figshareversion \\
+317 & fuzzy & strong & titleartifact \\
+82 & grobid & strong & titleartifact \\
+33 & crossref & strong & titleartifact \\
+5 & fuzzy & strong & custombsiundated \\
+1 & fuzzy & strong & custombsisubdoc \\
+1 & fatcat & exact & doi \\ \bottomrule
+\end{tabular}
+    \vspace*{2mm}
+    \caption{Table of match counts, reference provenance, match status and
+match reason. The match reason identifiers encode specific rules in the
+domain-dependent verification process and are included for completeness; we do
+not include the details of each rule in this report.}
+    \label{table:fields}
+\end{center}
+\end{table}
+
 \bibliographystyle{abbrv}
 \bibliography{refs}
 \end{document}
-- 
cgit v1.2.3