author | Martin Czygan <martin.czygan@gmail.com> | 2021-08-05 15:29:31 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-08-05 15:29:31 +0200 |
commit | 84c90811696d07257985295088e18a63e1d6cc21 (patch) | |
tree | ec7de8c665d909d92f9c847d03575d18d33457d3 | |
parent | 271e21820c8e9255e4bb31a1aac70a16b3e6f7a0 (diff) | |
wip: paper, add table
-rw-r--r-- | docs/Simple/main.pdf | bin | 95909 -> 97848 bytes | |||
-rw-r--r-- | docs/Simple/main.tex | 88 |
2 files changed, 79 insertions, 9 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
Binary files differ
index 9d8b292..8fe89a9 100644
--- a/docs/Simple/main.pdf
+++ b/docs/Simple/main.pdf
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index fd47f35..e88e6fd 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -197,33 +197,103 @@ verification, in order to control precision.
 As other dataset in this field we expect this dataset to be iterated upon.

-\begin{description}
- \item[$\bullet$] The fatcat catalog updates its metadata
+\begin{itemize}
+ \item The fatcat catalog updates its metadata
 continously\footnote{A changelog can currenly be followed here:
- \url{fatcat.wiki/changelog}} and web crawls are regularly conducted. Current processing pipelines cover raw reference snapshot creation and the rederivation the graph contained within.
+ \url{fatcat.wiki/changelog}} and web crawls are conducted regularly.
+ Current processing pipelines cover raw reference snapshot creation and
+ derivation the graph structure.

- \item[$\bullet$] Metadata extraction from PDFs depends on machine learning
+ \item Metadata extraction from PDFs depends on machine learning
 models, which in turn depend training sets. With additional crawls and
 metadata available we hope to improve models used for metadata extraction,
 reducing data extraction artifacts in the process.

- \item[$\bullet$] As of this version, a significant number of raw reference
+ \item As of this version, a significant number of raw reference
 docs remain unmatched, which means that neither exact or fuzzy matching
 can recover a link to a known entity. On the one hand, this can hint at
 missing metadata. However, parts of the data will contain a reference to a
 catalogued entity, but in a specific, dense and harder to recover form.
-\end{description}
+ \end{itemize}

 \section{Acknowledgements}

-Don't forget them or you'll have people with hurt feelings. Acknowledge anyone who contributed in any way: through discussions, feedback on drafts, implementation, etc. If in doubt about whether to include someone, include them.
+This work is partially supported by a grant from the \emph{Andrew W. Mellon
+Foundation}. We like to thanks various teams at the Internet Archive for
+providing necessary infrastructure, and also data processing expertise. We are
+also indebted to various open source software tools and their maintainers as
+well as open scholarly data projects - without those this work would be much
+harder or not possible at all.
-\section{Citations}
-
 \section{Appendix A}

+\begin{table}[]
+ \footnotesize
+ \begin{center}
+\begin{tabular}{@{}rlll@{}}
+\toprule
+\textbf{Count} & \textbf{Provenance} & \textbf{Status} & \textbf{Reason} \\ \midrule
+934932865 & crossref & exact & doi \\
+151366108 & fatcat-datacite & exact & doi \\
+65345275 & fatcat-pubmed & exact & pmid \\
+48778607 & fuzzy & strong & jaccardauthors \\
+42465250 & grobid & exact & doi \\
+29197902 & fatcat-pubmed & exact & doi \\
+19996327 & fatcat-crossref & exact & doi \\
+11996694 & fuzzy & strong & slugtitleauthormatch \\
+9157498 & fuzzy & strong & tokenizedauthors \\
+3547594 & grobid & exact & arxiv \\
+2310025 & fuzzy & exact & titleauthormatch \\
+1496515 & grobid & exact & pmid \\
+680722 & crossref & strong & jaccardauthors \\
+476331 & fuzzy & strong & versioneddoi \\
+449271 & grobid & exact & isbn \\
+230645 & fatcat-crossref & strong & jaccardauthors \\
+190578 & grobid & strong & jaccardauthors \\
+156657 & crossref & exact & isbn \\
+123681 & fatcat-pubmed & strong & jaccardauthors \\
+79328 & crossref & exact & arxiv \\
+57414 & crossref & strong & tokenizedauthors \\
+53480 & fuzzy & strong & pmiddoipair \\
+52453 & fuzzy & strong & dataciterelatedid \\
+47119 & grobid & strong & slugtitleauthormatch \\
+36774 & fuzzy & strong & arxivversion \\
+35311 & fuzzy & strong & customieeearxiv \\
+33863 & grobid & exact & pmcid \\
+23504 & crossref & strong & slugtitleauthormatch \\
+22753 & fatcat-crossref & strong & tokenizedauthors \\
+17720 & grobid & exact & titleauthormatch \\
+14656 & crossref & exact & titleauthormatch \\
+14438 & grobid & strong & tokenizedauthors \\
+7682 & fatcat-crossref & exact & arxiv \\
+5972 & fatcat-crossref & exact & isbn \\
+5525 & fatcat-pubmed & exact & arxiv \\
+4290 & fatcat-pubmed & strong & tokenizedauthors \\
+2745 & fatcat-pubmed & exact & isbn \\
+2342 & fatcat-pubmed & strong & slugtitleauthormatch \\
+2273 & fatcat-crossref & strong & slugtitleauthormatch \\
+1960 & fuzzy & exact & workid \\
+1150 & fatcat-crossref & exact & titleauthormatch \\
+1041 & fatcat-pubmed & exact & titleauthormatch \\
+895 & fuzzy & strong & figshareversion \\
+317 & fuzzy & strong & titleartifact \\
+82 & grobid & strong & titleartifact \\
+33 & crossref & strong & titleartifact \\
+5 & fuzzy & strong & custombsiundated \\
+1 & fuzzy & strong & custombsisubdoc \\
+1 & fatcat & exact & doi \\ \bottomrule
+\end{tabular}
+ \vspace*{2mm}
+ \caption{Table of match counts, reference provenance, match status and
+match reason. The match reason identifier encode a specific rule in the domain
+dependent verification process and are included for completeness - we do not
+include the details of each rule in this report.}
+ \label{table:fields}
+\end{center}
+\end{table}
+
 \bibliographystyle{abbrv}
 \bibliography{refs}
 \end{document}