author Martin Czygan <martin.czygan@gmail.com> 2021-08-05 15:29:31 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-08-05 15:29:31 +0200
commit 84c90811696d07257985295088e18a63e1d6cc21
tree ec7de8c665d909d92f9c847d03575d18d33457d3
parent 271e21820c8e9255e4bb31a1aac70a16b3e6f7a0
wip: paper, add table
-rw-r--r-- docs/Simple/main.pdf | bin 95909 -> 97848 bytes
-rw-r--r-- docs/Simple/main.tex | 88
2 files changed, 79 insertions, 9 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index 9d8b292..8fe89a9 100644
--- a/docs/Simple/main.pdf
+++ b/docs/Simple/main.pdf
Binary files differ
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index fd47f35..e88e6fd 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -197,33 +197,103 @@ verification, in order to control precision.
As with other datasets in this field, we expect this dataset to be iterated upon.
-\begin{description}
- \item[$\bullet$] The fatcat catalog updates its metadata
+\begin{itemize}
+ \item The fatcat catalog updates its metadata
continuously\footnote{A changelog can currently be followed here:
- \url{fatcat.wiki/changelog}} and web crawls are regularly conducted. Current processing pipelines cover raw reference snapshot creation and the rederivation the graph contained within.
+ \url{fatcat.wiki/changelog}} and web crawls are conducted regularly.
+ Current processing pipelines cover raw reference snapshot creation and
+ derivation of the graph structure.
- \item[$\bullet$] Metadata extraction from PDFs depends on machine learning
+ \item Metadata extraction from PDFs depends on machine learning
models, which in turn depend on training sets. With additional crawls and
metadata available we hope to improve models used for metadata
extraction, reducing data extraction artifacts in the process.
- \item[$\bullet$] As of this version, a significant number of raw reference
+ \item As of this version, a significant number of raw reference
docs remain unmatched, which means that neither exact nor fuzzy matching
can recover a link to a known entity. On the one
hand, this can hint at missing metadata. However, parts of the data
will contain a reference to a catalogued entity, but in a specific,
dense and harder to recover form.
-\end{description}
+ \end{itemize}
\section{Acknowledgements}
-Don't forget them or you'll have people with hurt feelings. Acknowledge anyone who contributed in any way: through discussions, feedback on drafts, implementation, etc. If in doubt about whether to include someone, include them.
+This work is partially supported by a grant from the \emph{Andrew W. Mellon
+Foundation}. We would like to thank various teams at the Internet Archive for
+providing necessary infrastructure and data processing expertise. We are
+also indebted to various open source software tools and their maintainers as
+well as open scholarly data projects; without those this work would be much
+harder or not possible at all.
-\section{Citations}
-
\section{Appendix A}
+\begin{table}[]
+ \footnotesize
+ \begin{center}
+\begin{tabular}{@{}rlll@{}}
+\toprule
+\textbf{Count} & \textbf{Provenance} & \textbf{Status} & \textbf{Reason} \\ \midrule
+934932865 & crossref & exact & doi \\
+151366108 & fatcat-datacite & exact & doi \\
+65345275 & fatcat-pubmed & exact & pmid \\
+48778607 & fuzzy & strong & jaccardauthors \\
+42465250 & grobid & exact & doi \\
+29197902 & fatcat-pubmed & exact & doi \\
+19996327 & fatcat-crossref & exact & doi \\
+11996694 & fuzzy & strong & slugtitleauthormatch \\
+9157498 & fuzzy & strong & tokenizedauthors \\
+3547594 & grobid & exact & arxiv \\
+2310025 & fuzzy & exact & titleauthormatch \\
+1496515 & grobid & exact & pmid \\
+680722 & crossref & strong & jaccardauthors \\
+476331 & fuzzy & strong & versioneddoi \\
+449271 & grobid & exact & isbn \\
+230645 & fatcat-crossref & strong & jaccardauthors \\
+190578 & grobid & strong & jaccardauthors \\
+156657 & crossref & exact & isbn \\
+123681 & fatcat-pubmed & strong & jaccardauthors \\
+79328 & crossref & exact & arxiv \\
+57414 & crossref & strong & tokenizedauthors \\
+53480 & fuzzy & strong & pmiddoipair \\
+52453 & fuzzy & strong & dataciterelatedid \\
+47119 & grobid & strong & slugtitleauthormatch \\
+36774 & fuzzy & strong & arxivversion \\
+35311 & fuzzy & strong & customieeearxiv \\
+33863 & grobid & exact & pmcid \\
+23504 & crossref & strong & slugtitleauthormatch \\
+22753 & fatcat-crossref & strong & tokenizedauthors \\
+17720 & grobid & exact & titleauthormatch \\
+14656 & crossref & exact & titleauthormatch \\
+14438 & grobid & strong & tokenizedauthors \\
+7682 & fatcat-crossref & exact & arxiv \\
+5972 & fatcat-crossref & exact & isbn \\
+5525 & fatcat-pubmed & exact & arxiv \\
+4290 & fatcat-pubmed & strong & tokenizedauthors \\
+2745 & fatcat-pubmed & exact & isbn \\
+2342 & fatcat-pubmed & strong & slugtitleauthormatch \\
+2273 & fatcat-crossref & strong & slugtitleauthormatch \\
+1960 & fuzzy & exact & workid \\
+1150 & fatcat-crossref & exact & titleauthormatch \\
+1041 & fatcat-pubmed & exact & titleauthormatch \\
+895 & fuzzy & strong & figshareversion \\
+317 & fuzzy & strong & titleartifact \\
+82 & grobid & strong & titleartifact \\
+33 & crossref & strong & titleartifact \\
+5 & fuzzy & strong & custombsiundated \\
+1 & fuzzy & strong & custombsisubdoc \\
+1 & fatcat & exact & doi \\ \bottomrule
+\end{tabular}
+ \vspace*{2mm}
+ \caption{Table of match counts, reference provenance, match status and
+match reason. The match reason identifiers encode specific rules in the
+domain-dependent verification process and are included for completeness; we do not
+include the details of each rule in this report.}
+ \label{table:fields}
+\end{center}
+\end{table}
+
\bibliographystyle{abbrv}
\bibliography{refs}
\end{document}