author Martin Czygan <martin.czygan@gmail.com> 2021-08-05 18:40:29 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-08-05 18:40:29 +0200
commit 0f9f1386d9ad66040d5a546bf037f41c173a84ab (patch)
tree d2ab2dec67fe2d1dd20ec5ee1c2ec3619f108841
parent 4ef80e0424b25b700597a604686c5f794e8b36d2 (diff)
wip: paper
-rw-r--r-- docs/Simple/main.pdf bin 99828 -> 88783 bytes
-rw-r--r-- docs/Simple/main.tex 31
2 files changed, 21 insertions, 10 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf
index 469fa95..3a66061 100644
--- a/docs/Simple/main.pdf
+++ b/docs/Simple/main.pdf
Binary files differ
diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex
index 36f2074..b0fbaed 100644
--- a/docs/Simple/main.tex
+++ b/docs/Simple/main.tex
@@ -10,6 +10,7 @@
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
+\usepackage{caption}
\usepackage{datetime}
\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
@@ -158,14 +159,15 @@ TODO: how matches are established and a short note on overlap with COCI DOI.
\section{System Design}
The constraints for the systems design are informed by the volume and the
-variety of the data. In total, the raw inputs amount to a few TB of textual
-content, mostly newline delimited JSON. More importantly, while the number of
-data fields is low, certain schemas are very partial with hundreds of different
-combinations of available field values found in the raw reference data. This is
-most likely caused by aggregators passing on reference data coming from
-hundreds of sources, each of which not necessarily agreeing on a common
-granularity for citation data and from artifacts of machine learning based
-structured data extraction tools.
+variety of the data. The capability to run the whole graph derivation on a
+single machine (commodity hardware) was a minor goal as well. In total, the raw inputs amount to a few
+TB of textual content, mostly newline delimited JSON. More importantly, while
+the number of data fields is low, certain schemas are very partial with
+hundreds of different combinations of available field values found in the raw
+reference data. This is most likely caused by aggregators passing on reference
+data coming from hundreds of sources, which do not necessarily agree on
+a common granularity for citation data, and by artifacts of machine learning
+based structured data extraction tools.
Each combination of fields may require a slightly different processing path.
For example, references with an Arxiv identifier can be processed differently
@@ -262,8 +264,17 @@ harder or not possible at all.
\section{Appendix A}
+
+
+A note on data quality: While we implement various data quality measures,
+real-world data, especially when coming from many different sources, will
+contain errors and bugs. Among other measures, we keep track of match reasons,
+especially for fuzzy matching, to be able to zoom in on systematic errors a bit
+more easily (see~Table~\ref{table:matches}).
+
\begin{table}[]
\footnotesize
+ \captionsetup{font=normalsize}
\begin{center}
\begin{tabular}{@{}rlll@{}}
\toprule
@@ -319,11 +330,11 @@ harder or not possible at all.
1 & fatcat & exact & doi \\ \bottomrule
\end{tabular}
\vspace*{2mm}
- \caption{Table of match counts, reference provenance, match status and
+ \caption{Table of match counts, reference provenance, match status and
match reason. The match reason identifiers encode specific rules in the
domain-dependent verification process and are included for completeness; we do
not include the details of each rule in this report.}
- \label{table:fields}
+ \label{table:matches}
\end{center}
\end{table}
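
The appendix note above describes keeping track of match reasons so that systematic errors are easier to find. A minimal sketch of that bookkeeping follows; it is not taken from the refcat code base. The record layout (provenance, status, reason) mirrors the table columns, while the function names and all sample values other than the `fatcat`/`exact`/`doi` row are hypothetical.

```python
from collections import Counter

# Hypothetical match records as (provenance, status, reason) tuples.
# Only ("fatcat", "exact", "doi") appears in the table above; the
# other values are placeholders for illustration.
SAMPLE_MATCHES = [
    ("fatcat", "exact", "doi"),
    ("fatcat", "exact", "doi"),
    ("source_a", "fuzzy", "reason_x"),
]


def count_matches(records):
    """Count how often each (provenance, status, reason) combination occurs."""
    return Counter(records)


def table_rows(counts):
    """Return (count, provenance, status, reason) rows, largest count first.

    Ties are broken by the tuple itself so the order is deterministic.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(n, prov, status, reason) for (prov, status, reason), n in ordered]


if __name__ == "__main__":
    for row in table_rows(count_matches(SAMPLE_MATCHES)):
        # Emit rows in the same column order as the LaTeX table.
        print(" & ".join(str(col) for col in row) + r" \\")
```

Grouping by the full combination rather than by reason alone keeps the provenance visible, so an unexpectedly frequent fuzzy-match reason can be traced back to the source that produced it.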