diff options
author | Martin Czygan <martin.czygan@gmail.com> | 2021-08-05 18:40:29 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-08-05 18:40:29 +0200 |
commit | 0f9f1386d9ad66040d5a546bf037f41c173a84ab (patch) | |
tree | d2ab2dec67fe2d1dd20ec5ee1c2ec3619f108841 | |
parent | 4ef80e0424b25b700597a604686c5f794e8b36d2 (diff) | |
download | refcat-0f9f1386d9ad66040d5a546bf037f41c173a84ab.tar.gz refcat-0f9f1386d9ad66040d5a546bf037f41c173a84ab.zip |
wip: paper
-rw-r--r-- | docs/Simple/main.pdf | bin | 99828 -> 88783 bytes | |||
-rw-r--r-- | docs/Simple/main.tex | 31 |
2 files changed, 21 insertions, 10 deletions
diff --git a/docs/Simple/main.pdf b/docs/Simple/main.pdf Binary files differindex 469fa95..3a66061 100644 --- a/docs/Simple/main.pdf +++ b/docs/Simple/main.pdf diff --git a/docs/Simple/main.tex b/docs/Simple/main.tex index 36f2074..b0fbaed 100644 --- a/docs/Simple/main.tex +++ b/docs/Simple/main.tex @@ -10,6 +10,7 @@ \usepackage{booktabs} % professional-quality tables \usepackage{amsfonts} % blackboard math symbols \usepackage{nicefrac} % compact symbols for 1/2, etc. +\usepackage{caption} \usepackage{datetime} \providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1} @@ -158,14 +159,15 @@ TODO: how matches are established and a short note on overlap with COCI DOI. \section{System Design} The constraints for the systems design are informed by the volume and the -variety of the data. In total, the raw inputs amount to a few TB of textual -content, mostly newline delimited JSON. More importantly, while the number of -data fields is low, certain schemas are very partial with hundreds of different -combinations of available field values found in the raw reference data. This is -most likely caused by aggregators passing on reference data coming from -hundreds of sources, each of which not necessarily agreeing on a common -granularity for citation data and from artifacts of machine learning based -structured data extraction tools. +variety of the data. The capability to run the graph whole derivation on a +single machine (commodity hardware) was a minor goal as well. In total, the raw inputs amount to a few +TB of textual content, mostly newline delimited JSON. More importantly, while +the number of data fields is low, certain schemas are very partial with +hundreds of different combinations of available field values found in the raw +reference data. This is most likely caused by aggregators passing on reference +data coming from hundreds of sources, each of which not necessarily agreeing on +a common granularity for citation data and from artifacts of machine learning +based structured data extraction tools. Each combination of fields may require a slightly different processing path. For example, references with an Arxiv identifier can be processed differently @@ -262,8 +264,17 @@ harder or not possible at all. \section{Appendix A} + + +A note on data quality: While we implement various data quality measures, +real-world data, especially coming from many different sources will contain +errors and bugs. Among other measures, we keep track of match reasons, +especially for fuzzy matching to be able to zoom in on systematic errors a bit +more easily (see~Table~\ref{table:matches}). + \begin{table}[] \footnotesize + \captionsetup{font=normalsize} \begin{center} \begin{tabular}{@{}rlll@{}} \toprule @@ -319,11 +330,11 @@ harder or not possible at all. 1 & fatcat & exact & doi \\ \bottomrule \end{tabular} \vspace*{2mm} - \caption{Table of match counts, reference provenance, match status and + \caption{Table of match counts, reference provenance, match status and match reason. The match reason identifier encode a specific rule in the domain dependent verification process and are included for completeness - we do not include the details of each rule in this report.} - \label{table:fields} + \label{table:matches} \end{center} \end{table} |