author | Martin Czygan <martin.czygan@gmail.com> | 2021-08-03 00:46:12 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-08-03 00:46:12 +0200 |
commit | 730612615d6c3919f98cbb5aeaa9956b8b1a65c7 (patch) | |
tree | c99afb718f110bd4b55d487238afae550531e5e5 /docs/TR-20210730212057-IA-WDS-CG/main.tex | |
parent | 26531d7becbf37f8e79681af304ec8a67efc9bf7 (diff) | |
update docs
Diffstat (limited to 'docs/TR-20210730212057-IA-WDS-CG/main.tex')
-rw-r--r-- | docs/TR-20210730212057-IA-WDS-CG/main.tex | 13 |
1 files changed, 13 insertions, 0 deletions
diff --git a/docs/TR-20210730212057-IA-WDS-CG/main.tex b/docs/TR-20210730212057-IA-WDS-CG/main.tex
index faeab73..a7edac3 100644
--- a/docs/TR-20210730212057-IA-WDS-CG/main.tex
+++ b/docs/TR-20210730212057-IA-WDS-CG/main.tex
@@ -246,6 +246,19 @@ time limited. Map and reduce operations are parallelized and certain
 processing steps can process 100K documents per second or even more on
 commodity hardware with spinning disks.
 
+\section{Quality Assurance}
+
+Understanding data quality plays a role, as the data is coming from a myriad of
+sources, each with possible idiosyncratic features or missing values. We employ
+a few QA measures during the process. First, we try to pass each data item
+through only one processing pipeline (e.g. items matched by any identifier
+should not even be considered for fuzzy matching). If duplicate links appear in
+the final dataset nonetheless, we remove them, preferring exact over fuzzy matches.
+
+We employ a couple of data cleaning techniques, e.g. to find and verify
+identifiers like ISBN or to sanitize URLs found in the data. Many of these
+artifacts stem from the fact that large chunks of the raw data come from
+heuristic data extraction from PDF documents.
 
 \section{Discussion}
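The two QA measures added in this hunk (dropping duplicate links in favor of exact matches, and verifying identifiers such as ISBN) could look roughly like the sketch below. It is not taken from the refcat code base: the `Link` record, its field names, the `"exact"`/`"fuzzy"` labels, and the assumption that a duplicate means the same (source, target) pair are all hypothetical. The ISBN check uses the standard ISBN-13 check digit rule.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class Link(NamedTuple):
    source: str  # hypothetical identifier of the citing document
    target: str  # hypothetical identifier of the cited document
    match: str   # assumed labels: "exact" or "fuzzy"


# Lower rank wins; unknown match types are treated as worst.
RANK = {"exact": 0, "fuzzy": 1}


def dedupe_links(links: Iterable[Link]) -> list[Link]:
    """Keep one link per (source, target) pair, preferring exact matches."""
    best: dict[tuple[str, str], Link] = {}
    for link in links:
        key = (link.source, link.target)
        current = best.get(key)
        if current is None or RANK.get(link.match, 99) < RANK.get(current.match, 99):
            best[key] = link
    return list(best.values())


def is_valid_isbn13(value: str) -> bool:
    """Verify the ISBN-13 check digit after stripping hyphens and spaces."""
    digits = value.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Digits are weighted 1, 3, 1, 3, ...; a valid number sums to a multiple of 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0
```

Keeping the best link in a dictionary keyed by the pair means a single pass over the data is enough, and a later exact match replaces an earlier fuzzy one without changing the output order.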