From 730612615d6c3919f98cbb5aeaa9956b8b1a65c7 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Tue, 3 Aug 2021 00:46:12 +0200
Subject: update docs

---
 docs/TR-20210730212057-IA-WDS-CG/main.tex | 13 +++++++++++++
 1 file changed, 13 insertions(+)
(limited to 'docs/TR-20210730212057-IA-WDS-CG/main.tex')

diff --git a/docs/TR-20210730212057-IA-WDS-CG/main.tex b/docs/TR-20210730212057-IA-WDS-CG/main.tex
index faeab73..a7edac3 100644
--- a/docs/TR-20210730212057-IA-WDS-CG/main.tex
+++ b/docs/TR-20210730212057-IA-WDS-CG/main.tex
@@ -246,6 +246,19 @@ time limited. Map and reduce operations are parallelized and certain
 processing steps can process 100K documents per second or even more on
 commodity hardware with spinning disks.
 
+\section{Quality Assurance}
+
+Understanding data quality plays an important role, as the data comes from a
+myriad of sources, each with possibly idiosyncratic features or missing values.
+We employ a few QA measures during the process. First, we try to pass each data
+item through only one processing pipeline (e.g. items matched by any identifier
+should not even be considered for fuzzy matching). If duplicate links appear in
+the final dataset nonetheless, we remove them, preferring exact over fuzzy matches.
+
+We apply a number of data cleaning techniques, e.g. to find and verify
+identifiers like ISBN or to sanitize URLs found in the data. Many of these
+issues stem from the fact that large chunks of the raw data come from
+heuristic data extraction from PDF documents.
 
 \section{Discussion}
 
-- 
cgit v1.2.3
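
To make the deduplication rule and identifier cleaning described in the added
section more concrete, here is a minimal Python sketch. It is not the project's
actual implementation: the link record layout and field names ("source",
"target", "match_type") as well as the function names are illustrative
assumptions. It shows preferring exact over fuzzy matches when removing
duplicate links, and an ISBN-13 check digit verification as one example of
identifier cleaning.

from typing import Dict, Iterable, List, Tuple

# Precedence of match types when deduplicating: lower value wins.
# Illustrative labels; the real pipeline may use different names.
MATCH_PRIORITY = {"exact": 0, "fuzzy": 1}


def dedupe_links(links: Iterable[Dict]) -> List[Dict]:
    """Keep one link per (source, target) pair, preferring exact over fuzzy matches."""
    best: Dict[Tuple[str, str], Dict] = {}
    for link in links:
        key = (link["source"], link["target"])
        kept = best.get(key)
        if kept is None or MATCH_PRIORITY[link["match_type"]] < MATCH_PRIORITY[kept["match_type"]]:
            best[key] = link
    return list(best.values())


def valid_isbn13(isbn: str) -> bool:
    """Verify the ISBN-13 check digit: digits weighted 1, 3, 1, 3, ... must sum to a multiple of 10."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0


if __name__ == "__main__":
    links = [
        {"source": "doc1", "target": "doc2", "match_type": "fuzzy"},
        {"source": "doc1", "target": "doc2", "match_type": "exact"},
    ]
    # The fuzzy duplicate is dropped in favor of the exact match.
    assert [l["match_type"] for l in dedupe_links(links)] == ["exact"]
    assert valid_isbn13("978-0-306-40615-7")
    assert not valid_isbn13("978-0-306-40615-8")

The same precedence rule would extend to additional pipelines by adding entries
to the priority map; the sketch only covers the exact-versus-fuzzy case named
in the text.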