author Bryan Newbold <bnewbold@archive.org> 2020-06-03 19:30:15 -0700
committer Bryan Newbold <bnewbold@archive.org> 2020-06-03 19:32:50 -0700
commit f9035c7ca9637668911afa7e9345138563aad33e
tree f6bd0f817190e315d9e8b0016ab1a7e0d5c73c7f /tests/test_scrub.py
parent 9722f39e38a45d3201c836f0c2805ae9f6c1f581
improve text scrubbing
Was going to use textpipe, but the dependency was too large and failed to install with a halfway-modern GCC (due to a CLD2 issue): https://github.com/GregBowyer/cld2-cffi/issues/12
So instead basically pulled out the clean_text function, which is quite short.
Diffstat (limited to 'tests/test_scrub.py')
-rw-r--r-- tests/test_scrub.py 15
1 files changed, 15 insertions, 0 deletions
diff --git a/tests/test_scrub.py b/tests/test_scrub.py
new file mode 100644
index 0000000..6c357ae
--- /dev/null
+++ b/tests/test_scrub.py
@@ -0,0 +1,15 @@
+
+import pytest
+
+from fatcat_scholar.schema import *
+
+
+def test_scrub():
+    vectors = [
+        ('“Please clean this piece… of text</b>„', '"Please clean this piece... of text"'),
+        ("<jats:p>blah", "blah"),
+    ]
+
+    for raw, fixed in vectors:
+        assert fixed == scrub_text(raw)
+
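For readers without the rest of the commit at hand, here is a rough, stdlib-only sketch of the kind of scrubbing the two test vectors exercise: dropping stray markup tags and folding typographic quotes and ellipses to ASCII. The name `scrub_text_sketch`, the regex, and the punctuation table are all assumptions made for illustration. This is not the `scrub_text` from `fatcat_scholar.schema`, and not textpipe's `clean_text`, either of which may behave differently on other inputs.

```python
import html
import re
import unicodedata

# Hypothetical stand-in, NOT the fatcat_scholar implementation.
# Matches simple opening/closing tags such as </b> or <jats:p>.
_TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")

# Assumed mapping of typographic punctuation to ASCII equivalents.
_PUNCT_MAP = str.maketrans({
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u201e": '"',    # double low-9 quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2026": "...",  # horizontal ellipsis
})


def scrub_text_sketch(raw: str) -> str:
    """Strip tags, decode entities, fold punctuation, collapse whitespace."""
    text = _TAG_RE.sub("", raw)                # remove markup tags first
    text = html.unescape(text)                 # then decode entities like &amp;
    text = unicodedata.normalize("NFC", text)  # compose combining sequences
    text = text.translate(_PUNCT_MAP)          # fold quotes and ellipsis
    return " ".join(text.split())              # collapse runs of whitespace


if __name__ == "__main__":
    print(scrub_text_sketch("<jats:p>blah"))
```

Tags are removed before entities are decoded so that an escaped `&lt;b&gt;` in the input survives as literal text instead of being mistaken for markup.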