author | Bryan Newbold <bnewbold@archive.org> | 2020-06-03 19:30:15 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-06-03 19:32:50 -0700 |
commit | f9035c7ca9637668911afa7e9345138563aad33e (patch) | |
tree | f6bd0f817190e315d9e8b0016ab1a7e0d5c73c7f /tests | |
parent | 9722f39e38a45d3201c836f0c2805ae9f6c1f581 (diff) | |
improve text scrubbing

Was going to use textpipe, but dependency was too large and failed to
install with halfway modern GCC (due to CLD2 issue):

https://github.com/GregBowyer/cld2-cffi/issues/12

So instead basically pulled out the clean_text function, which is quite
short.

Diffstat (limited to 'tests')
-rw-r--r-- | tests/test_scrub.py | 15 |
1 files changed, 15 insertions, 0 deletions
diff --git a/tests/test_scrub.py b/tests/test_scrub.py
new file mode 100644
index 0000000..6c357ae
--- /dev/null
+++ b/tests/test_scrub.py
@@ -0,0 +1,15 @@
+
+import pytest
+
+from fatcat_scholar.schema import *
+
+
+def test_scrub():
+    vectors = [
+        ('“Please clean this piece… of text</b>„', '"Please clean this piece... of text"'),
+        ("<jats:p>blah", "blah"),
+    ]
+
+    for raw, fixed in vectors:
+        assert fixed == scrub_text(raw)
+
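
For illustration only: the implementation of scrub_text is not part of this tests-only diff. Below is a minimal standard-library sketch of a function that would satisfy the two test vectors. The name scrub_text_sketch, the tag regex, and the quote table are assumptions made for this example, not the code pulled from textpipe's clean_text.

```python
import html
import re
import unicodedata

# Hypothetical stand-in, NOT the real fatcat_scholar.schema.scrub_text.
# Matches opening and closing markup tags such as </b> or <jats:p>.
_TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")

# Typographic quotes mapped to their plain ASCII equivalents.
_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",
})


def scrub_text_sketch(raw: str) -> str:
    """Strip markup tags and normalize punctuation and whitespace."""
    text = _TAG_RE.sub("", raw)                 # drop stray or unbalanced tags
    text = html.unescape(text)                  # decode entities such as &amp;
    text = unicodedata.normalize("NFKC", text)  # e.g. U+2026 becomes "..."
    text = text.translate(_QUOTES)              # curly quotes to ASCII quotes
    return " ".join(text.split())               # collapse runs of whitespace
```

Tags are removed before entities are decoded so that an escaped literal such as &lt;b&gt; survives as visible text rather than being stripped as markup; a real implementation may choose differently.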