| author | Bryan Newbold <bnewbold@archive.org> | 2019-12-01 15:42:05 -0800 | 
|---|---|---|
| committer | Bryan Newbold <bnewbold@archive.org> | 2019-12-01 15:42:07 -0800 | 
| commit | aa49ab86e4a86067ba2346d8bccf389be940b8e2 (patch) | |
| tree | 19272451a626fdf7cfc8cdbfe581cf7fe3ede05d /python | |
| parent | e28125db2735b53e28ab5148379cb8b804c184c6 (diff) | |
| download | sandcrawler-aa49ab86e4a86067ba2346d8bccf389be940b8e2.tar.gz sandcrawler-aa49ab86e4a86067ba2346d8bccf389be940b8e2.zip  | |
filter out very large GROBID XML bodies
This is to prevent Kafka MSG_SIZE_TOO_LARGE publish errors. We should
probably bump this in the future.
Open problems: hand-coding this size number isn't good, since it needs to be
updated in two places. We also shouldn't filter out for non-Kafka sinks. And
a corner case might still exist where the JSON-encoded XML is larger than the
XML character string itself, due to encoding (eg, escaping of unicode
characters).
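The encoding corner case mentioned above is real: with Python's default `json.dumps(..., ensure_ascii=True)`, each non-ASCII character is escaped to a `\uXXXX` sequence, so the serialized form can be several times longer than the raw string. A small illustration (the strings here are made up for demonstration):

```python
import json

# A unicode-heavy string: 4000 characters of raw text.
raw = "caf\u00e9" * 1000

# With ensure_ascii=True (the default), each "é" becomes the 6-character
# escape "\u00e9", so the JSON form is much longer than the raw string.
encoded = json.dumps(raw)

print(len(raw), len(encoded))
```

So a check against `len(info['tei_xml'])` alone can under-estimate the size of the final JSON-encoded Kafka message.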
Diffstat (limited to 'python')
| -rw-r--r-- | python/sandcrawler/grobid.py | 6 | 
1 file changed, 6 insertions(+), 0 deletions(-)
```diff
diff --git a/python/sandcrawler/grobid.py b/python/sandcrawler/grobid.py
index d83fedc..31dc270 100644
--- a/python/sandcrawler/grobid.py
+++ b/python/sandcrawler/grobid.py
@@ -44,6 +44,12 @@ class GrobidClient(object):
         if grobid_response.status_code == 200:
             info['status'] = 'success'
             info['tei_xml'] = grobid_response.text
+            if len(info['tei_xml']) > 19500000:
+                # XML is larger than Kafka message size, and much larger than
+                # an article in general; bail out
+                info['status'] = 'error'
+                info['error_msg'] = "response XML too large: {} bytes".format(len(len(info['tei_xml'])))
+                info.pop('tei_xml')
         else:
             # response.text is .content decoded as utf-8
             info['status'] = 'error'
```
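The commit message already flags that the hard-coded threshold should live in one place. A minimal sketch of that refactor (the constant and helper names below are hypothetical, not from the sandcrawler codebase). Note also that the diff as committed contains a bug: `len(len(info['tei_xml']))` calls `len()` on an integer, which would raise a `TypeError` while formatting the error message; the sketch computes the length once instead:

```python
# Hypothetical single source of truth for the size limit, kept below
# Kafka's default maximum message size to avoid MSG_SIZE_TOO_LARGE.
MAX_TEI_XML_BYTES = 19500000


def check_tei_size(info):
    """If the TEI-XML body exceeds the limit, mark the record as an
    error and drop the body (mirroring the logic in the diff above)."""
    tei_xml = info.get('tei_xml')
    if tei_xml is not None and len(tei_xml) > MAX_TEI_XML_BYTES:
        info['status'] = 'error'
        info['error_msg'] = "response XML too large: {} bytes".format(len(tei_xml))
        info.pop('tei_xml')
    return info
```

Both the GROBID client and the Kafka sink could then import the constant, so the number only needs updating in one place.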
