path: root/python/title_slug_blacklist.txt
author Bryan Newbold <bnewbold@archive.org> 2019-12-01 15:42:05 -0800
committer Bryan Newbold <bnewbold@archive.org> 2019-12-01 15:42:07 -0800
commit aa49ab86e4a86067ba2346d8bccf389be940b8e2
tree 19272451a626fdf7cfc8cdbfe581cf7fe3ede05d /python/title_slug_blacklist.txt
parent e28125db2735b53e28ab5148379cb8b804c184c6
filter out very large GROBID XML bodies
This is to prevent Kafka MSG_SIZE_TOO_LARGE publish errors. We should probably bump the size limit in the future.

Open problems:

- The size limit is hand-coded and has to be updated in two places.
- Bodies shouldn't be filtered out for non-Kafka sinks.
- There might still be a corner case where the JSON-encoded XML is larger than the XML character string, due to encoding (eg, for unicode characters).
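A minimal sketch of how the open problems might be addressed, not the code in this commit: keep the limit in one constant, skip the filter for non-Kafka sinks, and compare the size of the JSON-encoded bytes rather than the XML character count. The names `MAX_MSG_BYTES`, `encoded_size`, `should_publish`, and the `tei_xml` field are hypothetical, and the limit value is a placeholder because the commit message does not state the real number.

```python
import json

# Hypothetical placeholder: the commit message does not give the actual limit.
# Defining it once avoids having to update the number in two places.
MAX_MSG_BYTES = 1_000_000


def encoded_size(record: dict) -> int:
    """Size in bytes of the record once serialized to JSON and UTF-8 encoded."""
    return len(json.dumps(record).encode("utf-8"))


def should_publish(record: dict, sink_is_kafka: bool, limit: int = MAX_MSG_BYTES) -> bool:
    """Apply the size filter only for Kafka sinks, using the encoded size."""
    if not sink_is_kafka:
        # Other sinks have no Kafka message size limit, so nothing is filtered.
        return True
    return encoded_size(record) <= limit


# Non-ASCII characters grow when JSON-encoded: json.dumps escapes them
# by default, so the encoded form is longer than the character string.
xml = "<p>" + "é" * 10 + "</p>"
record = {"tei_xml": xml}  # hypothetical field name
print(len(xml), encoded_size(record))
```

Measuring the serialized bytes covers the unicode corner case, since the comparison is made on what would actually be sent rather than on the length of the XML string.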
Diffstat (limited to 'python/title_slug_blacklist.txt')
0 files changed, 0 insertions, 0 deletions