author | Bryan Newbold <bnewbold@archive.org> | 2020-01-28 19:06:25 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-01-28 19:10:40 -0800 |
commit | 480dc3fa20102ba0a15013954e76d3b1f826026c | |
tree | 8000344605d23ffbec4a57e518c052f5b77e304b /mapreduce/extraction_cdx_grobid.py | |
parent | 9f53880c746b9fd84261e3ab7dbbee81501df394 | |
grobid persist: if status_code is not set, default to 0 (bnewbold-persist-grobid-errors)
We have to set something currently because of a NOT NULL constraint on
the table.

Originally I thought we would just not record rows if there was an
error, and that is still sort of a valid stance. However, when doing
bulk GROBID-ing from cdx table, there exist some "bad" CDX rows which
cause wayback or petabox errors. We should fix bugs or delete these rows
as a cleanup, but until that happens we should record the error state so
we don't loop forever.

One danger of this commit is that we can clobber existing good rows with
new errors rapidly if there is wayback downtime or something like that.
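
For illustration only, the defaulting described above could look roughly like the sketch below. The diff itself is not shown on this page, so this is not the actual change: the function name `default_status_code` and the row fields other than `status_code` (`sha1hex`, `status`) are assumptions made up for the example.

```python
from typing import Any, Dict, List


def default_status_code(batch: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the rows with a missing or None status_code set to 0.

    Hypothetical helper: it only mirrors the idea in the commit message,
    namely that a NOT NULL column needs some value even for error rows.
    """
    fixed = []
    for row in batch:
        row = dict(row)  # copy, so the caller's dicts are left untouched
        if row.get("status_code") is None:
            row["status_code"] = 0
        fixed.append(row)
    return fixed


# Example rows; the field names besides status_code are invented.
rows = [
    {"sha1hex": "aaaa", "status": "success", "status_code": 200},
    {"sha1hex": "bbbb", "status": "error-wayback"},
    {"sha1hex": "cccc", "status": "error-petabox", "status_code": None},
]
persistable = default_status_code(rows)
```

With this shape, error rows still get persisted, so a bulk job that selects unprocessed CDX rows would not pick the same "bad" rows again. It does nothing about the clobbering risk noted above.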
Diffstat (limited to 'mapreduce/extraction_cdx_grobid.py')
0 files changed, 0 insertions, 0 deletions