path: root/python/sandcrawler/db.py
author: Bryan Newbold <bnewbold@archive.org> 2020-01-28 19:06:25 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2020-01-28 19:10:40 -0800
commit: 480dc3fa20102ba0a15013954e76d3b1f826026c
tree: 8000344605d23ffbec4a57e518c052f5b77e304b /python/sandcrawler/db.py
parent: 9f53880c746b9fd84261e3ab7dbbee81501df394
grobid persist: if status_code is not set, default to 0 [bnewbold-persist-grobid-errors]
We have to set something currently because of a NOT NULL constraint on the table.

Originally I thought we would just not record rows if there was an error, and that is still sort of a valid stance. However, when doing bulk GROBID-ing from cdx table, there exist some "bad" CDX rows which cause wayback or petabox errors. We should fix bugs or delete these rows as a cleanup, but until that happens we should record the error state so we don't loop forever.

One danger of this commit is that we can clobber existing good rows with new errors rapidly if there is wayback downtime or something like that.
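
The two constraints described above (a NOT NULL column, and error rows overwriting good ones) can be reproduced in a small sketch. This is not the sandcrawler schema or its SQL: the table layout, the `sha1hex` column name, and the overwrite-on-conflict upsert are all assumptions made for illustration, and SQLite (3.24 or newer, for `ON CONFLICT ... DO UPDATE`) stands in for Postgres so the example runs without a server.

```python
import sqlite3

# In-memory stand-in for the real table; the schema here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE grobid (
        sha1hex TEXT PRIMARY KEY,
        status_code INTEGER NOT NULL,
        status TEXT NOT NULL
    )
""")

# Assumed upsert shape: a conflicting key overwrites the existing row.
UPSERT = """
    INSERT INTO grobid (sha1hex, status_code, status)
    VALUES (?, ?, ?)
    ON CONFLICT (sha1hex) DO UPDATE SET
        status_code = excluded.status_code,
        status = excluded.status
"""

def fetch(key):
    """Return (status_code, status) for one key, or None if absent."""
    cur = conn.execute(
        "SELECT status_code, status FROM grobid WHERE sha1hex = ?", (key,))
    return cur.fetchone()

# A good row is recorded first.
conn.execute(UPSERT, ("abc", 200, "success"))
good = fetch("abc")

# A later error for the same key, with status_code defaulted to 0,
# replaces the good row: this is the clobbering risk.
conn.execute(UPSERT, ("abc", 0, "error-wayback"))
clobbered = fetch("abc")

# Without a default, the NOT NULL constraint rejects the error row outright.
try:
    conn.execute(UPSERT, ("def", None, "error-wayback"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

In this sketch the error row can only be stored once `status_code` has some non-NULL value, and storing it under an overwrite-on-conflict policy discards the earlier success.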
Diffstat (limited to 'python/sandcrawler/db.py')
-rw-r--r-- python/sandcrawler/db.py 3
1 files changed, 2 insertions, 1 deletions
diff --git a/python/sandcrawler/db.py b/python/sandcrawler/db.py
index 3ec325e..5662b32 100644
--- a/python/sandcrawler/db.py
+++ b/python/sandcrawler/db.py
@@ -161,7 +161,8 @@ class SandcrawlerPostgresClient:
             r['metadata'] = json.dumps(r['metadata'], sort_keys=True)
         batch = [(d['key'],
                   d.get('grobid_version') or None,
-                  d['status_code'],
+                  # status_code is validly not set if there was, eg, error-wayback in grobid-worker
+                  d.get('status_code') or 0,
                   d['status'],
                   d.get('fatcat_release') or None,
                   d.get('updated') or datetime.datetime.now(),
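
A minimal sketch of what the changed line does, using only the tuple fields visible in the hunk above. The helper name `grobid_row` and the sample dicts are hypothetical, and the real method builds more fields than are shown here.

```python
import datetime

def grobid_row(d):
    """Build one insert tuple from a result dict (hypothetical helper)."""
    return (
        d['key'],
        d.get('grobid_version') or None,
        # Missing, None, or otherwise falsy status_code becomes 0,
        # which satisfies a NOT NULL column.
        d.get('status_code') or 0,
        d['status'],
        d.get('fatcat_release') or None,
        d.get('updated') or datetime.datetime.now(),
    )

# A normal result keeps its status code.
ok = grobid_row({'key': 'abc', 'status_code': 200, 'status': 'success'})

# An error result with no status_code no longer raises KeyError.
err = grobid_row({'key': 'def', 'status': 'error-wayback'})

# An explicit None is also mapped to 0.
none_code = grobid_row({'key': 'ghi', 'status_code': None, 'status': 'error'})
```

Because `or` tests truthiness, this form treats a missing key, `None`, and `0` identically, so 0 in the column means "no status code recorded" and nothing more specific.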