Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | grobid persist: if status_code is not set, default to 0 [bnewbold-persist-grobid-errors] | Bryan Newbold | 2020-01-28 | 1 | -1/+2 |
| | We have to set something currently because of a NOT NULL constraint on the table. Originally I thought we would just not record rows if there was an error, and that is still sort of a valid stance. However, when doing bulk GROBID-ing from cdx table, there exist some "bad" CDX rows which cause wayback or petabox errors. We should fix bugs or delete these rows as a cleanup, but until that happens we should record the error state so we don't loop forever. One danger of this commit is that we can clobber existing good rows with new errors rapidly if there is wayback downtime or something like that. | ||||
* | persist grobid: actually, status_code is required | Bryan Newbold | 2020-01-21 | 1 | -1/+1 |
| | Instead of working around when missing, force it to exist but skip in database insert section. Disk mode still needs to check if blank. | ||||
* | persist: work around GROBID timeouts with no status_code | Bryan Newbold | 2020-01-21 | 1 | -1/+1 |
* | persist: fix dupe field copying | Bryan Newbold | 2020-01-15 | 1 | -1/+8 |
| | In testing hit: AttributeError: 'str' object has no attribute 'get' | ||||
* | persist worker: implement updated ingest result semantics | Bryan Newbold | 2020-01-15 | 1 | -1/+1 |
* | small fixups to SandcrawlerPostgrestClient | Bryan Newbold | 2020-01-14 | 1 | -1/+10 |
* | db: move duplicate row filtering into DB insert helpers | Bryan Newbold | 2020-01-02 | 1 | -0/+25 |
* | fix DB import counting | Bryan Newbold | 2020-01-02 | 1 | -4/+5 |
* | fix small errors found by pylint | Bryan Newbold | 2020-01-02 | 1 | -1/+1 |
* | db: fancy insert/update separation using postgres xmax | Bryan Newbold | 2020-01-02 | 1 | -15/+30 |
* | improve DB helpers | Bryan Newbold | 2020-01-02 | 1 | -26/+81 |
| | Return insert/update row counts; implement ON CONFLICT ... DO UPDATE on some tables | ||||
* | start work on DB connector and minio client | Bryan Newbold | 2020-01-02 | 1 | -0/+141 |
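
The persist and DB commits above mention three behaviors: defaulting a missing `status_code` to 0 to satisfy a NOT NULL constraint, filtering duplicate rows before a batch insert, and separating inserts from updates using Postgres `xmax`. The following is a minimal sketch of those ideas, not the repository's code. The function names, the `sha1hex` key, the `grobid` table name, and the psycopg2-style placeholders are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical upsert statement; table and column names are illustrative only.
# A freshly inserted row has xmax = 0, so `(xmax = 0)` is true for inserts
# and false for rows rewritten by the DO UPDATE branch.
UPSERT_SQL = """
INSERT INTO grobid (sha1hex, status_code)
VALUES (%(sha1hex)s, %(status_code)s)
ON CONFLICT (sha1hex) DO UPDATE
    SET status_code = EXCLUDED.status_code
RETURNING (xmax = 0) AS inserted;
"""


def normalize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the row and default a missing status_code to 0 (NOT NULL column)."""
    out = dict(row)
    if out.get("status_code") is None:
        out["status_code"] = 0
    return out


def dedupe_rows(rows: Iterable[Dict[str, Any]], key: str = "sha1hex") -> List[Dict[str, Any]]:
    """Keep the last row seen for each key, in first-seen key order."""
    by_key: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        by_key[row[key]] = row
    return list(by_key.values())


def count_inserts_updates(returned: Iterable[bool]) -> Tuple[int, int]:
    """Turn the `inserted` flags from RETURNING into (inserts, updates)."""
    flags = list(returned)
    inserts = sum(1 for flag in flags if flag)
    return inserts, len(flags) - inserts
```

One reason to deduplicate before the insert is that Postgres rejects an `ON CONFLICT ... DO UPDATE` statement that would touch the same row twice. As the 2020-01-28 commit message warns, writing error rows this way can overwrite existing good rows, so a real implementation may want to guard the update branch.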