path: root/python/sandcrawler/grobid.py
Commit message (Author, Date; Files, Lines)
* grobid: disable biblio-glutton consolidation (Bryan Newbold, 2021-04-07; 1 file, -3/+3)
|
* differentiate wayback-error from wayback-content-error (Bryan Newbold, 2020-10-21; 1 file, -1/+0)
|   The motivation here is to distinguish errors due to current content in wayback (eg, in WARCs) from operational errors (eg, wayback machine is down, or network failures/disruption).
* workers: refactor to pass key to process() (Bryan Newbold, 2020-06-17; 1 file, -2/+2)
|
* refactor worker fetch code into wrapper class (Bryan Newbold, 2020-06-16; 1 file, -60/+9)
|
* timeout message implementation for GROBID and ingest workers (Bryan Newbold, 2020-04-27; 1 file, -0/+9)
|
* grobid petabox: fix fetch body/content (Bryan Newbold, 2020-02-03; 1 file, -1/+1)
|
* grobid worker: catch PetaboxError also (Bryan Newbold, 2020-01-28; 1 file, -2/+2)
|
* grobid worker: always set a key in response (Bryan Newbold, 2020-01-28; 1 file, -4/+25)
|   We have key-based compaction enabled for the GROBID output topic. This means it is an error to publish to that topic without a key set.
|
|   Hopefully this change will end these errors, which look like:
|
|   KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"}
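The key requirement described in this commit can be sketched as a small guard applied before publishing. This is a minimal illustration, not the worker's actual code: the helper name `ensure_key`, the `key` field name, and the fallback argument are all assumptions.

```python
def ensure_key(result: dict, fallback_key: str) -> dict:
    """Return a copy of result that always carries a non-empty 'key'.

    Hypothetical helper: the 'key' field name and the fallback are assumptions.
    """
    if not fallback_key:
        raise ValueError("fallback_key must be non-empty")
    out = dict(result)  # copy so the caller's dict is not mutated
    if not out.get("key"):
        out["key"] = fallback_key
    return out


# An error result built without a key still gets one before publishing.
error_result = ensure_key({"status": "error"}, fallback_key="example-key")
```

Applying the guard to every result, including error results, means no code path can reach the compacted topic without a key.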
* grobid: fix error_msg typo; set status_code for timeouts (Bryan Newbold, 2020-01-21; 1 file, -1/+2)
|
* add 200 second timeout to GROBID requests (Bryan Newbold, 2020-01-17; 1 file, -8/+15)
|
* grobid worker fixes for newer ia lib refactors (Bryan Newbold, 2020-01-14; 1 file, -3/+9)
|
* fix grobid tests for new wayback refactors (Bryan Newbold, 2020-01-09; 1 file, -3/+3)
|
* be more parsimonious with GROBID metadata (Bryan Newbold, 2020-01-02; 1 file, -2/+4)
|   Because these are getting persisted in database (as well as kafka), don't write out empty keys.
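As a rough sketch of "don't write out empty keys", the filter below drops entries whose values are empty before a metadata dict is persisted. The function name `drop_empty`, the sample field names, and the exact definition of "empty" are assumptions, not taken from the commit.

```python
def drop_empty(metadata: dict) -> dict:
    """Return a copy of metadata without keys whose values are empty.

    Hypothetical helper: here "empty" means None, "", [] or {}.
    Falsy-but-meaningful values such as 0 or False are kept.
    """
    def is_empty(value) -> bool:
        return value is None or value == "" or value == [] or value == {}

    return {k: v for k, v in metadata.items() if not is_empty(v)}


# Illustrative field names only.
compact = drop_empty({"title": "Example", "doi": None, "authors": [], "year": 0})
```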
* fixes for large GROBID result skip (Bryan Newbold, 2019-12-02; 1 file, -2/+2)
|
* count empty blobs as 'failed' instead of crashing (Bryan Newbold, 2019-12-01; 1 file, -1/+2)
|   Might be better to record an artificial kafka response instead?
* cleanup unused import (Bryan Newbold, 2019-12-01; 1 file, -1/+0)
|
* filter out very large GROBID XML bodies (Bryan Newbold, 2019-12-01; 1 file, -0/+6)
|   This is to prevent Kafka MSG_SIZE_TOO_LARGE publish errors. We should probably bump this in the future.
|
|   Open problems: hand-coding this size number isn't good, need to update in two places. Shouldn't filter out for non-Kafka sinks. A corner case might still exist where the JSON-encoded XML is larger than the XML character string, due to encoding (eg, for unicode characters).
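The encoding corner case mentioned in this commit can be shown with the standard library alone. In this sketch the limit value and the helper names are assumptions (the commit does not state the real number); the only library behavior relied on is that `json.dumps` escapes non-ASCII characters by default.

```python
import json

# Hypothetical limit for illustration; the commit does not state the real number.
MAX_ENCODED_BYTES = 1_000_000


def json_encoded_size(xml_text: str) -> int:
    """Size in bytes of xml_text once serialized as a JSON string."""
    return len(json.dumps(xml_text).encode("utf-8"))


def small_enough(xml_text: str, limit: int = MAX_ENCODED_BYTES) -> bool:
    """Check the size of the encoded form, not the raw string."""
    return json_encoded_size(xml_text) <= limit


sample = "é" * 10
raw_size = len(sample.encode("utf-8"))    # 2 bytes per character as UTF-8
encoded_size = json_encoded_size(sample)  # each "é" becomes the 6-byte escape \u00e9
```

Measuring the encoded form is one way to close the gap between the XML string length and what is actually sent to the sink.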
* much progress on file ingest path (Bryan Newbold, 2019-10-22; 1 file, -0/+14)
|
* we do actually want consolidateHeader=2, not 1 (Bryan Newbold, 2019-10-04; 1 file, -3/+3)
|
* grobid: consolidateHeaders typo (Bryan Newbold, 2019-10-04; 1 file, -1/+1)
|
* disable citation consolidation by default (Bryan Newbold, 2019-10-04; 1 file, -1/+1)
|   With this consolidation enabled, the glutton_fatcat elasticsearch server was totally pegged over 90% CPU with only 10 PDF worker threads; the glutton load seemed to be the bottleneck even for this low degree of parallelism. Disabled for now, will debug with GROBID/glutton folks.
* fix GROBID POST flags (Bryan Newbold, 2019-10-04; 1 file, -1/+3)
|
* handle GROBID fetch empty blob condition (Bryan Newbold, 2019-10-03; 1 file, -1/+2)
|
* have grobidworker error status indicate issues instead of bailing (Bryan Newbold, 2019-10-02; 1 file, -4/+13)
|
* more counts and bugfixes in grobid_tool (Bryan Newbold, 2019-09-26; 1 file, -4/+0)
|
* small improvements to GROBID tool (Bryan Newbold, 2019-09-26; 1 file, -0/+4)
|
* lots of grobid tool implementation (still WIP) (Bryan Newbold, 2019-09-26; 1 file, -3/+63)
|
* start refactoring sandcrawler python common code (Bryan Newbold, 2019-09-23; 1 file, -0/+44)