Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | differentiate wayback-error from wayback-content-error | Bryan Newbold | 2020-10-21 | 1 | -1/+0 |
| | The motivation here is to distinguish errors due to current content in wayback (eg, in WARCs) from operational errors (eg, wayback machine is down, or network failures/disruption). | ||||
* | workers: refactor to pass key to process() | Bryan Newbold | 2020-06-17 | 1 | -2/+2 |
| | |||||
* | refactor worker fetch code into wrapper class | Bryan Newbold | 2020-06-16 | 1 | -60/+9 |
| | |||||
* | timeout message implementation for GROBID and ingest workers | Bryan Newbold | 2020-04-27 | 1 | -0/+9 |
| | |||||
* | grobid petabox: fix fetch body/content | Bryan Newbold | 2020-02-03 | 1 | -1/+1 |
| | |||||
* | grobid worker: catch PetaboxError also | Bryan Newbold | 2020-01-28 | 1 | -2/+2 |
| | |||||
* | grobid worker: always set a key in response | Bryan Newbold | 2020-01-28 | 1 | -4/+25 |
| | We have key-based compaction enabled for the GROBID output topic. This means it is an error to publish to that topic without a key set. Hopefully this change will end these errors, which look like: KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"} | ||||
* | grobid: fix error_msg typo; set status_code for timeouts | Bryan Newbold | 2020-01-21 | 1 | -1/+2 |
| | |||||
* | add 200 second timeout to GROBID requests | Bryan Newbold | 2020-01-17 | 1 | -8/+15 |
| | |||||
* | grobid worker fixes for newer ia lib refactors | Bryan Newbold | 2020-01-14 | 1 | -3/+9 |
| | |||||
* | fix grobid tests for new wayback refactors | Bryan Newbold | 2020-01-09 | 1 | -3/+3 |
| | |||||
* | be more parsimonious with GROBID metadata | Bryan Newbold | 2020-01-02 | 1 | -2/+4 |
| | Because these are getting persisted in the database (as well as Kafka), don't write out empty keys. | ||||
* | fixes for large GROBID result skip | Bryan Newbold | 2019-12-02 | 1 | -2/+2 |
| | |||||
* | count empty blobs as 'failed' instead of crashing | Bryan Newbold | 2019-12-01 | 1 | -1/+2 |
| | Might be better to record an artificial Kafka response instead? | ||||
* | cleanup unused import | Bryan Newbold | 2019-12-01 | 1 | -1/+0 |
| | |||||
* | filter out very large GROBID XML bodies | Bryan Newbold | 2019-12-01 | 1 | -0/+6 |
| | This is to prevent Kafka MSG_SIZE_TOO_LARGE publish errors. We should probably bump this in the future. Open problems: hand-coding this size number isn't good, and it needs to be updated in two places. Shouldn't filter out for non-Kafka sinks. A corner case might still exist where the JSON-encoded XML is larger than the XML character string, due to encoding (eg, for unicode characters). | ||||
* | much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+14 |
| | |||||
* | we do actually want consolidateHeader=2, not 1 | Bryan Newbold | 2019-10-04 | 1 | -3/+3 |
| | |||||
* | grobid: consolidateHeaders typo | Bryan Newbold | 2019-10-04 | 1 | -1/+1 |
| | |||||
* | disable citation consolidation by default | Bryan Newbold | 2019-10-04 | 1 | -1/+1 |
| | With this consolidation enabled, the glutton_fatcat elasticsearch server was totally pegged over 90% CPU with only 10 PDF worker threads; the glutton load seemed to be the bottleneck even for this low degree of parallelism. Disabled for now, will debug with GROBID/glutton folks. | ||||
* | fix GROBID POST flags | Bryan Newbold | 2019-10-04 | 1 | -1/+3 |
| | |||||
* | handle GROBID fetch empty blob condition | Bryan Newbold | 2019-10-03 | 1 | -1/+2 |
| | |||||
* | have grobidworker error status indicate issues instead of bailing | Bryan Newbold | 2019-10-02 | 1 | -4/+13 |
| | |||||
* | more counts and bugfixes in grobid_tool | Bryan Newbold | 2019-09-26 | 1 | -4/+0 |
| | |||||
* | small improvements to GROBID tool | Bryan Newbold | 2019-09-26 | 1 | -0/+4 |
| | |||||
* | lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -3/+63 |
| | |||||
* | start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+44 |
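
The 2019-12-01 commit "filter out very large GROBID XML bodies" notes a corner case: the JSON-encoded XML can be larger than the XML character string. The sketch below illustrates that effect only; it is not the worker's actual code. The `tei_xml` field name, the `MAX_BODY_BYTES` value, and both helper functions are hypothetical, chosen for illustration.

```python
import json

# Hypothetical limit; the real threshold used by the worker is not shown in this log.
MAX_BODY_BYTES = 1_000_000


def encoded_size(tei_xml: str) -> int:
    """Return the size in bytes of the XML once wrapped in a JSON object.

    json.dumps() escapes non-ASCII characters by default (ensure_ascii=True),
    so a single character such as "é" becomes the six bytes "\\u00e9".
    """
    return len(json.dumps({"tei_xml": tei_xml}).encode("utf-8"))


def fits(tei_xml: str, limit: int = MAX_BODY_BYTES) -> bool:
    """Check the encoded message size rather than the character count."""
    return encoded_size(tei_xml) <= limit


if __name__ == "__main__":
    body = "é" * 100
    print("characters:", len(body))
    print("encoded bytes:", encoded_size(body))
    print("fits under a 100-byte limit:", fits(body, limit=100))
```

Checking the size of the encoded message instead of `len()` of the XML string is one way to close the gap the commit describes, at the cost of serializing the body before deciding whether to publish it.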