| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) | Bryan Newbold | 2022-05-16 | 1 | -0/+9 |
| | |||||
* | grobid: set a maximum file size (256 MByte) | Bryan Newbold | 2021-12-07 | 1 | -0/+8 |
| | |||||
* | make fmt | Bryan Newbold | 2021-11-16 | 1 | -1/+1 |
| | |||||
* | grobid: handle XML parsing errors, and have them recorded in sandcrawler-db | Bryan Newbold | 2021-11-12 | 1 | -1/+5 |
| | |||||
* | grobid: extract more metadata in document TEI-XML | Bryan Newbold | 2021-11-10 | 1 | -0/+5 |
| | |||||
* | grobid: update 'TODO' comment based on review | Bryan Newbold | 2021-11-04 | 1 | -3/+0 |
| | |||||
* | crossref grobid refs: another error case (ReadTimeout) | Bryan Newbold | 2021-11-04 | 1 | -4/+6 |
| | With this last exception handled, was able to get through millions of rows of references, with only a few dozen errors (mostly invalid XML). | | | | |
* | grobid: use requests session | Bryan Newbold | 2021-11-04 | 1 | -3/+4 |
| | This should fix an embarrassing bug with exhausting local ports: `requests.exceptions.ConnectionError: HTTPConnectionPool(host='wbgrp-svc096.us.archive.org', port=8070): Max retries exceeded with url: /api/processCitationList (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8dfc24e250>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))` | | | | |
* | grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 1 | -4/+24 |
| | |||||
* | grobid: handle weird whitespace unstructured from crossref | Bryan Newbold | 2021-11-04 | 1 | -1/+10 |
| | See also: https://github.com/kermitt2/grobid/issues/849 | | | | |
* | iterated GROBID citation cleaning and processing | Bryan Newbold | 2021-11-04 | 1 | -27/+45 |
| | Switched to using just 'key'/'id' for downstream matching. | | | | |
* | grobid citations: first pass at cleaning unstructured | Bryan Newbold | 2021-11-04 | 1 | -2/+34 |
| | |||||
* | initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 1 | -4/+121 |
| | |||||
* | remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 1 | -3/+4 |
| | |||||
* | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -50/+55 |
| | |||||
* | fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 1 | -1/+2 |
| | |||||
* | more progress on type annotations | Bryan Newbold | 2021-10-26 | 1 | -1/+3 |
| | |||||
* | grobid: fix a bug with consolidate_mode header, exposed by type annotations | Bryan Newbold | 2021-10-26 | 1 | -1/+2 |
| | |||||
* | grobid: type annotations | Bryan Newbold | 2021-10-26 | 1 | -9/+19 |
| | |||||
* | start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -3/+1 |
| | |||||
* | make fmt | Bryan Newbold | 2021-10-26 | 1 | -13/+17 |
| | |||||
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -1/+3 |
| | |||||
* | grobid: disable biblio-glutton consolidation | Bryan Newbold | 2021-04-07 | 1 | -3/+3 |
| | |||||
* | differentiate wayback-error from wayback-content-error | Bryan Newbold | 2020-10-21 | 1 | -1/+0 |
| | The motivation here is to distinguish errors due to current content in wayback (eg, in WARCs) from operational errors (eg, wayback machine is down, or network failures/disruption). | | | | |
* | workers: refactor to pass key to process() | Bryan Newbold | 2020-06-17 | 1 | -2/+2 |
| | |||||
* | refactor worker fetch code into wrapper class | Bryan Newbold | 2020-06-16 | 1 | -60/+9 |
| | |||||
* | timeout message implementation for GROBID and ingest workers | Bryan Newbold | 2020-04-27 | 1 | -0/+9 |
| | |||||
* | grobid petabox: fix fetch body/content | Bryan Newbold | 2020-02-03 | 1 | -1/+1 |
| | |||||
* | grobid worker: catch PetaboxError also | Bryan Newbold | 2020-01-28 | 1 | -2/+2 |
| | |||||
* | grobid worker: always set a key in response | Bryan Newbold | 2020-01-28 | 1 | -4/+25 |
| | We have key-based compaction enabled for the GROBID output topic. This means it is an error to publish to that topic without a key set. Hopefully this change will end these errors, which look like: `KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"}` | | | | |
* | grobid: fix error_msg typo; set status_code for timeouts | Bryan Newbold | 2020-01-21 | 1 | -1/+2 |
| | |||||
* | add 200 second timeout to GROBID requests | Bryan Newbold | 2020-01-17 | 1 | -8/+15 |
| | |||||
* | grobid worker fixes for newer ia lib refactors | Bryan Newbold | 2020-01-14 | 1 | -3/+9 |
| | |||||
* | fix grobid tests for new wayback refactors | Bryan Newbold | 2020-01-09 | 1 | -3/+3 |
| | |||||
* | be more parsimonious with GROBID metadata | Bryan Newbold | 2020-01-02 | 1 | -2/+4 |
| | Because these are getting persisted in database (as well as kafka), don't write out empty keys. | | | | |
* | fixes for large GROBID result skip | Bryan Newbold | 2019-12-02 | 1 | -2/+2 |
| | |||||
* | count empty blobs as 'failed' instead of crashing | Bryan Newbold | 2019-12-01 | 1 | -1/+2 |
| | Might be better to record an artificial kafka response instead? | | | | |
* | cleanup unused import | Bryan Newbold | 2019-12-01 | 1 | -1/+0 |
| | |||||
* | filter out very large GROBID XML bodies | Bryan Newbold | 2019-12-01 | 1 | -0/+6 |
| | This is to prevent Kafka MSG_SIZE_TOO_LARGE publish errors. We should probably bump this in the future. Open problems: hand-coding this size number isn't good, and it needs to be updated in two places. Shouldn't filter out for non-Kafka sinks. A corner case might still exist where the JSON-encoded XML is larger than the XML character string, due to encoding (eg, for unicode characters). | | | | |
* | much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+14 |
| | |||||
* | we do actually want consolidateHeader=2, not 1 | Bryan Newbold | 2019-10-04 | 1 | -3/+3 |
| | |||||
* | grobid: consolidateHeaders typo | Bryan Newbold | 2019-10-04 | 1 | -1/+1 |
| | |||||
* | disable citation consolidation by default | Bryan Newbold | 2019-10-04 | 1 | -1/+1 |
| | With this consolidation enabled, the glutton_fatcat elasticsearch server was totally pegged over 90% CPU with only 10 PDF worker threads; the glutton load seemed to be the bottleneck even for this low degree of parallelism. Disabled for now, will debug with GROBID/glutton folks. | | | | |
* | fix GROBID POST flags | Bryan Newbold | 2019-10-04 | 1 | -1/+3 |
| | |||||
* | handle GROBID fetch empty blob condition | Bryan Newbold | 2019-10-03 | 1 | -1/+2 |
| | |||||
* | have grobidworker error status indicate issues instead of bailing | Bryan Newbold | 2019-10-02 | 1 | -4/+13 |
| | |||||
* | more counts and bugfixes in grobid_tool | Bryan Newbold | 2019-09-26 | 1 | -4/+0 |
| | |||||
* | small improvements to GROBID tool | Bryan Newbold | 2019-09-26 | 1 | -0/+4 |
| | |||||
* | lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -3/+63 |
| | |||||
* | start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+44 |
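
The 2019-12-01 commit "filter out very large GROBID XML bodies" notes that the JSON-encoded record can be larger than the XML character string. The following is a minimal sketch of that corner case, not the sandcrawler implementation: the field names (`key`, `status`, `error_msg`, `tei_xml`), the helper names, and the byte limit are all assumptions made for illustration, since the log does not state them. It measures the encoded size of the whole record rather than the length of the XML string, and keeps the key on the filtered record so that a keyed, compacted topic would still accept it.

```python
import json
from typing import Any, Dict

# Hypothetical limit for illustration; the commit log does not give the real number.
MAX_MSG_BYTES = 1_000_000


def encoded_size(record: Dict[str, Any]) -> int:
    """Size in bytes of the record as it would be serialized to JSON."""
    # json.dumps() escapes non-ASCII characters by default (ensure_ascii=True),
    # so "é" becomes the six-character sequence \u00e9.
    return len(json.dumps(record).encode("utf-8"))


def filter_large(record: Dict[str, Any], max_bytes: int = MAX_MSG_BYTES) -> Dict[str, Any]:
    """Return the record unchanged if it fits, else a copy without the XML body."""
    if encoded_size(record) <= max_bytes:
        return record
    # Drop only the large body; keep the key and other small fields.
    slim = {k: v for k, v in record.items() if k != "tei_xml"}
    slim["status"] = "error"
    slim["error_msg"] = "response XML too large"
    return slim


if __name__ == "__main__":
    xml = "é" * 100
    record = {"key": "abc123", "status": "success", "tei_xml": xml}
    print("characters:", len(xml))
    print("utf-8 bytes:", len(xml.encode("utf-8")))
    print("json bytes:", encoded_size(record))
    print(filter_large(record, max_bytes=300))
```

Checking the serialized size in one helper also avoids hand-coding the comparison in two places, which the commit message lists as an open problem.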