path: root/python/sandcrawler/grobid.py
Commit message [Author, Date, Files, Lines -/+]
* ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) [Bryan Newbold, 2022-05-16, 1 file, -0/+9]
* grobid: set a maximum file size (256 MByte) [Bryan Newbold, 2021-12-07, 1 file, -0/+8]
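  A minimal sketch of the kind of size guard this commit describes; the constant name and result fields are illustrative, not the actual code:

    from typing import Optional

    # hypothetical limit matching the commit message
    MAX_GROBID_BLOB_SIZE: int = 256 * 1024 * 1024  # 256 MByte

    def check_blob_size(blob: bytes) -> Optional[dict]:
        """Return an error-result dict if the blob is too large for GROBID."""
        if len(blob) > MAX_GROBID_BLOB_SIZE:
            return {
                "status": "blob-too-large",
                "error_msg": f"file size {len(blob)} exceeds {MAX_GROBID_BLOB_SIZE} bytes",
            }
        return None  # acceptable size; proceed with processing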
* make fmt [Bryan Newbold, 2021-11-16, 1 file, -1/+1]
* grobid: handle XML parsing errors, and have them recorded in sandcrawler-db [Bryan Newbold, 2021-11-12, 1 file, -1/+5]
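  A sketch of the pattern using the standard-library parser; the real code parses TEI via the grobid_tei_xml library, and the status string here is an assumption:

    import xml.etree.ElementTree as ET

    def parse_tei_safely(tei_xml: str) -> dict:
        # turn malformed XML into a recordable status instead of a crash,
        # so the failure can land in sandcrawler-db for later inspection
        try:
            root = ET.fromstring(tei_xml)
        except ET.ParseError as pe:
            return {"status": "bad-grobid-xml", "error_msg": str(pe)[:1000]}
        return {"status": "success", "root_tag": root.tag}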
* grobid: extract more metadata in document TEI-XML [Bryan Newbold, 2021-11-10, 1 file, -0/+5]
* grobid: update 'TODO' comment based on review [Bryan Newbold, 2021-11-04, 1 file, -3/+0]
* crossref grobid refs: another error case (ReadTimeout) [Bryan Newbold, 2021-11-04, 1 file, -4/+6]
  With this last exception handled, was able to get through millions of rows of references with only a few dozen errors (mostly invalid XML).
* grobid: use requests session [Bryan Newbold, 2021-11-04, 1 file, -3/+4]
  This should fix an embarrassing bug with exhausting local ports:
      requests.exceptions.ConnectionError: HTTPConnectionPool(host='wbgrp-svc096.us.archive.org', port=8070): Max retries exceeded with url: /api/processCitationList (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8dfc24e250>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
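  The fix is the standard session-reuse pattern; a sketch with illustrative class and parameter names (only the endpoint path comes from the traceback above):

    import requests

    class GrobidClient:
        def __init__(self, host_url: str = "http://localhost:8070"):
            self.host_url = host_url
            # one pooled Session reuses TCP connections (keep-alive), instead
            # of opening a new socket per requests.post() call and eventually
            # exhausting local ephemeral ports under sustained load
            self.session = requests.Session()

        def process_citation_list(self, citations: list) -> requests.Response:
            return self.session.post(
                self.host_url + "/api/processCitationList",
                data={"citations": citations},
                timeout=60.0,  # illustrative
            )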
* grobid crossref refs: try to handle HTTP 5xx and XML parse errors [Bryan Newbold, 2021-11-04, 1 file, -4/+24]
* grobid: handle weird whitespace in 'unstructured' from crossref [Bryan Newbold, 2021-11-04, 1 file, -1/+10]
  See also: https://github.com/kermitt2/grobid/issues/849
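  A minimal sketch of the whitespace normalization this commit implies; the actual cleaning rules in grobid.py may be more involved:

    import re

    def clean_unstructured(raw: str) -> str:
        # collapse newlines, tabs, and runs of spaces into single spaces
        # before sending the citation string to GROBID
        return re.sub(r"\s+", " ", raw).strip()

    assert clean_unstructured("Doe, J.\n  (2001).\tTitle.") == "Doe, J. (2001). Title."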
* iterated GROBID citation cleaning and processing [Bryan Newbold, 2021-11-04, 1 file, -27/+45]
  Switched to using just 'key'/'id' for downstream matching.
* grobid citations: first pass at cleaning unstructured [Bryan Newbold, 2021-11-04, 1 file, -2/+34]
* initial crossref-refs via GROBID helper routine [Bryan Newbold, 2021-11-04, 1 file, -4/+121]
* remove grobid2json helper file, replace with grobid_tei_xml [Bryan Newbold, 2021-10-27, 1 file, -3/+4]
* make fmt (black 21.9b0) [Bryan Newbold, 2021-10-27, 1 file, -50/+55]
* fix type annotations for petabox body fetch helper [Bryan Newbold, 2021-10-26, 1 file, -1/+2]
* more progress on type annotations [Bryan Newbold, 2021-10-26, 1 file, -1/+3]
* grobid: fix a bug with consolidate_mode header, exposed by type annotations [Bryan Newbold, 2021-10-26, 1 file, -1/+2]
* grobid: type annotations [Bryan Newbold, 2021-10-26, 1 file, -9/+19]
* start handling trivial lint cleanups: unused imports, 'is None', etc [Bryan Newbold, 2021-10-26, 1 file, -3/+1]
* make fmt [Bryan Newbold, 2021-10-26, 1 file, -13/+17]
* python: isort all imports [Bryan Newbold, 2021-10-26, 1 file, -1/+3]
* grobid: disable biblio-glutton consolidation [Bryan Newbold, 2021-04-07, 1 file, -3/+3]
* differentiate wayback-error from wayback-content-error [Bryan Newbold, 2020-10-21, 1 file, -1/+0]
  The motivation here is to distinguish errors due to the content stored in wayback (eg, in WARCs) from operational errors (eg, wayback machine is down, or network failures/disruption).
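  A sketch of the distinction; the exception types live in sandcrawler's ia module, and this classifier function is purely illustrative:

    class WaybackError(Exception):
        """Operational error: wayback down, network failure/disruption."""

    class WaybackContentError(Exception):
        """Fetch worked, but the content stored in wayback/WARCs is bad."""

    def classify_fetch(status_code: int, body: bytes) -> bytes:
        if status_code != 200:
            raise WaybackError(f"wayback replay returned HTTP {status_code}")
        if not body:
            raise WaybackContentError("empty body stored in WARC")
        return body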
* workers: refactor to pass key to process() [Bryan Newbold, 2020-06-17, 1 file, -2/+2]
* refactor worker fetch code into wrapper class [Bryan Newbold, 2020-06-16, 1 file, -60/+9]
* timeout message implementation for GROBID and ingest workers [Bryan Newbold, 2020-04-27, 1 file, -0/+9]
* grobid petabox: fix fetch body/content [Bryan Newbold, 2020-02-03, 1 file, -1/+1]
* grobid worker: catch PetaboxError also [Bryan Newbold, 2020-01-28, 1 file, -2/+2]
* grobid worker: always set a key in response [Bryan Newbold, 2020-01-28, 1 file, -4/+25]
  We have key-based compaction enabled for the GROBID output topic, which means it is an error to publish to that topic without a key set. Hopefully this change will end these errors, which look like:
      KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"}
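  A sketch of the invariant: compute the key first, so every response (success or error) carries one. The sha1-of-blob choice and field names are assumptions:

    import hashlib
    from typing import Optional

    def grobid_response(blob: bytes, tei_xml: Optional[str] = None,
                        error_msg: Optional[str] = None) -> dict:
        # key is set unconditionally, before any branching, so the
        # key-compacted Kafka topic never sees a keyless message
        result = {"key": hashlib.sha1(blob).hexdigest()}
        if error_msg is not None:
            result.update(status="error", error_msg=error_msg)
        else:
            result.update(status="success", tei_xml=tei_xml)
        return result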
* grobid: fix error_msg typo; set status_code for timeouts [Bryan Newbold, 2020-01-21, 1 file, -1/+2]
* add 200 second timeout to GROBID requests [Bryan Newbold, 2020-01-17, 1 file, -8/+15]
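  A sketch of the bounded call; the endpoint is GROBID's standard fulltext API, while the result fields are illustrative:

    import requests

    def process_fulltext(session: requests.Session, host_url: str, blob: bytes) -> dict:
        try:
            resp = session.post(
                host_url + "/api/processFulltextDocument",
                files={"input": blob},
                timeout=200.0,  # the 200-second figure from the commit message
            )
        except requests.exceptions.Timeout:
            # report a status instead of hanging the worker indefinitely
            return {"status": "error-timeout", "error_msg": "GROBID request timeout"}
        return {"status_code": resp.status_code, "tei_xml": resp.text}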
* grobid worker fixes for newer ia lib refactors [Bryan Newbold, 2020-01-14, 1 file, -3/+9]
* fix grobid tests for new wayback refactors [Bryan Newbold, 2020-01-09, 1 file, -3/+3]
* be more parsimonious with GROBID metadata [Bryan Newbold, 2020-01-02, 1 file, -2/+4]
  Because these are getting persisted in the database (as well as Kafka), don't write out empty keys.
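  A minimal sketch of the idea (the function name is illustrative): drop empty values before the metadata is persisted or published:

    def trim_empty_fields(metadata: dict) -> dict:
        # omit keys whose values are None or empty containers/strings
        return {k: v for k, v in metadata.items() if v not in (None, "", [], {})}

    # example: empty abstract and empty author list are not written out
    print(trim_empty_fields({"title": "A Paper", "abstract": "", "authors": []}))
    # {'title': 'A Paper'}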
* fixes for large GROBID result skip [Bryan Newbold, 2019-12-02, 1 file, -2/+2]
* count empty blobs as 'failed' instead of crashing [Bryan Newbold, 2019-12-01, 1 file, -1/+2]
  Might be better to record an artificial kafka response instead?
* cleanup unused import [Bryan Newbold, 2019-12-01, 1 file, -1/+0]
* filter out very large GROBID XML bodies [Bryan Newbold, 2019-12-01, 1 file, -0/+6]
  This is to prevent Kafka MSG_SIZE_TOO_LARGE publish errors. We should probably bump this limit in the future. Open problems: hand-coding this size number isn't good, and it needs to be updated in two places; the filter shouldn't apply to non-Kafka sinks; and there might still be a corner case where the JSON-encoded XML is larger than the XML character string, due to encoding (eg, of unicode characters).
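  A sketch of the filter; the threshold here is invented (the commit only says the hard-coded number lives in two places), and as noted above the JSON-encoded payload can still be somewhat larger than the raw XML string:

    # illustrative threshold, comfortably under a typical Kafka message limit
    MAX_XML_SIZE = 12_000_000

    def filter_large_result(result: dict) -> dict:
        tei_xml = result.get("tei_xml", "")
        if tei_xml and len(tei_xml) > MAX_XML_SIZE:
            # keep the key so compaction still works; drop the oversized body
            return {
                "key": result.get("key"),
                "status": "error",
                "error_msg": f"response XML too large: {len(tei_xml)} bytes",
            }
        return result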
* much progress on file ingest path [Bryan Newbold, 2019-10-22, 1 file, -0/+14]
* we do actually want consolidateHeader=2, not 1 [Bryan Newbold, 2019-10-04, 1 file, -3/+3]
* grobid: fix 'consolidateHeaders' typo [Bryan Newbold, 2019-10-04, 1 file, -1/+1]
* disable citation consolidation by default [Bryan Newbold, 2019-10-04, 1 file, -1/+1]
  With this consolidation enabled, the glutton_fatcat elasticsearch server was totally pegged at over 90% CPU with only 10 PDF worker threads; the glutton load seemed to be the bottleneck even at this low degree of parallelism. Disabled for now; will debug with the GROBID/glutton folks.
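  These consolidation settings are GROBID form parameters on the processing endpoints; a sketch of the defaults the surrounding commits converge on (the helper name is illustrative):

    def grobid_post_params(consolidate_citations: bool = False) -> dict:
        return {
            # '2' per the "we do actually want consolidateHeader=2" commit
            "consolidateHeader": "2",
            # citation consolidation off by default: with it enabled, the
            # biblio-glutton/elasticsearch backend pegged above 90% CPU
            "consolidateCitations": "1" if consolidate_citations else "0",
        }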
* fix GROBID POST flags [Bryan Newbold, 2019-10-04, 1 file, -1/+3]
* handle GROBID fetch empty blob condition [Bryan Newbold, 2019-10-03, 1 file, -1/+2]
* have grobidworker error status indicate issues instead of bailing [Bryan Newbold, 2019-10-02, 1 file, -4/+13]
* more counts and bugfixes in grobid_tool [Bryan Newbold, 2019-09-26, 1 file, -4/+0]
* small improvements to GROBID tool [Bryan Newbold, 2019-09-26, 1 file, -0/+4]
* lots of grobid tool implementation (still WIP) [Bryan Newbold, 2019-09-26, 1 file, -3/+63]
* start refactoring sandcrawler python common code [Bryan Newbold, 2019-09-23, 1 file, -0/+44]