path: root/python/sandcrawler/grobid.py
Commit message | Author | Date | Files | Lines
* ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID) | Bryan Newbold | 2022-05-16 | 1 | -0/+9
* grobid: set a maximum file size (256 MByte) | Bryan Newbold | 2021-12-07 | 1 | -0/+8
* make fmt | Bryan Newbold | 2021-11-16 | 1 | -1/+1
* grobid: handle XML parsing errors, and have them recorded in sandcrawler-db | Bryan Newbold | 2021-11-12 | 1 | -1/+5
* grobid: extract more metadata in document TEI-XML | Bryan Newbold | 2021-11-10 | 1 | -0/+5
* grobid: update 'TODO' comment based on review | Bryan Newbold | 2021-11-04 | 1 | -3/+0
* crossref grobid refs: another error case (ReadTimeout) | Bryan Newbold | 2021-11-04 | 1 | -4/+6
* grobid: use requests session | Bryan Newbold | 2021-11-04 | 1 | -3/+4
* grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 1 | -4/+24
* grobid: handle weird whitespace unstructured from crossref | Bryan Newbold | 2021-11-04 | 1 | -1/+10
* iterated GROBID citation cleaning and processing | Bryan Newbold | 2021-11-04 | 1 | -27/+45
* grobid citations: first pass at cleaning unstructured | Bryan Newbold | 2021-11-04 | 1 | -2/+34
* initial crossref-refs via GROBID helper routine | Bryan Newbold | 2021-11-04 | 1 | -4/+121
* remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 1 | -3/+4
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -50/+55
* fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 1 | -1/+2
* more progress on type annotations | Bryan Newbold | 2021-10-26 | 1 | -1/+3
* grobid: fix a bug with consolidate_mode header, exposed by type annotations | Bryan Newbold | 2021-10-26 | 1 | -1/+2
* grobid: type annotations | Bryan Newbold | 2021-10-26 | 1 | -9/+19
* start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -3/+1
* make fmt | Bryan Newbold | 2021-10-26 | 1 | -13/+17
* python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -1/+3
* grobid: disable biblio-glutton consolidation | Bryan Newbold | 2021-04-07 | 1 | -3/+3
* differential wayback-error from wayback-content-error | Bryan Newbold | 2020-10-21 | 1 | -1/+0
* workers: refactor to pass key to process() | Bryan Newbold | 2020-06-17 | 1 | -2/+2
* refactor worker fetch code into wrapper class | Bryan Newbold | 2020-06-16 | 1 | -60/+9
* timeout message implementation for GROBID and ingest workers | Bryan Newbold | 2020-04-27 | 1 | -0/+9
* grobid petabox: fix fetch body/content | Bryan Newbold | 2020-02-03 | 1 | -1/+1
* grobid worker: catch PetaboxError also | Bryan Newbold | 2020-01-28 | 1 | -2/+2
* grobid worker: always set a key in response | Bryan Newbold | 2020-01-28 | 1 | -4/+25
* grobid: fix error_msg typo; set status_code for timeouts | Bryan Newbold | 2020-01-21 | 1 | -1/+2
* add 200 second timeout to GROBID requests | Bryan Newbold | 2020-01-17 | 1 | -8/+15
* grobid worker fixes for newer ia lib refactors | Bryan Newbold | 2020-01-14 | 1 | -3/+9
* fix grobid tests for new wayback refactors | Bryan Newbold | 2020-01-09 | 1 | -3/+3
* be more parsimonious with GROBID metadata | Bryan Newbold | 2020-01-02 | 1 | -2/+4
* fixes for large GROBID result skip | Bryan Newbold | 2019-12-02 | 1 | -2/+2
* count empty blobs as 'failed' instead of crashing | Bryan Newbold | 2019-12-01 | 1 | -1/+2
* cleanup unused import | Bryan Newbold | 2019-12-01 | 1 | -1/+0
* filter out very large GROBID XML bodies | Bryan Newbold | 2019-12-01 | 1 | -0/+6
* much progress on file ingest path | Bryan Newbold | 2019-10-22 | 1 | -0/+14
* we do actually want consolidateHeader=2, not 1 | Bryan Newbold | 2019-10-04 | 1 | -3/+3
* grobid: consolidateHeaders typo | Bryan Newbold | 2019-10-04 | 1 | -1/+1
* disable citation consolidation by default | Bryan Newbold | 2019-10-04 | 1 | -1/+1
* fix GROBID POST flags | Bryan Newbold | 2019-10-04 | 1 | -1/+3
* handle GROBID fetch empty blob condition | Bryan Newbold | 2019-10-03 | 1 | -1/+2
* have grobidworker error status indicate issues instead of bailing | Bryan Newbold | 2019-10-02 | 1 | -4/+13
* more counts and bugfixes in grobid_tool | Bryan Newbold | 2019-09-26 | 1 | -4/+0
* small improvements to GROBID tool | Bryan Newbold | 2019-09-26 | 1 | -0/+4
* lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -3/+63
* start refactoring sandcrawler python common code | Bryan Newbold | 2019-09-23 | 1 | -0/+44