path: root/python/sandcrawler/persist.py
Commit message | Author | Age | Files | Lines
* persist: skip huge URLs | Bryan Newbold | 2022-09-28 | 1 | -0/+4
* simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 1 | -0/+40
* crossref grobid refs: another error case (ReadTimeout) | Bryan Newbold | 2021-11-04 | 1 | -1/+5
* grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 1 | -1/+9
* crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 1 | -7/+16
* glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 1 | -0/+45
* remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 1 | -1/+1
* small type annotation things from additional packages | Bryan Newbold | 2021-10-27 | 1 | -2/+4
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -212/+238
* lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 1 | -5/+5
* type annotations for persist workers; required some work | Bryan Newbold | 2021-10-26 | 1 | -66/+59
* start adding python type annotations to db and persist code | Bryan Newbold | 2021-10-26 | 1 | -2/+4
* flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 1 | -2/+2
* make fmt | Bryan Newbold | 2021-10-26 | 1 | -27/+34
* python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -4/+4
* persist support for ingest platform table, using existing persist worker | Bryan Newbold | 2021-10-15 | 1 | -1/+62
* improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 1 | -1/+1
* wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 1 | -1/+1
* persist: skip very long URLs | Bryan Newbold | 2021-04-12 | 1 | -0/+4
* persist: html_meta is ON CONFLICT DO UPDATE | Bryan Newbold | 2020-12-15 | 1 | -1/+1
* persist: don't expect HTML TEI-XML in result object | Bryan Newbold | 2020-12-15 | 1 | -1/+1
* html: refactors/tweaks from testing | Bryan Newbold | 2020-11-06 | 1 | -1/+0
* persist: fix worker API/typing hacks (raw_key, key, key_str) | Bryan Newbold | 2020-11-04 | 1 | -9/+9
* initial implementation of HTML ingest in existing worker | Bryan Newbold | 2020-11-04 | 1 | -3/+17
* small fixes from local testing for XML ingest | Bryan Newbold | 2020-11-03 | 1 | -3/+8
* persist: XML and HTML persist workers | Bryan Newbold | 2020-11-03 | 1 | -3/+74
* refactor 'minio' to 'seaweedfs'; and BLOB env vars | Bryan Newbold | 2020-11-03 | 1 | -2/+4
* changes from prod | Bryan Newbold | 2020-06-25 | 1 | -4/+12
* fixes and tweaks from testing locally | Bryan Newbold | 2020-06-17 | 1 | -11/+18
* tweak kafka topic names and seaweedfs layout | Bryan Newbold | 2020-06-17 | 1 | -1/+2
* add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -0/+99
* workers: refactor to pass key to process() | Bryan Newbold | 2020-06-17 | 1 | -6/+6
* persist: only GROBID updates file_meta, not file-result | Bryan Newbold | 2020-04-16 | 1 | -1/+1
* persist grobid: add option to skip S3 upload | Bryan Newbold | 2020-03-19 | 1 | -7/+10
* fixes to ingest-request persist | Bryan Newbold | 2020-03-05 | 1 | -3/+1
* persist: ingest_request tool (with no ingest_file_result) | Bryan Newbold | 2020-03-05 | 1 | -0/+29
* pdf_trio persist fixes from prod | Bryan Newbold | 2020-02-19 | 1 | -1/+5
* include rel and oa_status in ingest request 'extra' | Bryan Newbold | 2020-02-18 | 1 | -1/+1
* move pdf_trio results back under key in JSON/Kafka | Bryan Newbold | 2020-02-13 | 1 | -1/+9
* pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -0/+21
* fix persist bug where ingest_request_source not saved | Bryan Newbold | 2020-02-05 | 1 | -0/+1
* persist grobid: actually, status_code is required | Bryan Newbold | 2020-01-21 | 1 | -2/+9
* persist: work around GROBID timeouts with no status_code | Bryan Newbold | 2020-01-21 | 1 | -2/+2
* persist worker: implement updated ingest result semantics | Bryan Newbold | 2020-01-15 | 1 | -11/+16
* ingest persist skips 'existing' ingest results | Bryan Newbold | 2020-01-14 | 1 | -0/+3
* handle grobid2json errors in calling code instead | Bryan Newbold | 2020-01-02 | 1 | -1/+7
* db: move duplicate row filtering into DB insert helpers | Bryan Newbold | 2020-01-02 | 1 | -15/+1
* remove unused filter in grobid worker | Bryan Newbold | 2020-01-02 | 1 | -1/+0
* fix dict typo | Bryan Newbold | 2020-01-02 | 1 | -1/+1
* improvements to grobid persist worker | Bryan Newbold | 2020-01-02 | 1 | -13/+16