path: root/python/sandcrawler/persist.py
Commit message | Author | Age | Files | Lines
* changes from prod | Bryan Newbold | 2020-06-25 | 1 | -4/+12
* fixes and tweaks from testing locally | Bryan Newbold | 2020-06-17 | 1 | -11/+18
* tweak kafka topic names and seaweedfs layout | Bryan Newbold | 2020-06-17 | 1 | -1/+2
* add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -0/+99
* workers: refactor to pass key to process() | Bryan Newbold | 2020-06-17 | 1 | -6/+6
* persist: only GROBID updates file_meta, not file-result | Bryan Newbold | 2020-04-16 | 1 | -1/+1
    The hope here is to reduce deadlocks in production (on aitio). As
    context, we are only doing "updates" until the entire file_meta table
    is filled in with full metadata anyways; updates are wasteful of
    resources, and most inserts we have seen the file before, so should be
    doing "DO NOTHING" if the SHA1 is already in the table.
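The insert-only behavior described in the commit above can be sketched roughly as follows. This is not the sandcrawler code: it uses SQLite from the Python standard library as a stand-in for Postgres (both accept `ON CONFLICT ... DO NOTHING`), and the table layout and the `insert_file_meta` helper are hypothetical.

```python
import sqlite3


def insert_file_meta(conn, rows):
    """Insert rows, skipping any whose SHA1 is already in the table.

    Hypothetical helper. Returns the number of rows actually inserted.
    """
    before = conn.total_changes
    conn.executemany(
        "INSERT INTO file_meta (sha1hex, size_bytes, mimetype) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (sha1hex) DO NOTHING",
        rows,
    )
    conn.commit()
    # total_changes only counts rows that were really written,
    # so skipped conflicts do not show up in the difference.
    return conn.total_changes - before


# Assumed, simplified schema: one row per file, keyed by SHA1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_meta ("
    "sha1hex TEXT PRIMARY KEY, size_bytes INTEGER, mimetype TEXT)"
)

first = insert_file_meta(conn, [("aaa", 100, "application/pdf")])
# The second batch repeats "aaa"; that row is left untouched.
second = insert_file_meta(
    conn, [("aaa", 999, "text/html"), ("bbb", 200, "application/pdf")]
)
stored = conn.execute(
    "SELECT size_bytes FROM file_meta WHERE sha1hex = 'aaa'"
).fetchone()[0]
```

Because a conflicting row is skipped rather than rewritten, a repeated file costs no row update, which is the saving the commit message is after.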
* persist grobid: add option to skip S3 upload | Bryan Newbold | 2020-03-19 | 1 | -7/+10
    Motivation for this is that current S3 target (minio) is overloaded,
    with too many files on a single partition (80 million+). Going to look
    in to seaweedfs and other options, but for now stopping minio persist.
    Data is all stored in kafka anyways.
* fixes to ingest-request persist | Bryan Newbold | 2020-03-05 | 1 | -3/+1
* persist: ingest_request tool (with no ingest_file_result) | Bryan Newbold | 2020-03-05 | 1 | -0/+29
* pdf_trio persist fixes from prod | Bryan Newbold | 2020-02-19 | 1 | -1/+5
* include rel and oa_status in ingest request 'extra' | Bryan Newbold | 2020-02-18 | 1 | -1/+1
* move pdf_trio results back under key in JSON/Kafka | Bryan Newbold | 2020-02-13 | 1 | -1/+9
* pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -0/+21
    This is basically just a copy/paste of GROBID code, only simpler!
* fix persist bug where ingest_request_source not saved | Bryan Newbold | 2020-02-05 | 1 | -0/+1
* persist grobid: actually, status_code is required | Bryan Newbold | 2020-01-21 | 1 | -2/+9
    Instead of working around when missing, force it to exist but skip in
    database insert section. Disk mode still needs to check if blank.
* persist: work around GROBID timeouts with no status_code | Bryan Newbold | 2020-01-21 | 1 | -2/+2
* persist worker: implement updated ingest result semantics | Bryan Newbold | 2020-01-15 | 1 | -11/+16
* ingest persist skips 'existing' ingest results | Bryan Newbold | 2020-01-14 | 1 | -0/+3
* handle grobid2json errors in calling code instead | Bryan Newbold | 2020-01-02 | 1 | -1/+7
* db: move duplicate row filtering into DB insert helpers | Bryan Newbold | 2020-01-02 | 1 | -15/+1
* remove unused filter in grobid worker | Bryan Newbold | 2020-01-02 | 1 | -1/+0
* fix dict typo | Bryan Newbold | 2020-01-02 | 1 | -1/+1
* improvements to grobid persist worker | Bryan Newbold | 2020-01-02 | 1 | -13/+16
* filter ingest results to not have key conflicts within batch | Bryan Newbold | 2020-01-02 | 1 | -1/+16
    This handles a corner case with ON CONFLICT ... DO UPDATE where you
    can't do multiple such updates in the same batch transaction.
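The batch filtering described in the commit above might look something like the sketch below. It is not the code from that commit: the `dedupe_batch` helper, the choice of `ingest_type` plus `base_url` as the conflict key, and the sample rows are all assumptions for illustration.

```python
def dedupe_batch(rows, key_fields=("ingest_type", "base_url")):
    """Collapse a batch so that each conflict key appears only once.

    Hypothetical helper. The last row seen for a key wins, and keys keep
    the order in which they first appeared in the batch.
    """
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        latest[key] = row  # overwrite any earlier row with the same key
    return list(latest.values())


# Made-up batch: the first and third rows share the same key.
batch = [
    {"ingest_type": "pdf", "base_url": "https://example.com/a", "status": "no-capture"},
    {"ingest_type": "pdf", "base_url": "https://example.com/b", "status": "success"},
    {"ingest_type": "pdf", "base_url": "https://example.com/a", "status": "success"},
]
unique = dedupe_batch(batch)
```

Running the filter before the upsert means a single statement never targets the same row twice, which is the corner case the commit message describes.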
* db: fancy insert/update separation using postgres xmax | Bryan Newbold | 2020-01-02 | 1 | -9/+15
* add PersistGrobidDiskWorker | Bryan Newbold | 2020-01-02 | 1 | -0/+33
    To help with making dumps directly from Kafka (eg, for partner delivery)
* flush out minio helper, add to grobid persist | Bryan Newbold | 2020-01-02 | 1 | -9/+29
* implement counts properly for persist workers | Bryan Newbold | 2020-01-02 | 1 | -15/+19
* start work on persist workers and tool | Bryan Newbold | 2020-01-02 | 1 | -0/+223