path: root/python/sandcrawler/db.py
Commit message                                              Author         Age         Files  Lines
html: start on SQL table                                    Bryan Newbold  2020-11-03  1      -0/+44
fixes and tweaks from testing locally                       Bryan Newbold  2020-06-17  1      -0/+47
pdf_trio persist fixes from prod                            Bryan Newbold  2020-02-19  1      -4/+4
include rel and oa_status in ingest request 'extra'         Bryan Newbold  2020-02-18  1      -1/+1
pdftrio basic python code                                   Bryan Newbold  2020-02-12  1      -0/+57
    This is basically just a copy/paste of GROBID code, only simpler!
fix bug where ingest_request extra fields not persisted     Bryan Newbold  2020-02-05  1      -1/+2
persist grobid: actually, status_code is required           Bryan Newbold  2020-01-21  1      -1/+1
    Instead of working around when missing, force it to exist but skip in
    database insert section. Disk mode still needs to check if blank.
persist: work around GROBID timeouts with no status_code    Bryan Newbold  2020-01-21  1      -1/+1
persist: fix dupe field copying                             Bryan Newbold  2020-01-15  1      -1/+8
    In testing hit: AttributeError: 'str' object has no attribute 'get'
persist worker: implement updated ingest result semantics   Bryan Newbold  2020-01-15  1      -1/+1
small fixups to SandcrawlerPostgrestClient                  Bryan Newbold  2020-01-14  1      -1/+10
db: move duplicate row filtering into DB insert helpers     Bryan Newbold  2020-01-02  1      -0/+25
fix DB import counting                                      Bryan Newbold  2020-01-02  1      -4/+5
fix small errors found by pylint                            Bryan Newbold  2020-01-02  1      -1/+1
db: fancy insert/update separation using postgres xmax      Bryan Newbold  2020-01-02  1      -15/+30
improve DB helpers                                          Bryan Newbold  2020-01-02  1      -26/+81
    - return insert/update row counts
    - implement ON CONFLICT ... DO UPDATE on some tables
start work on DB connector and minio client                 Bryan Newbold  2020-01-02  1      -0/+141
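
Several of the 2020-01-02 commits describe a pattern without showing it: filter duplicate rows, upsert with ON CONFLICT ... DO UPDATE, and use the PostgreSQL xmax system column to report insert and update counts separately. The following is a minimal sketch of that general pattern, not the code in db.py. The table example_file, its columns, and the helper names are hypothetical, and the cursor is assumed to be any DB-API cursor connected to PostgreSQL (for example from psycopg2). The sketch relies on xmax being 0 for a freshly inserted row, which is commonly used this way but is not a documented guarantee.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical table and columns, chosen only for illustration.
UPSERT_SQL = """
    INSERT INTO example_file (sha1hex, status)
    VALUES (%s, %s)
    ON CONFLICT (sha1hex) DO UPDATE SET status = EXCLUDED.status
    RETURNING (xmax = 0) AS inserted
"""


def dedupe_by_key(rows: Iterable[Dict], key: str) -> List[Dict]:
    """Keep the last row seen for each key, ordered by first appearance."""
    unique: Dict = {}
    for row in rows:
        unique[row[key]] = row
    return list(unique.values())


def count_insert_update(flags: List[bool]) -> Tuple[int, int]:
    """Turn per-row 'inserted' flags into (insert_count, update_count)."""
    inserted = sum(1 for flag in flags if flag)
    return inserted, len(flags) - inserted


def upsert_rows(cursor, rows: Iterable[Dict]) -> Tuple[int, int]:
    """Upsert rows one at a time and return (inserted, updated) counts."""
    flags = []
    for row in dedupe_by_key(rows, "sha1hex"):
        cursor.execute(UPSERT_SQL, (row["sha1hex"], row["status"]))
        # The RETURNING clause yields one boolean: True if the row was new.
        flags.append(bool(cursor.fetchone()[0]))
    return count_insert_update(flags)
```

Filtering duplicates first avoids redundant writes here, and it becomes necessary if the rows are sent as a single multi-row INSERT, because PostgreSQL rejects an ON CONFLICT ... DO UPDATE statement that would touch the same row twice.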