| Commit message | Author | Date | Files | Lines |
|---|---|---|---|---|
| html: refactors/tweaks from testing | Bryan Newbold | 2020-11-06 | 1 | -1/+0 |
| persist: fix worker API/typing hacks (raw_key, key, key_str) | Bryan Newbold | 2020-11-04 | 1 | -9/+9 |
| initial implementation of HTML ingest in existing worker | Bryan Newbold | 2020-11-04 | 1 | -3/+17 |
| small fixes from local testing for XML ingest | Bryan Newbold | 2020-11-03 | 1 | -3/+8 |
| persist: XML and HTML persist workers | Bryan Newbold | 2020-11-03 | 1 | -3/+74 |
| refactor 'minio' to 'seaweedfs'; and BLOB env vars. *This goes along with changes to the ansible deployment to use the correct key names and values.* | Bryan Newbold | 2020-11-03 | 1 | -2/+4 |
| changes from prod | Bryan Newbold | 2020-06-25 | 1 | -4/+12 |
| fixes and tweaks from testing locally | Bryan Newbold | 2020-06-17 | 1 | -11/+18 |
| tweak kafka topic names and seaweedfs layout | Bryan Newbold | 2020-06-17 | 1 | -1/+2 |
| add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -0/+99 |
| workers: refactor to pass key to process() | Bryan Newbold | 2020-06-17 | 1 | -6/+6 |
| persist: only GROBID updates file_meta, not file-result. *The hope is to reduce deadlocks in production (on aitio). As context, we are only doing "updates" until the entire file_meta table is filled in with full metadata anyway; updates are wasteful of resources, and for most inserts we have seen the file before, so we should be doing "DO NOTHING" if the SHA1 is already in the table.* | Bryan Newbold | 2020-04-16 | 1 | -1/+1 |
| persist grobid: add option to skip S3 upload. *Motivation: the current S3 target (minio) is overloaded, with too many files on a single partition (80 million+). Going to look into seaweedfs and other options, but for now stopping minio persist. The data is all stored in Kafka anyway.* | Bryan Newbold | 2020-03-19 | 1 | -7/+10 |
| fixes to ingest-request persist | Bryan Newbold | 2020-03-05 | 1 | -3/+1 |
| persist: ingest_request tool (with no ingest_file_result) | Bryan Newbold | 2020-03-05 | 1 | -0/+29 |
| pdf_trio persist fixes from prod | Bryan Newbold | 2020-02-19 | 1 | -1/+5 |
| include rel and oa_status in ingest request 'extra' | Bryan Newbold | 2020-02-18 | 1 | -1/+1 |
| move pdf_trio results back under key in JSON/Kafka | Bryan Newbold | 2020-02-13 | 1 | -1/+9 |
| pdftrio basic python code. *This is basically just a copy/paste of the GROBID code, only simpler!* | Bryan Newbold | 2020-02-12 | 1 | -0/+21 |
| fix persist bug where ingest_request_source not saved | Bryan Newbold | 2020-02-05 | 1 | -0/+1 |
| persist grobid: actually, status_code is required. *Instead of working around a missing status_code, force it to exist but skip it in the database insert section. Disk mode still needs to check if it is blank.* | Bryan Newbold | 2020-01-21 | 1 | -2/+9 |
| persist: work around GROBID timeouts with no status_code | Bryan Newbold | 2020-01-21 | 1 | -2/+2 |
| persist worker: implement updated ingest result semantics | Bryan Newbold | 2020-01-15 | 1 | -11/+16 |
| ingest persist skips 'existing' ingest results | Bryan Newbold | 2020-01-14 | 1 | -0/+3 |
| handle grobid2json errors in calling code instead | Bryan Newbold | 2020-01-02 | 1 | -1/+7 |
| db: move duplicate row filtering into DB insert helpers | Bryan Newbold | 2020-01-02 | 1 | -15/+1 |
| remove unused filter in grobid worker | Bryan Newbold | 2020-01-02 | 1 | -1/+0 |
| fix dict typo | Bryan Newbold | 2020-01-02 | 1 | -1/+1 |
| improvements to grobid persist worker | Bryan Newbold | 2020-01-02 | 1 | -13/+16 |
| filter ingest results to not have key conflicts within batch. *This handles a corner case with ON CONFLICT ... DO UPDATE, where you can't apply multiple such updates to the same row in one batch transaction.* | Bryan Newbold | 2020-01-02 | 1 | -1/+16 |
| db: fancy insert/update separation using postgres xmax | Bryan Newbold | 2020-01-02 | 1 | -9/+15 |
| add PersistGrobidDiskWorker. *To help with making dumps directly from Kafka (e.g., for partner delivery).* | Bryan Newbold | 2020-01-02 | 1 | -0/+33 |
| flush out minio helper, add to grobid persist | Bryan Newbold | 2020-01-02 | 1 | -9/+29 |
| implement counts properly for persist workers | Bryan Newbold | 2020-01-02 | 1 | -15/+19 |
| start work on persist workers and tool | Bryan Newbold | 2020-01-02 | 1 | -0/+223 |
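The "filter ingest results to not have key conflicts within batch" entry above works around a real PostgreSQL restriction: a single `INSERT ... ON CONFLICT DO UPDATE` statement cannot affect the same row twice, so each batch must contain at most one row per unique key before it is handed to the insert helper. A minimal sketch of that filtering, assuming rows are dicts; the helper name and key fields are hypothetical, not the actual sandcrawler code:

```python
def dedupe_batch(rows, key_fields=("link_source", "link_source_id", "base_url")):
    """Drop rows that share a unique key within one batch.

    PostgreSQL's ON CONFLICT DO UPDATE refuses to touch the same row
    twice in a single statement, so each batch may contain at most one
    row per unique key. The last occurrence of a key wins.
    """
    seen = {}
    for row in rows:
        seen[tuple(row[f] for f in key_fields)] = row
    return list(seen.values())
```

Keeping the last occurrence mirrors "latest result wins" semantics; keeping the first would be equally valid if earlier results should take precedence.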
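The "db: fancy insert/update separation using postgres xmax" entry above refers to a documented PostgreSQL behavior: when an upsert statement ends with `RETURNING xmax`, freshly inserted rows come back with `xmax = 0`, while rows taken by the `DO UPDATE` branch have a nonzero `xmax`. That lets a persist worker report accurate insert vs. update counts from a single batch statement. A hedged sketch of the pattern; the SQL is standard PostgreSQL, but the table and column names here are made up, not the real sandcrawler schema:

```python
# Illustrative upsert: rows returned with xmax == 0 were inserted,
# rows with nonzero xmax were updated in place (table/columns hypothetical).
UPSERT_SQL = """
    INSERT INTO grobid (sha1hex, status_code, metadata)
    VALUES (%s, %s, %s)
    ON CONFLICT (sha1hex) DO UPDATE
        SET status_code = EXCLUDED.status_code,
            metadata = EXCLUDED.metadata
    RETURNING xmax;
"""

def count_upsert_results(xmax_values):
    """Split RETURNING xmax values into (inserted, updated) counts.

    In PostgreSQL, xmax == 0 on a RETURNING row means the row was freshly
    inserted; a nonzero xmax means an existing row was updated.
    """
    inserted = sum(1 for x in xmax_values if x == 0)
    return inserted, len(xmax_values) - inserted
```

In practice the worker would execute the statement (e.g. with psycopg2's `execute_values` over a batch), collect the returned `xmax` column, and feed it to `count_upsert_results()` to drive its counters.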