Commit message (Collapse) | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 1 | -0/+40 |
| | |||||
* | crossref grobid refs: another error case (ReadTimeout) | Bryan Newbold | 2021-11-04 | 1 | -1/+5 |
| | | | | | With this last exception handled, was able to get through millions of rows of references, with only a few dozen errors (mostly invalid XML). | ||||
* | grobid crossref refs: try to handle HTTP 5xx and XML parse errors | Bryan Newbold | 2021-11-04 | 1 | -1/+9 |
| | |||||
* | crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 1 | -7/+16 |
| | |||||
* | glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 1 | -0/+45 |
| | |||||
* | remove grobid2json helper file, replace with grobid_tei_xml | Bryan Newbold | 2021-10-27 | 1 | -1/+1 |
| | |||||
* | small type annotation things from additional packages | Bryan Newbold | 2021-10-27 | 1 | -2/+4 |
| | |||||
* | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -212/+238 |
| | |||||
* | lint collection membership (last lint for now) | Bryan Newbold | 2021-10-26 | 1 | -5/+5 |
| | |||||
* | type annotations for persist workers; required some work | Bryan Newbold | 2021-10-26 | 1 | -66/+59 |
| | | | | | Had to restructure and filter things a bit. Should be better behavior, but there might be some small changes. | ||||
* | start adding python type annotations to db and persist code | Bryan Newbold | 2021-10-26 | 1 | -2/+4 |
| | |||||
* | flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 1 | -2/+2 |
| | |||||
* | make fmt | Bryan Newbold | 2021-10-26 | 1 | -27/+34 |
| | |||||
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -4/+4 |
| | |||||
* | persist support for ingest platform table, using existing persist worker | Bryan Newbold | 2021-10-15 | 1 | -1/+62 |
| | |||||
* | improve fileset ingest integration with file ingest | Bryan Newbold | 2021-10-15 | 1 | -1/+1 |
| | |||||
* | wrap up previous renaming work | Bryan Newbold | 2021-10-15 | 1 | -1/+1 |
| | |||||
* | persist: skip very long URLs | Bryan Newbold | 2021-04-12 | 1 | -0/+4 |
| | |||||
* | persist: html_meta is ON CONFLICT DO UPDATE | Bryan Newbold | 2020-12-15 | 1 | -1/+1 |
| | |||||
* | persist: don't expect HTML TEI-XML in result object | Bryan Newbold | 2020-12-15 | 1 | -1/+1 |
| | |||||
* | html: refactors/tweaks from testing | Bryan Newbold | 2020-11-06 | 1 | -1/+0 |
| | |||||
* | persist: fix worker API/typing hacks (raw_key, key, key_str) | Bryan Newbold | 2020-11-04 | 1 | -9/+9 |
| | |||||
* | initial implementation of HTML ingest in existing worker | Bryan Newbold | 2020-11-04 | 1 | -3/+17 |
| | |||||
* | small fixes from local testing for XML ingest | Bryan Newbold | 2020-11-03 | 1 | -3/+8 |
| | |||||
* | persist: XML and HTML persist workers | Bryan Newbold | 2020-11-03 | 1 | -3/+74 |
| | |||||
* | refactor 'minio' to 'seaweedfs'; and BLOB env vars | Bryan Newbold | 2020-11-03 | 1 | -2/+4 |
| | | | | | This goes along with changes to ansible deployment to use the correct key names and values. | ||||
* | changes from prod | Bryan Newbold | 2020-06-25 | 1 | -4/+12 |
| | |||||
* | fixes and tweaks from testing locally | Bryan Newbold | 2020-06-17 | 1 | -11/+18 |
| | |||||
* | tweak kafka topic names and seaweedfs layout | Bryan Newbold | 2020-06-17 | 1 | -1/+2 |
| | |||||
* | add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -0/+99 |
| | |||||
* | workers: refactor to pass key to process() | Bryan Newbold | 2020-06-17 | 1 | -6/+6 |
| | |||||
* | persist: only GROBID updates file_meta, not file-result | Bryan Newbold | 2020-04-16 | 1 | -1/+1 |
| | | | | | | | | | The hope here is to reduce deadlocks in production (on aitio). As context, we are only doing "updates" until the entire file_meta table is filled in with full metadata anyways; updates are wasteful of resources, and for most inserts we have already seen the file, so we should be doing "DO NOTHING" if the SHA1 is already in the table. | ||||
* | persist grobid: add option to skip S3 upload | Bryan Newbold | 2020-03-19 | 1 | -7/+10 |
| | | | | | | | Motivation for this is that the current S3 target (minio) is overloaded, with too many files on a single partition (80 million+). Going to look into seaweedfs and other options, but for now stopping minio persist. Data is all stored in kafka anyways. | ||||
* | fixes to ingest-request persist | Bryan Newbold | 2020-03-05 | 1 | -3/+1 |
| | |||||
* | persist: ingest_request tool (with no ingest_file_result) | Bryan Newbold | 2020-03-05 | 1 | -0/+29 |
| | |||||
* | pdf_trio persist fixes from prod | Bryan Newbold | 2020-02-19 | 1 | -1/+5 |
| | |||||
* | include rel and oa_status in ingest request 'extra' | Bryan Newbold | 2020-02-18 | 1 | -1/+1 |
| | |||||
* | move pdf_trio results back under key in JSON/Kafka | Bryan Newbold | 2020-02-13 | 1 | -1/+9 |
| | |||||
* | pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -0/+21 |
| | | | | This is basically just a copy/paste of GROBID code, only simpler! | ||||
* | fix persist bug where ingest_request_source not saved | Bryan Newbold | 2020-02-05 | 1 | -0/+1 |
| | |||||
* | persist grobid: actually, status_code is required | Bryan Newbold | 2020-01-21 | 1 | -2/+9 |
| | | | | | | | Instead of working around a missing status_code, force it to exist, but skip it in the database insert section. Disk mode still needs to check if it is blank. | ||||
* | persist: work around GROBID timeouts with no status_code | Bryan Newbold | 2020-01-21 | 1 | -2/+2 |
| | |||||
* | persist worker: implement updated ingest result semantics | Bryan Newbold | 2020-01-15 | 1 | -11/+16 |
| | |||||
* | ingest persist skips 'existing' ingest results | Bryan Newbold | 2020-01-14 | 1 | -0/+3 |
| | |||||
* | handle grobid2json errors in calling code instead | Bryan Newbold | 2020-01-02 | 1 | -1/+7 |
| | |||||
* | db: move duplicate row filtering into DB insert helpers | Bryan Newbold | 2020-01-02 | 1 | -15/+1 |
| | |||||
* | remove unused filter in grobid worker | Bryan Newbold | 2020-01-02 | 1 | -1/+0 |
| | |||||
* | fix dict typo | Bryan Newbold | 2020-01-02 | 1 | -1/+1 |
| | |||||
* | improvements to grobid persist worker | Bryan Newbold | 2020-01-02 | 1 | -13/+16 |
| | |||||
* | filter ingest results to not have key conflicts within batch | Bryan Newbold | 2020-01-02 | 1 | -1/+16 |
| | | | | | This handles a corner case with ON CONFLICT ... DO UPDATE where you can't do multiple such updates in the same batch transaction. |
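
The corner case in the last entry can be sketched as follows. As far as I understand it, PostgreSQL rejects a multi-row `INSERT ... ON CONFLICT ... DO UPDATE` that would touch the same row twice ("ON CONFLICT DO UPDATE command cannot affect row a second time"), so the batch needs at most one row per conflict key before it is sent. This is a minimal sketch and not the code from the commit: the helper name `dedupe_batch`, the `sha1hex` and `status` field names, and the choice to keep the last row per key are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple


def dedupe_batch(
    rows: Iterable[Dict[str, Any]], key_fields: Sequence[str]
) -> List[Dict[str, Any]]:
    """Return at most one row per key (hypothetical helper).

    The last row seen for a key wins; output follows first-seen key order.
    """
    latest: Dict[Tuple[Any, ...], Dict[str, Any]] = {}
    for row in rows:
        # Build the conflict key from the (assumed) key columns.
        key = tuple(row[field] for field in key_fields)
        # Re-assigning an existing key keeps its original position in the dict.
        latest[key] = row
    return list(latest.values())


# Hypothetical batch with two rows that share the same key.
batch = [
    {"sha1hex": "aaa", "status": "old"},
    {"sha1hex": "bbb", "status": "ok"},
    {"sha1hex": "aaa", "status": "new"},
]
unique_rows = dedupe_batch(batch, key_fields=("sha1hex",))
```

Keeping the last row per key matches "latest write wins" semantics; keeping the first would be an equally plausible choice, and the commit message does not say which one the actual code uses.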