path: root/python/persist_tool.py
Commit message | Author | Age | Files | Lines
* simple persist worker/tool to backfill grobid_refs | Bryan Newbold | 2021-11-10 | 1 | -0/+22
* crossref persist: batch size depends on whether parsing refs | Bryan Newbold | 2021-11-04 | 1 | -1/+4
* crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 1 | -0/+6
* glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 1 | -0/+30
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -69/+109
* make fmt | Bryan Newbold | 2021-10-26 | 1 | -63/+62
* python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -1/+1
* refactor 'minio' to 'seaweedfs'; and BLOB env vars | Bryan Newbold | 2020-11-03 | 1 | -9/+9
    This goes along with changes to ansible deployment to use the correct key names and values.
* lint fixes | Bryan Newbold | 2020-06-17 | 1 | -2/+1
* add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -0/+30
* persist grobid: add option to skip S3 upload | Bryan Newbold | 2020-03-19 | 1 | -0/+4
    Motivation for this is that current S3 target (minio) is overloaded, with too many files on a single partition (80 million+). Going to look in to seaweedfs and other options, but for now stopping minio persist. Data is all stored in kafka anyways.
* fixes to ingest-request persist | Bryan Newbold | 2020-03-05 | 1 | -1/+1
* persist: ingest_request tool (with no ingest_file_result) | Bryan Newbold | 2020-03-05 | 1 | -0/+18
* pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -0/+18
    This is basically just a copy/paste of GROBID code, only simpler!
* improve sentry reporting with 'release' git hash | Bryan Newbold | 2020-01-15 | 1 | -1/+0
* more ftp status 226 support | Bryan Newbold | 2020-01-14 | 1 | -1/+1
* add PersistGrobidDiskWorker | Bryan Newbold | 2020-01-02 | 1 | -0/+27
    To help with making dumps directly from Kafka (eg, for partner delivery)
* flush out minio helper, add to grobid persist | Bryan Newbold | 2020-01-02 | 1 | -2/+20
* start work on persist workers and tool | Bryan Newbold | 2020-01-02 | 1 | -0/+98