path: root/python/persist_tool.py
Commit message | Author | Date | Files | Lines
* lint fixes | Bryan Newbold | 2020-06-17 | 1 | -2/+1
* add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -0/+30
* persist grobid: add option to skip S3 upload | Bryan Newbold | 2020-03-19 | 1 | -0/+4
    Motivation for this is that current S3 target (minio) is overloaded, with too many files on a single partition (80 million+). Going to look in to seaweedfs and other options, but for now stopping minio persist. Data is all stored in kafka anyways.
* fixes to ingest-request persist | Bryan Newbold | 2020-03-05 | 1 | -1/+1
* persist: ingest_request tool (with no ingest_file_result) | Bryan Newbold | 2020-03-05 | 1 | -0/+18
* pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -0/+18
    This is basically just a copy/paste of GROBID code, only simpler!
* improve sentry reporting with 'release' git hash | Bryan Newbold | 2020-01-15 | 1 | -1/+0
* more ftp status 226 support | Bryan Newbold | 2020-01-14 | 1 | -1/+1
* add PersistGrobidDiskWorker | Bryan Newbold | 2020-01-02 | 1 | -0/+27
    To help with making dumps directly from Kafka (eg, for partner delivery)
* flush out minio helper, add to grobid persist | Bryan Newbold | 2020-01-02 | 1 | -2/+20
* start work on persist workers and tool | Bryan Newbold | 2020-01-02 | 1 | -0/+98