path: root/python/sandcrawler_worker.py
Commit message | Author | Date | Files | Lines
* python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -3/+4
* tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 1 | -0/+4
* new 'daily' and 'priority' ingest request topics | Bryan Newbold | 2021-09-30 | 1 | -1/+7
  The old ingest request queue was always getting lopsided, likely because it was scaled up (additional partitions) at some point in the past; hoping new topics will fix this. The new '-priority' queue is like '-bulk', but for smaller-volume SPN-like requests, e.g. interactive mode.
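  A minimal sketch of how a producer might route ingest requests to the separate topics, using confluent_kafka; the exact topic names, broker address, and the `mode` routing rule are assumptions for illustration, not the actual sandcrawler schema.

  ```python
  import json

  from confluent_kafka import Producer

  # Assumed topic names and routing rule; the real sandcrawler topic names
  # and request schema may differ.
  TOPIC_BY_MODE = {
      "priority": "sandcrawler-prod.ingest-file-requests-priority",  # small-volume, SPN-like/interactive
      "daily": "sandcrawler-prod.ingest-file-requests-daily",        # regular crawl-derived requests
      "bulk": "sandcrawler-prod.ingest-file-requests-bulk",          # large backfills, no SPNv2
  }

  producer = Producer({"bootstrap.servers": "localhost:9092"})

  def publish_ingest_request(request: dict, mode: str = "daily") -> None:
      """Serialize an ingest request and send it to the topic for the given mode."""
      topic = TOPIC_BY_MODE[mode]
      producer.produce(topic, json.dumps(request).encode("utf-8"))
      producer.flush()
  ```

  Keeping the small-volume priority traffic on its own topic means a burst of bulk requests cannot starve interactive requests sitting behind them in the same partitions.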
* html: actually publish HTML TEI-XML to body; fix dataflow through ingest a bit | Bryan Newbold | 2020-11-04 | 1 | -0/+6
* small fixes from local testing for XML ingest | Bryan Newbold | 2020-11-03 | 1 | -2/+0
* persist: XML and HTML persist workers | Bryan Newbold | 2020-11-03 | 1 | -0/+47
* ingest: handle publishing XML docs to kafka | Bryan Newbold | 2020-11-03 | 1 | -0/+6
* refactor 'minio' to 'seaweedfs'; and BLOB env vars | Bryan Newbold | 2020-11-03 | 1 | -10/+10
  This goes along with changes to ansible deployment to use the correct key names and values.
* better default CLI output (show usage) | Bryan Newbold | 2020-10-29 | 1 | -1/+1
* persist PDF extraction in ingest pipeline | Bryan Newbold | 2020-10-20 | 1 | -4/+16
  Ooof, didn't realize that this wasn't happening. Explains a lot of missing thumbnails in scholar!
* customize timeout per worker; 120sec for pdf-extract | Bryan Newbold | 2020-06-29 | 1 | -0/+1
  This is a stab-in-the-dark attempt to resolve long timeouts with this worker in prod.
* args.kafka_env refactor didn't happen (yet) | Bryan Newbold | 2020-06-25 | 1 | -2/+2
* s3-only mode persist workers use different consumer group | Bryan Newbold | 2020-06-25 | 1 | -2/+8
* sandcrawler_worker: remove duplicate run_pdf_extract() | Bryan Newbold | 2020-06-25 | 1 | -29/+0
* pdfextract worker | Bryan Newbold | 2020-06-25 | 1 | -1/+34
* fixes and tweaks from testing locally | Bryan Newbold | 2020-06-17 | 1 | -1/+2
* tweak kafka topic names and seaweedfs layout | Bryan Newbold | 2020-06-17 | 1 | -9/+10
* add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -0/+83
* skip-db option also for worker | Bryan Newbold | 2020-03-19 | 1 | -0/+4
* ingest: bulk workers don't hit SPNv2 | Bryan Newbold | 2020-02-13 | 1 | -0/+2
* pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -0/+19
  This is basically just a copy/paste of GROBID code, only simpler!
* sandcrawler_worker: ingest worker distinct consumer groups | Bryan Newbold | 2020-01-29 | 1 | -1/+3
  I'm in the process of resetting these consumer groups, so might as well take the opportunity to split by topic and use the new canonical naming format.
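  A sketch of giving each ingest worker a consumer group derived from the topic it consumes, so one topic's offsets can be reset without touching the others; the naming scheme and config values here are assumptions, not the actual sandcrawler convention.

  ```python
  from confluent_kafka import Consumer

  def make_consumer(kafka_hosts: str, consume_topic: str, env: str = "prod") -> Consumer:
      """Create a Kafka consumer whose group id is derived from its topic."""
      # Hypothetical naming scheme: one group per topic, prefixed by environment.
      group_id = f"{env}-{consume_topic.split('.')[-1]}-worker"
      consumer = Consumer({
          "bootstrap.servers": kafka_hosts,
          "group.id": group_id,
          "auto.offset.reset": "latest",
          "enable.auto.commit": False,
      })
      consumer.subscribe([consume_topic])
      return consumer
  ```

  With a distinct group per topic, resetting offsets for (say) the bulk ingest topic does not disturb the daily ingest workers.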
* make grobid-extract worker batch size 1 | Bryan Newbold | 2020-01-28 | 1 | -0/+1
  This is part of attempts to fix Kafka errors that look like they might be timeouts.
* improve sentry reporting with 'release' git hash | Bryan Newbold | 2020-01-15 | 1 | -1/+5
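  A sketch of tagging Sentry events with the current git commit hash as the 'release'; shown with the modern sentry_sdk client, though sandcrawler at the time may have used the older raven client, and the DSN handling here is assumed.

  ```python
  import subprocess

  import sentry_sdk

  def git_sha() -> str:
      """Return the current git commit hash, for tagging Sentry events."""
      return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

  # DSN is picked up from the SENTRY_DSN environment variable if not passed explicitly.
  sentry_sdk.init(release=git_sha())
  ```

  Tagging events with a release makes it possible to tell which deployed version of the worker produced a given error.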
* bulk ingest file request topic support | Bryan Newbold | 2020-01-14 | 1 | -1/+7
* grobid-to-kafka support in ingest worker | Bryan Newbold | 2020-01-14 | 1 | -0/+6
* update persist worker invocation to use batches | Bryan Newbold | 2020-01-02 | 1 | -15/+55
* fix sandcrawler persist workers | Bryan Newbold | 2020-01-02 | 1 | -8/+36
* start work on persist workers and tool | Bryan Newbold | 2020-01-02 | 1 | -5/+15
* refactor: improve argparse usage | Bryan Newbold | 2019-12-18 | 1 | -4/+8
  Use ArgumentDefaultsHelpFormatter and add help messages to all sub-commands.
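  The commit names ArgumentDefaultsHelpFormatter and per-subcommand help; a minimal sketch of that pattern with the standard library argparse module. The "example-worker" subcommand and its flags are illustrative, not the actual CLI.

  ```python
  import argparse
  import sys

  def main() -> None:
      parser = argparse.ArgumentParser(
          formatter_class=argparse.ArgumentDefaultsHelpFormatter)
      parser.add_argument("--kafka-hosts", default="localhost:9092",
                          help="Kafka broker (host:port) to connect to")
      subparsers = parser.add_subparsers(dest="command")

      # One sub-parser per worker; each gets its own help text, and default
      # values appear in --help output thanks to ArgumentDefaultsHelpFormatter.
      sub = subparsers.add_parser(
          "example-worker",
          formatter_class=argparse.ArgumentDefaultsHelpFormatter,
          help="consume from a Kafka topic and process each message")
      sub.add_argument("--batch-size", type=int, default=1,
                       help="messages to process per batch")

      args = parser.parse_args()
      if not args.command:
          # Show usage instead of failing silently when no sub-command is given.
          parser.print_help(file=sys.stderr)
          sys.exit(-1)

  if __name__ == "__main__":
      main()
  ```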
* update ingest-file batch size to 1 | Bryan Newbold | 2019-11-14 | 1 | -1/+1
  Was defaulting to 100, which I think was resulting in lots of consumer group timeouts and UNKNOWN_MEMBER_ID errors. Will probably switch back to batches of 10 or so, but with multi-processing or some other concurrent dispatch/processing.
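  Large batches mean long gaps between polls, which the broker can interpret as a dead consumer. A sketch of consuming in small batches with confluent_kafka; the `worker.process()` interface and batch-size parameter are hypothetical, not sandcrawler's actual worker API.

  ```python
  from confluent_kafka import Consumer

  def consume_forever(consumer: Consumer, worker, batch_size: int = 1) -> None:
      """Pull at most `batch_size` messages per call, process them, then commit.

      Small batches keep the consumer polling frequently, which avoids the
      group timeouts / UNKNOWN_MEMBER_ID errors described in the commit above.
      """
      while True:
          batch = consumer.consume(num_messages=batch_size, timeout=1.0)
          if not batch:
              continue
          for msg in batch:
              if msg.error():
                  raise Exception(msg.error())
              worker.process(msg.value())  # hypothetical worker interface
          # Commit offsets only after the whole batch has been processed.
          consumer.commit(asynchronous=False)
  ```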
* fix lint errors | Bryan Newbold | 2019-11-13 | 1 | -5/+10
* correct ingest-file consumer group | Bryan Newbold | 2019-11-13 | 1 | -1/+1
* add basic sandcrawler worker (kafka) | Bryan Newbold | 2019-11-13 | 1 | -0/+74