path: root/python/sandcrawler_worker.py
Commit message | Author | Age | Files | Lines
* more sentry config changes | Bryan Newbold | 2022-02-25 | 1 | -1/+1
* switch from 'raven' to 'sentry-sdk' | Bryan Newbold | 2022-02-24 | 1 | -8/+12
* sandcrawler_worker: add --skip-spn flag | Bryan Newbold | 2022-02-08 | 1 | -2/+7
* worker: add kafka_group_suffix option | Bryan Newbold | 2021-12-07 | 1 | -3/+19
* crossref persist: batch size depends on whether parsing refs | Bryan Newbold | 2021-11-04 | 1 | -1/+4
* crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 1 | -2/+11
* glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 1 | -2/+31
* make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -76/+102
* start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -2/+1
* make fmt | Bryan Newbold | 2021-10-26 | 1 | -64/+87
* python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -3/+4
* tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 1 | -0/+4
* new 'daily' and 'priority' ingest request topics | Bryan Newbold | 2021-09-30 | 1 | -1/+7
* html: actually publish HTML TEI-XML to body; fix dataflow though ingest a bit | Bryan Newbold | 2020-11-04 | 1 | -0/+6
* small fixes from local testing for XML ingest | Bryan Newbold | 2020-11-03 | 1 | -2/+0
* persist: XML and HTML persist workers | Bryan Newbold | 2020-11-03 | 1 | -0/+47
* ingest: handle publishing XML docs to kafka | Bryan Newbold | 2020-11-03 | 1 | -0/+6
* refactor 'minio' to 'seaweedfs'; and BLOB env vars | Bryan Newbold | 2020-11-03 | 1 | -10/+10
* better default CLI output (show usage) | Bryan Newbold | 2020-10-29 | 1 | -1/+1
* persist PDF extraction in ingest pipeline | Bryan Newbold | 2020-10-20 | 1 | -4/+16
* customize timeout per worker; 120sec for pdf-extract | Bryan Newbold | 2020-06-29 | 1 | -0/+1
* args.kafka_env refactor didn't happen (yet) | Bryan Newbold | 2020-06-25 | 1 | -2/+2
* s3-only mode persist workers use different consumer group | Bryan Newbold | 2020-06-25 | 1 | -2/+8
* sandcrawler_worker: remove duplicate run_pdf_extract() | Bryan Newbold | 2020-06-25 | 1 | -29/+0
* pdfextract worker | Bryan Newbold | 2020-06-25 | 1 | -1/+34
* fixes and tweaks from testing locally | Bryan Newbold | 2020-06-17 | 1 | -1/+2
* tweak kafka topic names and seaweedfs layout | Bryan Newbold | 2020-06-17 | 1 | -9/+10
* add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -0/+83
* skip-db option also for worker | Bryan Newbold | 2020-03-19 | 1 | -0/+4
* ingest: bulk workers don't hit SPNv2 | Bryan Newbold | 2020-02-13 | 1 | -0/+2
* pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -0/+19
* sandcrawler_worker: ingest worker distinct consumer groups | Bryan Newbold | 2020-01-29 | 1 | -1/+3
* make grobid-extract worker batch size 1 | Bryan Newbold | 2020-01-28 | 1 | -0/+1
* improve sentry reporting with 'release' git hash | Bryan Newbold | 2020-01-15 | 1 | -1/+5
* bulk ingest file request topic support | Bryan Newbold | 2020-01-14 | 1 | -1/+7
* grobid-to-kafka support in ingest worker | Bryan Newbold | 2020-01-14 | 1 | -0/+6
* update persist worker invocation to use batches | Bryan Newbold | 2020-01-02 | 1 | -15/+55
* fix sandcrawler persist workers | Bryan Newbold | 2020-01-02 | 1 | -8/+36
* start work on persist workers and tool | Bryan Newbold | 2020-01-02 | 1 | -5/+15
* refactor: improve argparse usage | Bryan Newbold | 2019-12-18 | 1 | -4/+8
* update ingest-file batch size to 1 | Bryan Newbold | 2019-11-14 | 1 | -1/+1
* fix lint errors | Bryan Newbold | 2019-11-13 | 1 | -5/+10
* correct ingest-file consumer group | Bryan Newbold | 2019-11-13 | 1 | -1/+1
* add basic sandcrawler worker (kafka) | Bryan Newbold | 2019-11-13 | 1 | -0/+74