index : sandcrawler
path: root/python/sandcrawler_worker.py
Commit message | Author | Age | Files | Lines
more sentry config changes | Bryan Newbold | 2022-02-25 | 1 | -1/+1
switch from 'raven' to 'sentry-sdk' | Bryan Newbold | 2022-02-24 | 1 | -8/+12
sandcrawler_worker: add --skip-spn flag | Bryan Newbold | 2022-02-08 | 1 | -2/+7
worker: add kafka_group_suffix option | Bryan Newbold | 2021-12-07 | 1 | -3/+19
crossref persist: batch size depends on whether parsing refs | Bryan Newbold | 2021-11-04 | 1 | -1/+4
crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 1 | -2/+11
glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 1 | -2/+31
make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -76/+102
start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -2/+1
make fmt | Bryan Newbold | 2021-10-26 | 1 | -64/+87
python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -3/+4
tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 1 | -0/+4
new 'daily' and 'priority' ingest request topics | Bryan Newbold | 2021-09-30 | 1 | -1/+7
html: actually publish HTML TEI-XML to body; fix dataflow though ingest a bit | Bryan Newbold | 2020-11-04 | 1 | -0/+6
small fixes from local testing for XML ingest | Bryan Newbold | 2020-11-03 | 1 | -2/+0
persist: XML and HTML persist workers | Bryan Newbold | 2020-11-03 | 1 | -0/+47
ingest: handle publishing XML docs to kafka | Bryan Newbold | 2020-11-03 | 1 | -0/+6
refactor 'minio' to 'seaweedfs'; and BLOB env vars | Bryan Newbold | 2020-11-03 | 1 | -10/+10
better default CLI output (show usage) | Bryan Newbold | 2020-10-29 | 1 | -1/+1
persist PDF extraction in ingest pipeline | Bryan Newbold | 2020-10-20 | 1 | -4/+16
customize timeout per worker; 120sec for pdf-extract | Bryan Newbold | 2020-06-29 | 1 | -0/+1
args.kafka_env refactor didn't happen (yet) | Bryan Newbold | 2020-06-25 | 1 | -2/+2
s3-only mode persist workers use different consumer group | Bryan Newbold | 2020-06-25 | 1 | -2/+8
sandcrawler_worker: remove duplicate run_pdf_extract() | Bryan Newbold | 2020-06-25 | 1 | -29/+0
pdfextract worker | Bryan Newbold | 2020-06-25 | 1 | -1/+34
fixes and tweaks from testing locally | Bryan Newbold | 2020-06-17 | 1 | -1/+2
tweak kafka topic names and seaweedfs layout | Bryan Newbold | 2020-06-17 | 1 | -9/+10
add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -0/+83
skip-db option also for worker | Bryan Newbold | 2020-03-19 | 1 | -0/+4
ingest: bulk workers don't hit SPNv2 | Bryan Newbold | 2020-02-13 | 1 | -0/+2
pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -0/+19
sandcrawler_worker: ingest worker distinct consumer groups | Bryan Newbold | 2020-01-29 | 1 | -1/+3
make grobid-extract worker batch size 1 | Bryan Newbold | 2020-01-28 | 1 | -0/+1
improve sentry reporting with 'release' git hash | Bryan Newbold | 2020-01-15 | 1 | -1/+5
bulk ingest file request topic support | Bryan Newbold | 2020-01-14 | 1 | -1/+7
grobid-to-kafka support in ingest worker | Bryan Newbold | 2020-01-14 | 1 | -0/+6
update persist worker invocation to use batches | Bryan Newbold | 2020-01-02 | 1 | -15/+55
fix sandcrawler persist workers | Bryan Newbold | 2020-01-02 | 1 | -8/+36
start work on persist workers and tool | Bryan Newbold | 2020-01-02 | 1 | -5/+15
refactor: improve argparse usage | Bryan Newbold | 2019-12-18 | 1 | -4/+8
update ingest-file batch size to 1 | Bryan Newbold | 2019-11-14 | 1 | -1/+1
fix lint errors | Bryan Newbold | 2019-11-13 | 1 | -5/+10
correct ingest-file consumer group | Bryan Newbold | 2019-11-13 | 1 | -1/+1
add basic sandcrawler worker (kafka) | Bryan Newbold | 2019-11-13 | 1 | -0/+74