| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | crossref persist: make GROBID ref parsing an option (not default) | Bryan Newbold | 2021-11-04 | 1 | -2/+11 |
* | glue, utils, and worker code for crossref and grobid_refs | Bryan Newbold | 2021-11-04 | 1 | -2/+31 |
* | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -76/+102 |
* | start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -2/+1 |
* | make fmt | Bryan Newbold | 2021-10-26 | 1 | -64/+87 |
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -3/+4 |
* | tune SPN CDX retry/wait depending on mode (priority vs daily) | Bryan Newbold | 2021-09-30 | 1 | -0/+4 |
* | new 'daily' and 'priority' ingest request topics | Bryan Newbold | 2021-09-30 | 1 | -1/+7 |
| | The old ingest request queue was always getting lopsided; I suspect this is because it was scaled up (additional partitions) at some point in the past. Hoping new topics will fix this. New '-priority' queue is like '-bulk', but for smaller-volume SPN-like requests, e.g., interactive mode. | | | | |
* | html: actually publish HTML TEI-XML to body; fix dataflow through ingest a bit | Bryan Newbold | 2020-11-04 | 1 | -0/+6 |
* | small fixes from local testing for XML ingest | Bryan Newbold | 2020-11-03 | 1 | -2/+0 |
* | persist: XML and HTML persist workers | Bryan Newbold | 2020-11-03 | 1 | -0/+47 |
* | ingest: handle publishing XML docs to kafka | Bryan Newbold | 2020-11-03 | 1 | -0/+6 |
* | refactor 'minio' to 'seaweedfs'; and BLOB env vars | Bryan Newbold | 2020-11-03 | 1 | -10/+10 |
| | This goes along with changes to ansible deployment to use the correct key names and values. | | | | |
* | better default CLI output (show usage) | Bryan Newbold | 2020-10-29 | 1 | -1/+1 |
* | persist PDF extraction in ingest pipeline | Bryan Newbold | 2020-10-20 | 1 | -4/+16 |
| | Ooof, didn't realize that this wasn't happening. Explains a lot of missing thumbnails in scholar! | | | | |
* | customize timeout per worker; 120sec for pdf-extract | Bryan Newbold | 2020-06-29 | 1 | -0/+1 |
| | This is a stab-in-the-dark attempt to resolve long timeouts with this worker in prod. | | | | |
* | args.kafka_env refactor didn't happen (yet) | Bryan Newbold | 2020-06-25 | 1 | -2/+2 |
* | s3-only mode persist workers use different consumer group | Bryan Newbold | 2020-06-25 | 1 | -2/+8 |
* | sandcrawler_worker: remove duplicate run_pdf_extract() | Bryan Newbold | 2020-06-25 | 1 | -29/+0 |
* | pdfextract worker | Bryan Newbold | 2020-06-25 | 1 | -1/+34 |
* | fixes and tweaks from testing locally | Bryan Newbold | 2020-06-17 | 1 | -1/+2 |
* | tweak kafka topic names and seaweedfs layout | Bryan Newbold | 2020-06-17 | 1 | -9/+10 |
* | add new pdf workers/persisters | Bryan Newbold | 2020-06-17 | 1 | -0/+83 |
* | skip-db option also for worker | Bryan Newbold | 2020-03-19 | 1 | -0/+4 |
* | ingest: bulk workers don't hit SPNv2 | Bryan Newbold | 2020-02-13 | 1 | -0/+2 |
* | pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -0/+19 |
| | This is basically just a copy/paste of GROBID code, only simpler! | | | | |
* | sandcrawler_worker: ingest worker distinct consumer groups | Bryan Newbold | 2020-01-29 | 1 | -1/+3 |
| | I'm in the process of resetting these consumer groups, so might as well take the opportunity to split by topic and use the new canonical naming format. | | | | |
* | make grobid-extract worker batch size 1 | Bryan Newbold | 2020-01-28 | 1 | -0/+1 |
| | This is part of attempts to fix Kafka errors that look like they might be timeouts. | | | | |
* | improve sentry reporting with 'release' git hash | Bryan Newbold | 2020-01-15 | 1 | -1/+5 |
* | bulk ingest file request topic support | Bryan Newbold | 2020-01-14 | 1 | -1/+7 |
* | grobid-to-kafka support in ingest worker | Bryan Newbold | 2020-01-14 | 1 | -0/+6 |
* | update persist worker invocation to use batches | Bryan Newbold | 2020-01-02 | 1 | -15/+55 |
* | fix sandcrawler persist workers | Bryan Newbold | 2020-01-02 | 1 | -8/+36 |
* | start work on persist workers and tool | Bryan Newbold | 2020-01-02 | 1 | -5/+15 |
* | refactor: improve argparse usage | Bryan Newbold | 2019-12-18 | 1 | -4/+8 |
| | use ArgumentDefaultsHelpFormatter and add help messages to all sub-commands | | | | |
* | update ingest-file batch size to 1 | Bryan Newbold | 2019-11-14 | 1 | -1/+1 |
| | Was defaulting to 100, which I think was resulting in lots of consumer group timeouts, which in turn caused UNKNOWN_MEMBER_ID errors. Will probably switch back to batches of 10 or so, but with multi-processing or some other concurrent dispatch/processing. | | | | |
* | fix lint errors | Bryan Newbold | 2019-11-13 | 1 | -5/+10 |
* | correct ingest-file consumer group | Bryan Newbold | 2019-11-13 | 1 | -1/+1 |
* | add basic sandcrawler worker (kafka) | Bryan Newbold | 2019-11-13 | 1 | -0/+74 |