path: root/python/fatcat_tools/workers
Commit message | Author | Date | Files | Lines
* ingest: always try some lancet journals | Bryan Newbold | 2020-03-19 | 1 | -0/+3
* entity worker: ingest more releases | Bryan Newbold | 2020-02-22 | 1 | -1/+37
    If release is a dataset or image, don't do a pdf ingest request. If release is a
    datacite DOI, and release_type is a "document", crawl regardless of is_oa detection.
    This is mostly to crawl repositories (institutional or subject).
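    A minimal sketch of the filtering described above, assuming the release object
    follows the fatcat release schema and that `is_datacite` and `is_oa` flags are
    computed elsewhere; this is illustrative, not the actual EntityUpdatesWorker code.

        def want_pdf_ingest(release, is_datacite: bool, is_oa: bool) -> bool:
            # datasets and images have no PDF to crawl
            if release.release_type in ("dataset", "image"):
                return False
            # datacite DOIs of "document" type get crawled regardless of OA detection,
            # mostly to cover institutional and subject repositories
            if is_datacite and release.release_type == "document":
                return True
            # otherwise fall back to the OA-only heuristic
            return is_oa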
* always crawl researchgate DOIs | Bryan Newbold | 2020-02-18 | 1 | -0/+2
    Now that ingest is fixed.
* add acceptlist override for biorxiv/medrxiv | Bryan Newbold | 2020-02-10 | 1 | -2/+12
* fix KafkaError worker reporting for partition errors | Bryan Newbold | 2020-01-29 | 2 | -2/+2
* additional DOI prefix filters | Bryan Newbold | 2020-01-28 | 1 | -0/+8
    From martin, thanks.
* apply ingest request filtering in entity worker | Bryan Newbold | 2020-01-28 | 1 | -3/+34
    `ingest_oa_only` behavior, and other filters, now handled in the entity update
    worker, instead of in the transform function. Also add a DOI prefix blocklist
    feature.
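    A hedged sketch of the shape of this filtering; the blocklist value is a placeholder
    and the function name is made up, since the real filters live in the entity update
    worker.

        INGEST_DOI_PREFIX_BLOCKLIST = [
            "10.1234/",  # placeholder prefix; the real blocklist is maintained in the worker
        ]

        def want_ingest(release, ingest_oa_only: bool, is_oa: bool) -> bool:
            doi = release.ext_ids.doi
            if doi:
                for prefix in INGEST_DOI_PREFIX_BLOCKLIST:
                    if doi.startswith(prefix):
                        return False
            if ingest_oa_only and not is_oa:
                return False
            return True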
* update ingest request schema | Bryan Newbold | 2019-12-13 | 1 | -1/+1
    This is mostly changing ingest_type from 'file' to 'pdf', and adding
    'link_source'/'link_source_id', plus some small cleanups.
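    An illustrative example of an ingest request after this schema change; the DOI and
    identifier values are made up, and any fields beyond the ones named in the commit
    are assumptions about the broader schema.

        ingest_request = {
            "ingest_type": "pdf",                  # previously "file"
            "ingest_request_source": "fatcat-changelog",
            "link_source": "doi",                  # new field
            "link_source_id": "10.1234/example",   # new field (made-up DOI)
            "base_url": "https://doi.org/10.1234/example",
            "fatcat": {"release_ident": "aaaaaaaaaaaaaaaaaaaaaaaaaa"},
        }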
* project -> ingest_request_source | Bryan Newbold | 2019-11-15 | 1 | -1/+1
* add ingest request feature to entity_updates worker | Bryan Newbold | 2019-11-15 | 1 | -4/+20
    Initially was going to create a new worker to consume from the release update
    channel, but couldn't get the edit context ("is this a new release, or update to an
    existing") from that context.

    Currently there is a flag in source code to control whether we only do OA releases
    or all releases. Starting with OA only to start slow, but should probably default
    to all, and make this a config flag. Should probably also have a config flag to
    control this entire feature.

    Tested locally in dev.
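    A sketch of the shape of this feature, not the actual worker code: when the
    changelog contains a newly-created release that passes the OA check, produce a JSON
    ingest request to Kafka. The flag name is an assumption; `producer` is a
    confluent-kafka Producer and `request` is a dict like the example shown further up.

        import json

        INGEST_OA_ONLY = True  # in-source flag mentioned in the commit message

        def maybe_emit_ingest_request(producer, topic, request, is_new: bool, is_oa: bool):
            # only newly-created releases get an ingest request
            if not is_new:
                return
            if INGEST_OA_ONLY and not is_oa:
                return
            producer.produce(topic, json.dumps(request).encode("utf-8"))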
* review/fix all confluent-kafka produce code | Bryan Newbold | 2019-09-20 | 2 | -12/+26
* small fixes to confluent-kafka importers/workers | Bryan Newbold | 2019-09-20 | 3 | -12/+41
    - decrease default changelog pipeline to 5.0sec
    - fix missing KafkaException harvester imports
    - more confluent-kafka tweaks
    - updates to kafka consumer configs
    - bump elastic updates consumergroup (again)
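    A hedged example of a confluent-kafka consumer setup in the spirit of these tweaks;
    the broker, group, topic, and specific settings are placeholders rather than the
    workers' real configuration.

        from confluent_kafka import Consumer

        consumer = Consumer({
            "bootstrap.servers": "localhost:9092",        # placeholder broker
            "group.id": "fatcat-elasticsearch-updates2",  # a "bumped" group name (made up)
            "enable.auto.commit": True,
            "auto.offset.reset": "latest",
            "max.poll.interval.ms": 300000,
        })
        consumer.subscribe(["fatcat-prod.release-updates"])  # placeholder topic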
* convert pipeline workers from pykafka to confluent-kafka | Bryan Newbold | 2019-09-20 | 3 | -125/+230
* refactor all python source for client lib name | Bryan Newbold | 2019-09-05 | 2 | -3/+3
* start new ES container worker kafka group | Bryan Newbold | 2019-07-31 | 1 | -0/+2
    The previous group seems to have gotten corrupted; my hypothesis is that this is
    due to pykafka being somewhat flakey, and am planning to move to librdkafka anyways.

    Re-indexing all the containers is pretty small/easy, so starting a new consumer
    group works fine in this case; release indexer would be a bigger problem.
* fix typo in typo | Bryan Newbold | 2019-06-24 | 1 | -1/+1
* fix typo in changelog worker | Bryan Newbold | 2019-06-24 | 1 | -1/+1
* more links on new homepage | Bryan Newbold | 2019-06-19 | 2 | -2/+2
    matching produce sizes. may want to tweak this config in the future for throughput.
* fix and workaround container entities in release topic | Bryan Newbold | 2019-05-30 | 2 | -2/+8
* fix syntax bugs (container elastic worker) | Bryan Newbold | 2019-05-30 | 1 | -5/+5
* add container update elastic worker | Bryan Newbold | 2019-05-30 | 2 | -6/+26
* file and container update kafka topics | Bryan Newbold | 2019-05-30 | 1 | -54/+69
* update elastic for releases when files added | Bryan Newbold | 2019-05-30 | 1 | -1/+36
    A bunch of remaining TODOs here.
* 10 MByte default Kafka produce (workers) | Bryan Newbold | 2019-03-06 | 2 | -2/+9
* elastic-release worker w/o API | Bryan Newbold | 2019-03-04 | 1 | -4/+4
    Forgot that this worker really doesn't want/need any API connection at all; just an
    ApiClient to deserialize objects from Kafka.
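    A sketch of the pattern this commit describes: the worker holds no API connection,
    only an ApiClient instance so that JSON messages from Kafka can be deserialized into
    typed entity objects. The helper name `entity_from_json` and its signature are
    assumptions about fatcat_tools, and the client library name reflects its later
    renaming.

        from fatcat_openapi_client import ApiClient, ReleaseEntity
        from fatcat_tools import entity_from_json  # assumed helper name/signature

        # no host or auth configured: used purely for (de)serialization
        api_client = ApiClient()

        def release_from_kafka_msg(msg) -> ReleaseEntity:
            return entity_from_json(msg.value().decode("utf-8"), ReleaseEntity,
                                    api_client=api_client)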
* fix elastic research worker api arg | Bryan Newbold | 2019-03-04 | 1 | -4/+3
* bunch of lint/whitespace cleanups | Bryan Newbold | 2019-02-22 | 2 | -4/+3
* fatcat -> fatcat_release ES index | Bryan Newbold | 2019-01-28 | 1 | -2/+3
* include filesets and webcaptures in exports | Bryan Newbold | 2019-01-18 | 1 | -1/+1
* Merge branch 'bnewbold-crude-auth' | Bryan Newbold | 2019-01-08 | 2 | -9/+7
    Fixed a conflict in: python/fatcat_export.py
* workers do API-passing (not URI-passing) | Bryan Newbold | 2019-01-08 | 2 | -9/+7
* check request status codes idiomatically | Bryan Newbold | 2018-12-29 | 1 | -1/+1
* not as strong a todo (timestamps) | Bryan Newbold | 2018-11-19 | 1 | -1/+1
* bunch of pylint cleanup | Bryan Newbold | 2018-11-15 | 1 | -1/+1
* large refactor of python names/paths | Bryan Newbold | 2018-11-15 | 3 | -17/+22
    - Add __init__.py files for fatcat_tools submodules, and use them in imports
    - Add a bunch of comments to files
    - Rename a number of classes and functions to be less verbose
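    An illustrative example of the kind of package-level re-exports this refactor
    introduces; the module and class names here are guesses at the shape, not the exact
    contents of fatcat_tools/workers/__init__.py.

        # fatcat_tools/workers/__init__.py
        from .changelog import ChangelogWorker, EntityUpdatesWorker
        from .elasticsearch import ElasticsearchReleaseWorker

        # which lets callers import directly from the subpackage:
        #   from fatcat_tools.workers import ChangelogWorker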
* have recent message helper cleanup consumer | Bryan Newbold | 2018-11-15 | 1 | -1/+5
* fix worker code | Bryan Newbold | 2018-11-14 | 2 | -2/+5
* most_recent_message as reusable function | Bryan Newbold | 2018-11-14 | 2 | -26/+26
* switch to auto consumer offset updates | Bryan Newbold | 2018-11-13 | 2 | -2/+11
    This is the classic/correct way to do consumer group updates for higher throughput,
    when "at least once" semantics are acceptable (as they are here; double processing
    should be safe/fine).
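    A hedged sketch of what automatic offset commits look like with pykafka, the client
    library these workers used at the time; broker, topic, group, and interval values
    are placeholders.

        from pykafka import KafkaClient

        client = KafkaClient(hosts="localhost:9092")      # placeholder broker
        topic = client.topics[b"fatcat-prod.changelog"]   # placeholder topic
        consumer = topic.get_balanced_consumer(
            consumer_group=b"elasticsearch-updates",      # placeholder group
            auto_commit_enable=True,        # commit offsets periodically in the background
            auto_commit_interval_ms=30000,  # at-least-once: a crash may replay recent messages
        )
        for message in consumer:
            if message is not None:
                # processing must tolerate occasional duplicates (at-least-once)
                print(message.offset, len(message.value))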
* to_elastic_dict -> release_elastic_dict | Bryan Newbold | 2018-11-13 | 1 | -1/+2
* more simple fatcat_client imports | Bryan Newbold | 2018-11-13 | 1 | -1/+1
* shuffle around fatcat_tools layout | Bryan Newbold | 2018-11-13 | 3 | -0/+194