Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | ingest: more DOI patterns to treat as OA | Bryan Newbold | 2020-03-28 | 1 | -0/+26 |
| | | | | | | | These are journal/publisher patterns which we suspect to actually be OA based on the large quantity of papers that crawl successfully. The better long-term solution will be to flag containers in some way as OA (or "should crawl"), but this is a good short-term solution. | ||||
* | ingest: always try some lancet journals | Bryan Newbold | 2020-03-19 | 1 | -0/+3 |
| | |||||
* | entity worker: ingest more releases | Bryan Newbold | 2020-02-22 | 1 | -1/+37 |
| | | | | | | | | If release is a dataset or image, don't do a pdf ingest request. If release is a datacite DOI, and release_type is a "document", crawl regardless of is_oa detection. This is mostly to crawl repositories (institutional or subject). | ||||
* | always crawl researchgate DOIs | Bryan Newbold | 2020-02-18 | 1 | -0/+2 |
| | | | | Now that ingest is fixed | ||||
* | add acceptlist override for biorxiv/medrxiv | Bryan Newbold | 2020-02-10 | 1 | -2/+12 |
| | |||||
* | fix KafkaError worker reporting for partition errors | Bryan Newbold | 2020-01-29 | 2 | -2/+2 |
| | |||||
* | additional DOI prefix filters | Bryan Newbold | 2020-01-28 | 1 | -0/+8 |
| | | | | From martin, thanks. | ||||
* | apply ingest request filtering in entity worker | Bryan Newbold | 2020-01-28 | 1 | -3/+34 |
| | | | | | | | `ingest_oa_only` behavior, and other filters, now handled in the entity update worker, instead of in the transform function. Also add a DOI prefix blocklist feature. | ||||
* | update ingest request schema | Bryan Newbold | 2019-12-13 | 1 | -1/+1 |
| | | | | | This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups. | ||||
* | project -> ingest_request_source | Bryan Newbold | 2019-11-15 | 1 | -1/+1 |
| | |||||
* | add ingest request feature to entity_updates worker | Bryan Newbold | 2019-11-15 | 1 | -4/+20 |
| | | | | | | | | | | | | | Initially was going to create a new worker to consume from the release update channel, but couldn't get the edit context ("is this a new release, or update to an existing") from that context. Currently there is a flag in source code to control whether we only do OA releases or all releases. Starting with OA only to start slow, but should probably default to all, and make this a config flag. Should probably also have a config flag to control this entire feature. Tested locally in dev. | ||||
* | review/fix all confluent-kafka produce code | Bryan Newbold | 2019-09-20 | 2 | -12/+26 |
| | |||||
* | small fixes to confluent-kafka importers/workers | Bryan Newbold | 2019-09-20 | 3 | -12/+41 |
| | | | | | | | | - decrease default changelog pipeline to 5.0sec - fix missing KafkaException harvester imports - more confluent-kafka tweaks - updates to kafka consumer configs - bump elastic updates consumergroup (again) | ||||
* | convert pipeline workers from pykafka to confluent-kafka | Bryan Newbold | 2019-09-20 | 3 | -125/+230 |
| | |||||
* | refactor all python source for client lib name | Bryan Newbold | 2019-09-05 | 2 | -3/+3 |
| | |||||
* | start new ES container worker kafka group | Bryan Newbold | 2019-07-31 | 1 | -0/+2 |
| | | | | | | | The previous group seems to have gotten corrupted; my hypothesis is that this is due to pykafka being somewhat flaky, and I am planning to move to librdkafka anyway. Re-indexing all the containers is pretty small/easy, so starting a new consumer group works fine in this case; release indexer would be a bigger problem. | ||||
* | fix typo in typo | Bryan Newbold | 2019-06-24 | 1 | -1/+1 |
| | |||||
* | fix typo in changelog worker | Bryan Newbold | 2019-06-24 | 1 | -1/+1 |
| | |||||
* | more links on new homepage | Bryan Newbold | 2019-06-19 | 2 | -2/+2 |
| | | | | | matching produce sizes. may want to tweak this config in the future for throughput. | ||||
* | fix and workaround container entities in release topic | Bryan Newbold | 2019-05-30 | 2 | -2/+8 |
| | |||||
* | fix syntax bugs (container elastic worker) | Bryan Newbold | 2019-05-30 | 1 | -5/+5 |
| | |||||
* | add container update elastic worker | Bryan Newbold | 2019-05-30 | 2 | -6/+26 |
| | |||||
* | file and container update kafka topics | Bryan Newbold | 2019-05-30 | 1 | -54/+69 |
| | |||||
* | update elastic for releases when files added | Bryan Newbold | 2019-05-30 | 1 | -1/+36 |
| | | | | A bunch of remaining TODOs here | ||||
* | 10 MByte default Kafka produce (workers) | Bryan Newbold | 2019-03-06 | 2 | -2/+9 |
| | |||||
* | elastic-release worker w/o API | Bryan Newbold | 2019-03-04 | 1 | -4/+4 |
| | | | | | Forgot that this worker really doesn't want/need any API connection at all; just an ApiClient to deserialize objects from Kafka. | ||||
* | fix elastic research worker api arg | Bryan Newbold | 2019-03-04 | 1 | -4/+3 |
| | |||||
* | bunch of lint/whitespace cleanups | Bryan Newbold | 2019-02-22 | 2 | -4/+3 |
| | |||||
* | fatcat -> fatcat_release ES index | Bryan Newbold | 2019-01-28 | 1 | -2/+3 |
| | |||||
* | include filesets and webcaptures in exports | Bryan Newbold | 2019-01-18 | 1 | -1/+1 |
| | |||||
* | Merge branch 'bnewbold-crude-auth' | Bryan Newbold | 2019-01-08 | 2 | -9/+7 |
|\ | | | | | | | | | Fixed a conflict in: python/fatcat_export.py | ||||
| * | workers do API-passing (not URI-passing) | Bryan Newbold | 2019-01-08 | 2 | -9/+7 |
| | | |||||
* | | check request status codes idiomatically | Bryan Newbold | 2018-12-29 | 1 | -1/+1 |
|/ | |||||
* | not as strong a todo (timestamps) | Bryan Newbold | 2018-11-19 | 1 | -1/+1 |
| | |||||
* | bunch of pylint cleanup | Bryan Newbold | 2018-11-15 | 1 | -1/+1 |
| | |||||
* | large refactor of python names/paths | Bryan Newbold | 2018-11-15 | 3 | -17/+22 |
| | | | | | | | - Add __init__.py files for fatcat_tools submodules, and use them in imports - Add a bunch of comments to files. - rename a number of classes and functions to be less verbose | ||||
* | have recent message helper cleanup consumer | Bryan Newbold | 2018-11-15 | 1 | -1/+5 |
| | |||||
* | fix worker code | Bryan Newbold | 2018-11-14 | 2 | -2/+5 |
| | |||||
* | most_recent_message as reusable function | Bryan Newbold | 2018-11-14 | 2 | -26/+26 |
| | |||||
* | switch to auto consumer offset updates | Bryan Newbold | 2018-11-13 | 2 | -2/+11 |
| | | | | This is the classic/correct way to do consumer group updates for higher throughput, when "at least once" semantics are acceptable (as they are here; double processing should be safe/fine). | ||||
* | to_elastic_dict -> release_elastic_dict | Bryan Newbold | 2018-11-13 | 1 | -1/+2 |
| | |||||
* | more simple fatcat_client imports | Bryan Newbold | 2018-11-13 | 1 | -1/+1 |
| | |||||
* | shuffle around fatcat_tools layout | Bryan Newbold | 2018-11-13 | 3 | -0/+194 |
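
The ingest filtering rules described in the 2020-01-28 and 2020-02-22 commit messages above (skip datasets and images, apply a DOI prefix blocklist, crawl datacite "document" releases regardless of OA status, otherwise honor `ingest_oa_only`) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name, the `doi_registrar` parameter, the blocklist entry, the set of "document" release types, and the order of the checks are all assumptions.

```python
from typing import Optional

# Hypothetical example values; the real lists live in the project source.
DOI_PREFIX_BLOCKLIST = ("10.0000/",)
NON_PDF_RELEASE_TYPES = {"dataset", "image"}
# Assumption: which release_type values count as a "document".
DOCUMENT_RELEASE_TYPES = {"article-journal", "report", "thesis", "book"}


def should_request_pdf_ingest(
    release_type: Optional[str],
    doi: Optional[str],
    is_oa: bool,
    doi_registrar: Optional[str] = None,  # hypothetical: how the DOI source is known
    ingest_oa_only: bool = True,
) -> bool:
    """Decide whether a release should get a PDF ingest request."""
    # Datasets and images have no PDF to fetch.
    if release_type in NON_PDF_RELEASE_TYPES:
        return False
    # Skip DOIs whose prefix is on the blocklist.
    if doi and doi.lower().startswith(DOI_PREFIX_BLOCKLIST):
        return False
    # Datacite "document" releases are crawled regardless of OA detection.
    if doi and doi_registrar == "datacite" and release_type in DOCUMENT_RELEASE_TYPES:
        return True
    # Otherwise, in OA-only mode, require the release to be detected as OA.
    if ingest_oa_only and not is_oa:
        return False
    return True
```

Keeping the decision in one pure function makes each rule easy to test in isolation; the worker would call it once per updated release before producing an ingest request.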