Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | move container_status ES query code from fatcat_web to fatcat_tools | Bryan Newbold | 2022-02-09 | 1 | -2/+2 |
| | | | | | | The main motivation is to never have fatcat_tools import from fatcat_web, only vice versa. Some code in fatcat_tools needs container stats, so starting with that code path (plus some generic helpers). | ||||
* | entity worker: expand creators in release entities | Bryan Newbold | 2021-12-15 | 1 | -1/+1 |
| | |||||
* | small default config typo fixes for elasticsearch workers | Bryan Newbold | 2021-12-15 | 1 | -2/+2 |
| | |||||
* | file elasticsearch index worker | Bryan Newbold | 2021-12-15 | 2 | -1/+35 |
| | |||||
* | typing: add assertions to fatcat_tool code to make type assumptions explicit | Bryan Newbold | 2021-11-03 | 1 | -0/+1 |
| | |||||
* | typing: add annotations to remaining fatcat_tools code | Bryan Newbold | 2021-11-03 | 3 | -51/+70 |
| | | | | | Again, these are just annotations, no changes made to get type checks to pass | ||||
* | re-fix some lint issues after big 'fmt' | Bryan Newbold | 2021-11-02 | 1 | -2/+3 |
| | |||||
* | fmt (black): fatcat_tools/ | Bryan Newbold | 2021-11-02 | 3 | -196/+263 |
| | |||||
* | python: isort everything | Bryan Newbold | 2021-11-02 | 1 | -1/+2 |
| | |||||
* | hacks to work around new pylint false positives | Bryan Newbold | 2021-11-02 | 1 | -2/+3 |
| | |||||
* | cleanup imports after fatcat_tools.transforms change | Bryan Newbold | 2021-11-02 | 1 | -5/+8 |
| | |||||
* | re-fmt all the fatcat_tools __init__ files for readability | Bryan Newbold | 2021-11-02 | 1 | -3/+6 |
| | |||||
* | changelog worker: fix file/fileset typo, caught by lint | Bryan Newbold | 2021-05-25 | 1 | -1/+1 |
| | | | | | This would have resulted in some releases not getting re-indexed into search. | ||||
* | es worker: ensure kafka messages get cleared | Bryan Newbold | 2021-04-12 | 1 | -0/+2 |
| | |||||
* | es indexing: more 'wip' fixes | Bryan Newbold | 2021-04-12 | 1 | -1/+5 |
| | |||||
* | ES indexing: skip 'wip' entities with a warning | Bryan Newbold | 2021-04-12 | 1 | -11/+16 |
| | |||||
* | container ES index worker: support for querying status | Bryan Newbold | 2021-04-06 | 1 | -5/+32 |
| | |||||
* | indexing: don't use document names | Bryan Newbold | 2021-04-06 | 1 | -14/+4 |
| | |||||
* | entity update worker: treat fileset and webcapture updates like file updates | Bryan Newbold | 2020-12-16 | 1 | -3/+25 |
| | | | | | | | | | When webcapture or fileset entities are updated, then the release entities associated with them also need to be updated (and work entities, recursively). A TODO is to handle the case where a release_id is *removed* as well as *added*, and reprocess the releases in that case as well. | ||||
* | entity updates: don't ingest JSTOR DOI prefixes | Bryan Newbold | 2020-10-23 | 1 | -0/+2 |
| | |||||
* | entity updater: new work update feed (ident and changelog metadata only) | Bryan Newbold | 2020-10-16 | 1 | -2/+24 |
| | |||||
* | ingest: default to crawl protocols.io DOIs | Bryan Newbold | 2020-09-10 | 1 | -0/+2 |
| | |||||
* | entity updater: handle doi=None case better | Bryan Newbold | 2020-08-14 | 1 | -1/+1 |
| | |||||
* | entity updater: es['publisher_type'] not always set | Bryan Newbold | 2020-08-14 | 1 | -1/+1 |
| | | | | This is a small bugfix for a production issue. | ||||
* | entity update: change big5 ingest behavior | Bryan Newbold | 2020-08-11 | 1 | -9/+15 |
| | | | | | | | | In addition to changing the OA default, this was the main intended behavior change in this group of commits: we want to make fewer ingest attempts that we *expect* to fail, but default to an ingest/crawl attempt if we are uncertain. This is because there is a long tail of journals that register DOIs and are de facto OA (fulltext is available), but we don't have metadata indicating them as such. | ||||
* | entity update: default to ingest non-OA works | Bryan Newbold | 2020-08-11 | 1 | -9/+10 |
| | |||||
* | entity update: skip ingest of figshare+zenodo 'group' DOIs | Bryan Newbold | 2020-08-11 | 1 | -0/+15 |
| | |||||
* | update crawl blocklist for SPNv2 requests which mostly fail | Bryan Newbold | 2020-08-10 | 1 | -2/+10 |
| | |||||
* | lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 3 | -12/+0 |
| | |||||
* | more changelog ES fixes | Bryan Newbold | 2020-04-17 | 1 | -4/+6 |
| | |||||
* | ES changelog worker: fixes for ident; fetch update from API if needed | Bryan Newbold | 2020-04-17 | 1 | -2/+9 |
| | | | | | The API fetch update may be needed for old changelog entries in the kafka feed. | ||||
* | Merge branch 'martin-changelog-to-es' into 'master' | bnewbold | 2020-04-17 | 2 | -2/+23 |
|\ | | | | | | | | | derive changelog worker from release worker See merge request webgroup/fatcat!43 | ||||
| * | derive changelog worker from release worker | Martin Czygan | 2020-04-17 | 2 | -2/+23 |
| | | | | | | | | | | Early versions of changelog entries may not have all the fields required for the current transform. | ||||
* | | changelog: limit types | Martin Czygan | 2020-04-16 | 1 | -5/+1 |
| | | | | | | | | | Exclude partial docs (e.g. abstract), overly generic components and entries, and HTML blogs. | ||||
* | | changelog: extend release_types considered documents | Martin Czygan | 2020-04-16 | 1 | -10/+19 |
|/ | | | according to release_rev.release_type, we have 29 values: `fatcat_prod=# select release_type, count(release_type) from release_rev group by release_type;` returned (release_type: count): abstract: 2264, article: 6371076, article-journal: 101083841, article-newspaper: 17062, book: 1676941, chapter: 13914854, component: 58990, dataset: 6860325, editorial: 133573, entry: 1628487, graphic: 1809471, interview: 19898, legal_case: 3581, legislation: 1626, letter: 275119, paper-conference: 6074669, peer_review: 30581, post: 245807, post-weblog: 135, report: 1010699, retraction: 1292, review-book: 96219, software: 316, song: 24027, speech: 4263, standard: 312364, stub: 1036813, thesis: 414397, (empty): 0 (29 rows) | ||||
* | ingest: more DOI patterns to treat as OA | Bryan Newbold | 2020-03-28 | 1 | -0/+26 |
| | | | | | | | These are journal/publisher patterns which we suspect to actually be OA based on the large quantity of papers that crawl successfully. The better long-term solution will be to flag containers in some way as OA (or "should crawl"), but this is a good short-term solution. | ||||
* | ingest: always try some lancet journals | Bryan Newbold | 2020-03-19 | 1 | -0/+3 |
| | |||||
* | entity worker: ingest more releases | Bryan Newbold | 2020-02-22 | 1 | -1/+37 |
| | | | | | | | | If release is a dataset or image, don't do a pdf ingest request. If release is a datacite DOI, and release_type is a "document", crawl regardless of is_oa detection. This is mostly to crawl repositories (institutional or subject). | ||||
* | always crawl researchgate DOIs | Bryan Newbold | 2020-02-18 | 1 | -0/+2 |
| | | | | Now that ingest is fixed | ||||
* | add acceptlist override for biorxiv/medrxiv | Bryan Newbold | 2020-02-10 | 1 | -2/+12 |
| | |||||
* | fix KafkaError worker reporting for partition errors | Bryan Newbold | 2020-01-29 | 2 | -2/+2 |
| | |||||
* | additional DOI prefix filters | Bryan Newbold | 2020-01-28 | 1 | -0/+8 |
| | | | | From martin, thanks. | ||||
* | apply ingest request filtering in entity worker | Bryan Newbold | 2020-01-28 | 1 | -3/+34 |
| | | | | | | | `ingest_oa_only` behavior, and other filters, now handled in the entity update worker, instead of in the transform function. Also add a DOI prefix blocklist feature. | ||||
* | update ingest request schema | Bryan Newbold | 2019-12-13 | 1 | -1/+1 |
| | | | | | This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups. | ||||
* | project -> ingest_request_source | Bryan Newbold | 2019-11-15 | 1 | -1/+1 |
| | |||||
* | add ingest request feature to entity_updates worker | Bryan Newbold | 2019-11-15 | 1 | -4/+20 |
| | | | | | | | | | | | | | Initially was going to create a new worker to consume from the release update channel, but couldn't get the edit context ("is this a new release, or update to an existing") from that context. Currently there is a flag in source code to control whether we only do OA releases or all releases. Starting with OA only to start slow, but should probably default to all, and make this a config flag. Should probably also have a config flag to control this entire feature. Tested locally in dev. | ||||
* | review/fix all confluent-kafka produce code | Bryan Newbold | 2019-09-20 | 2 | -12/+26 |
| | |||||
* | small fixes to confluent-kafka importers/workers | Bryan Newbold | 2019-09-20 | 3 | -12/+41 |
| | | | | | | | | - decrease default changelog pipeline to 5.0sec - fix missing KafkaException harvester imports - more confluent-kafka tweaks - updates to kafka consumer configs - bump elastic updates consumergroup (again) | ||||
* | convert pipeline workers from pykafka to confluent-kafka | Bryan Newbold | 2019-09-20 | 3 | -125/+230 |
| | |||||
* | refactor all python source for client lib name | Bryan Newbold | 2019-09-05 | 2 | -3/+3 |
| |
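
Several of the commit messages above describe how the entity update worker decides whether to emit an ingest request: skip datasets and images, honor a DOI prefix blocklist, crawl datacite "document" releases regardless of OA detection, and otherwise apply the `ingest_oa_only` setting. The following is a minimal sketch of that decision order, not the actual fatcat_tools code. The function name `want_ingest`, the dict field names, the `DOCUMENT_TYPES` set, the use of `graphic` to stand in for images, and the blocklist entry are all assumptions made for illustration.

```python
from typing import Optional

# All values below are hypothetical placeholders, not taken from fatcat_tools.
DOI_PREFIX_BLOCKLIST = ["10.9999/"]
NON_PDF_TYPES = {"dataset", "graphic"}  # assumption: "graphic" stands in for images
DOCUMENT_TYPES = {"article-journal", "paper-conference", "report", "thesis"}


def want_ingest(release: dict, ingest_oa_only: bool = False) -> bool:
    """Decide whether a release should get a PDF ingest request (sketch)."""
    doi: Optional[str] = release.get("doi")
    release_type = release.get("release_type")

    # Datasets and images are not PDF ingest candidates.
    if release_type in NON_PDF_TYPES:
        return False

    # Skip DOI prefixes that are expected to fail; doi may be None.
    if doi and any(doi.startswith(prefix) for prefix in DOI_PREFIX_BLOCKLIST):
        return False

    # Datacite "document" releases are crawled regardless of OA detection.
    # The "doi_registrar" field name is an assumption for this sketch.
    if release.get("doi_registrar") == "datacite" and release_type in DOCUMENT_TYPES:
        return True

    # Otherwise apply the OA-only setting; default to attempting the crawl.
    if ingest_oa_only and not release.get("is_oa"):
        return False
    return True
```

Placing the blocklist check ahead of the datacite override is a choice made for this sketch; the commit messages do not state the real precedence between the two rules.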