Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | pubmed: reconnect on error | Martin Czygan | 2021-07-16 | 1 | -4/+30 |
| | | | | | | | | | FTP retrieval would run but fail with EOFError on /pubmed/updatefiles/pubmed21n1328_stats.html. We could not find the root cause; with a fresh client, the exact same file would work just fine. So when we retry, we reconnect on failure. Refs: sentry #91102. | ||||
* | small python lint fixes (no behavior change) | Bryan Newbold | 2021-05-25 | 1 | -1/+1 |
| | |||||
* | harvest: datacite API yields HTTP 200 with broken JSON | Martin Czygan | 2020-08-10 | 1 | -1/+8 |
| | | | | As a first step: log response body for debugging. | ||||
* | arxiv: do retry five times of HTTP 503 | Martin Czygan | 2020-07-10 | 1 | -1/+1 |
| | |||||
* | lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 4 | -19/+6 |
| | |||||
* | harvest: fail on HTTP 400 | Martin Czygan | 2020-05-29 | 1 | -4/+0 |
| | | | | | | | | | In the past, harvesting datacite resulted in occasional HTTP 400 responses. Meanwhile, various API bugs have been fixed (most recently: https://github.com/datacite/lupo/pull/537, https://github.com/datacite/datacite/issues/1038). The downside of ignoring this error was that state lives in kafka, which has limited support for deleting arbitrary messages from a topic. | ||||
* | rename HarvestState.next() to HarvestState.next_span() | Bryan Newbold | 2020-05-26 | 4 | -5/+5 |
| | | | | | | | | | "span" is short for "timespan" to harvest; there may be a better name to use. The motivation for this is to work around a pylint error that .next() was not callable. This might be a bug with pylint, but .next() is also a very generic name. | ||||
* | HACK: skip pylint errors on lines that seem to be fine | Bryan Newbold | 2020-05-22 | 3 | -3/+3 |
| | | | | | It seems to be an inadvertently upgraded version of pylint saying that these lines are not-callable. | ||||
* | crossref: switch from index-date to update-date | Bryan Newbold | 2020-03-30 | 1 | -1/+1 |
| | | | | | | This goes against what the API docs recommend, but we are currently far behind on updates and need to catch up. Other than what the docs say, this seems to be consistent with the behavior we want. | ||||
* | crossref: longer comment about crossref API date fields | Bryan Newbold | 2020-03-30 | 1 | -2/+22 |
| | |||||
* | Merge pull request #53 from EdwardBetts/spelling | bnewbold | 2020-03-27 | 1 | -2/+2 |
|\ | | | | | Correct spelling mistakes | ||||
| * | Correct spelling mistakes | Edward Betts | 2020-03-27 | 1 | -2/+2 |
| | | |||||
* | | pubmed: log to stderr | Martin Czygan | 2020-03-10 | 1 | -1/+1 |
| | | |||||
* | | pubmed: move mapping generation out of fetch_date | Martin Czygan | 2020-03-10 | 1 | -7/+8 |
| | | | | | | | | | | * fetch_date will fail on missing mapping * adjust tests (test will require access to pubmed ftp) | ||||
* | | harvest: fix imports from HarvestPubmedWorker cleanup | Martin Czygan | 2020-03-10 | 1 | -2/+2 |
| | | |||||
* | | pubmed: citations is a bit more precise | Martin Czygan | 2020-03-09 | 1 | -1/+1 |
| | | | | | | | | | | > Each day, NLM produces update files that include new, revised and deleted citations. -- ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt | ||||
* | | pubmed: we sync from FTP | Martin Czygan | 2020-03-09 | 1 | -1/+1 |
| | | |||||
* | | oaipmh: HarvestPubmedWorker obsoleted by PubmedFTPWorker | Martin Czygan | 2020-03-09 | 1 | -34/+0 |
| | | |||||
* | | more pubmed adjustments | Martin Czygan | 2020-02-22 | 2 | -70/+118 |
| | | | | | | | | | | * regenerate map in continuous mode * add tests | ||||
* | | pubmed ftp: fix url | Martin Czygan | 2020-02-19 | 1 | -4/+6 |
| | | |||||
* | | pubmed ftp harvest and KafkaBs4XmlPusher | Martin Czygan | 2020-02-19 | 2 | -0/+214 |
|/ | | | | | | | * add PubmedFTPWorker * utils are currently stored alongside pubmed (e.g. ftpretr, xmlstream) but may live elsewhere, as they are more generic * add KafkaBs4XmlPusher | ||||
* | harvest: log state on startup and use stderr for diagnostics | Martin Czygan | 2020-02-14 | 3 | -17/+22 |
| | |||||
* | datacite: extend range search query | Martin Czygan | 2019-12-27 | 1 | -1/+1 |
| | | | | | The bracket syntax is inclusive. See also: https://www.elastic.co/guide/en/elasticsearch/reference/7.5/query-dsl-query-string-query.html#_ranges | ||||
* | avoid usage of short links | Martin Czygan | 2019-12-27 | 1 | -2/+2 |
| | |||||
* | Datacite API v2 throws 400, we cannot recover from, currently. | Martin Czygan | 2019-12-27 | 1 | -0/+4 |
| | | | | | | | | | | As a first iteration, just mark the daily batch complete and continue. The occasional HTTP 400 issue has been reported as https://github.com/datacite/datacite/issues/897. A possible improvement would be to shrink the window, so losses will be smaller. | ||||
* | datacite: update documentation, add links to issues | Martin Czygan | 2019-12-27 | 1 | -10/+5 |
| | |||||
* | datacite: use v2 of the API (flaky) | Martin Czygan | 2019-12-27 | 1 | -5/+28 |
| | | | | | | | | | Update parameter update for datacite API v2. Works fine, but there are occasional HTTP 400 responses when using the cursor API (daily updates can exceed the 10000 record limit for search queries). The HTTP 400 issue is not solved yet, but reported to datacite as https://github.com/datacite/datacite/issues/897. | ||||
* | refactor kafka producer in crossref harvester | Bryan Newbold | 2019-12-06 | 1 | -21/+26 |
| | | | | | | | Producer creation/configuration should happen at __init__() time, not on each 'daily' call. This specific refactor was motivated by mocking out the producer in unit tests. | ||||
* | crossref is_update isn't what I thought | Bryan Newbold | 2019-12-03 | 1 | -6/+2 |
| | | | | | | | | I thought this would filter for metadata updates to an existing DOI, but actually "updates" are a type of DOI (eg, a retraction). TODO: handle 'updates' field. Should both do a lookup and set work_ident appropriately, and store in crossref-specific metadata. | ||||
* | review/fix all confluent-kafka produce code | Bryan Newbold | 2019-09-20 | 3 | -14/+49 |
| | |||||
* | small fixes to confluent-kafka importers/workers | Bryan Newbold | 2019-09-20 | 2 | -2/+2 |
| | | | | | | | | - decrease default changelog pipeline to 5.0sec - fix missing KafkaException harvester imports - more confluent-kafka tweaks - updates to kafka consumer configs - bump elastic updates consumergroup (again) | ||||
* | small kafka tweaks for robustness | Bryan Newbold | 2019-09-20 | 1 | -0/+2 |
| | |||||
* | bump max message size to ~20 MBytes | Bryan Newbold | 2019-09-20 | 2 | -0/+2 |
| | |||||
* | fixes to confluent-kafka harvesters | Bryan Newbold | 2019-09-20 | 3 | -20/+21 |
| | |||||
* | first draft harvesters using confluent-kafka | Bryan Newbold | 2019-09-20 | 3 | -48/+104 |
| | |||||
* | increase default harvest window to 14 days | Bryan Newbold | 2019-04-01 | 1 | -2/+2 |
| | |||||
* | HACK: force pylint to ignore urllib3 Retry import | Bryan Newbold | 2019-03-15 | 1 | -1/+3 |
| | | | | | | As the code comment mentions, not sure why pylint throws this error. requests and urllib3 are recent, and this code runs fine in tests and QA, and pylint is running (in CI) within pipenv. | ||||
* | MEDLINE/Pubmed note | Bryan Newbold | 2019-03-15 | 1 | -2/+6 |
| | | | | Also, arXivRaw, not arXiv (though see WIP on more-importers branch) | ||||
* | fix harvester session.get() params | Bryan Newbold | 2019-03-06 | 1 | -5/+8 |
| | |||||
* | retry/backoff for Crossref harvester | Bryan Newbold | 2019-03-06 | 2 | -2/+24 |
| | |||||
* | bunch of lint/whitespace cleanups | Bryan Newbold | 2019-02-22 | 3 | -9/+6 |
| | |||||
* | check request status codes idiomatically | Bryan Newbold | 2018-12-29 | 1 | -2/+2 |
| | |||||
* | clean up harvester comments/docs | Bryan Newbold | 2018-11-21 | 3 | -50/+31 |
| | |||||
* | use isoformat() to format dates | Bryan Newbold | 2018-11-21 | 2 | -4/+4 |
| | | | | This shouldn't change behavior; it's just more consistent. | ||||
* | fix loop_sleep typo | Bryan Newbold | 2018-11-21 | 2 | -2/+2 |
| | |||||
* | fix datacite DOI extraction | Bryan Newbold | 2018-11-21 | 1 | -1/+1 |
| | |||||
* | fix OAI-PMH name/finished message | Bryan Newbold | 2018-11-21 | 1 | -1/+6 |
| | |||||
* | fix oai-pmh issue again | Bryan Newbold | 2018-11-21 | 1 | -13/+14 |
| | |||||
* | oaipmh: handle NoRecordsMatch | Bryan Newbold | 2018-11-21 | 1 | -5/+8 |
| | |||||
* | initial OAI-PMH harvesters | Bryan Newbold | 2018-11-19 | 3 | -5/+167 |
| |
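
The "pubmed: reconnect on error" entry describes retrying with a fresh client after an EOFError. A minimal sketch of that pattern follows; it is not the harvester's actual code. The names `fetch_with_reconnect`, `connect`, `fetch`, and `max_attempts` are hypothetical and chosen only for illustration.

```python
from typing import Callable, TypeVar

C = TypeVar("C")  # client type, e.g. an ftplib.FTP instance (assumption)
T = TypeVar("T")  # whatever the fetch step returns


def fetch_with_reconnect(
    connect: Callable[[], C],
    fetch: Callable[[C], T],
    max_attempts: int = 3,
) -> T:
    """Run fetch(client); on EOFError, replace the client with a fresh one and retry.

    Hypothetical helper: `connect` builds a new client, `fetch` uses it.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    client = connect()
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(client)
        except EOFError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the original error.
                raise
            # Treat the old connection as unusable and start from a new one.
            client = connect()
    raise AssertionError("unreachable")
```

With the standard library, `connect` could be a function that creates an `ftplib.FTP` for the host and calls `login()`, and `fetch` a function that calls `retrbinary` on the client it is given. Passing the connection factory in, rather than a connected client, is what lets each retry start from a clean session.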