'==' vs 'is'; 'not a in b' vs 'a not in b'; etc
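For reference, a minimal illustration of the two lint fixes (generic examples, not the actual diff):

```python
# '==' compares values; 'is' compares object identity.
a = [1, 2]
b = [1, 2]
assert a == b       # equal contents
assert a is not b   # but two distinct objects

# 'not a in b' parses as 'not (a in b)'; 'a not in b' is the
# idiomatic spelling of the same membership test.
assert 3 not in a
```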
Proxy started to throw: "dial tcp: lookup ftp.ncbi.nlm.nih.gov on
[::1]:53: read udp [::1]:45178->[::1]:53: read: connection refused".
NIH has an HTTP version of its own; try to use that.
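A minimal sketch of the switch, assuming the HTTPS mirror exposes the same path layout as the FTP site (the filename is illustrative):

```python
import requests

# Fetch a PubMed update file over HTTPS instead of FTP.
url = "https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/pubmed21n1328.xml.gz"
resp = requests.get(url, timeout=60)
resp.raise_for_status()
with open("pubmed21n1328.xml.gz", "wb") as f:
    f.write(resp.content)
```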
Use an HTTP proxy (https://github.com/miku/ftpup) to fetch files from
FTP, and keep some retry logic; also, hardcode the proxy path, as this
should be a temporary workaround.
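How fetching through such a proxy with retries might look; the proxy address and path scheme here are assumptions, not ftpup's actual interface:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Hypothetical proxy endpoint that mirrors FTP paths over HTTP.
PROXY_BASE = "http://localhost:9999"

session = requests.Session()
retry = Retry(total=5, backoff_factor=1.0,
              status_forcelist=[500, 502, 503, 504])
session.mount("http://", HTTPAdapter(max_retries=retry))

resp = session.get(PROXY_BASE + "/pubmed/updatefiles/pubmed21n1328.xml.gz",
                   timeout=60)
resp.raise_for_status()
```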
lftp is a classic command-line FTP client, and we hope that its retry
capabilities are enough of a workaround for the current networking issue.
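A sketch of what shelling out to lftp could look like; the retry settings and file path are illustrative, not the values actually deployed:

```python
import subprocess

# Let lftp handle reconnects and retries internally.
cmd = [
    "lftp", "-c",
    "set net:max-retries 10; set net:timeout 30; "
    "open ftp.ncbi.nlm.nih.gov; "
    "get /pubmed/updatefiles/pubmed21n1328.xml.gz -o /tmp/pubmed21n1328.xml.gz",
]
subprocess.run(cmd, check=True)
```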
Related to a previous issue with seemingly random EOFError from FTP
connections, this patch wraps the "ftpretr" helper function with a basic
retry.
Refs: fatcat-workers/issues/92151, fatcat-workers/issues/91102
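A minimal sketch of such a basic retry (the real helper is named ftpretr; the wrapper shape here is an assumption):

```python
import time

def with_retry(fn, attempts=3, delay=10.0, exceptions=(EOFError,)):
    """Call fn(), retrying a few times on transient FTP errors."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)

# e.g. with_retry(lambda: ftpretr(url))
```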
After a sync gap (e.g. 06/07 2021), the harvester wanted to fetch a file
that was not on the server (any more); do not fail in this case.
We'll need to backfill missing records via a full data dump.
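A sketch of the skip, assuming the server reports the missing file with an FTP 550 permanent error (the helper name is hypothetical):

```python
import ftplib

def fetch_or_skip(ftp, path, sink):
    """Retrieve a file, but treat 'gone from the server' as skippable."""
    try:
        ftp.retrbinary("RETR " + path, sink.write)
    except ftplib.error_perm as exc:
        if str(exc).startswith("550"):
            return None  # missing; backfill later from a full data dump
        raise
```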
FTP retrieval would run but fail with EOFError on
/pubmed/updatefiles/pubmed21n1328_stats.html; we were not able to find
the root cause. Using a fresh client, the exact same file would work
just fine. So when we retry, we reconnect on failure.
Refs: sentry #91102.
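A sketch of retry-with-reconnect, building a fresh ftplib client per attempt (names and retry counts are illustrative):

```python
import ftplib
import time

def fetch_with_reconnect(path, attempts=3, delay=10.0):
    """Retry on EOFError, reconnecting from scratch each time."""
    for attempt in range(1, attempts + 1):
        ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
        ftp.login()  # anonymous
        try:
            chunks = []
            ftp.retrbinary("RETR " + path, chunks.append)
            return b"".join(chunks)
        except EOFError:
            if attempt == attempts:
                raise
            time.sleep(delay)
        finally:
            try:
                ftp.quit()
            except Exception:
                pass
```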
As a first step: log response body for debugging.
In the past, harvesting Datacite resulted in occasional HTTP 400s.
Meanwhile, various API bugs have been fixed (most recently:
https://github.com/datacite/lupo/pull/537,
https://github.com/datacite/datacite/issues/1038). The downside of
ignoring this error was that state lives in Kafka, which has limited
support for deleting arbitrary messages from a topic.
"span" short for "timespan" to harvest; there may be a better name to
use.
Motivation for this is to work around a pylint erorr that .next() was
not callable. This might be a bug with pylint, but .next() is also a
very generic name.
It seems to be an inadvertently upgraded version of pylint saying that
these lines are not callable.
This goes against what the API docs recommend, but we are currently far
behind on updates and need to catch up. Apart from what the docs say,
this seems to be consistent with the behavior we want.
Correct spelling mistakes
* fetch_date will fail on a missing mapping
* adjust tests (tests will require access to pubmed FTP)
> Each day, NLM produces update files that include new, revised and
> deleted citations.
-- ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt
* regenerate map in continuous mode
* add tests
* add PubmedFTPWorker
* utils are currently stored alongside pubmed (e.g. ftpretr, xmlstream)
  but may live elsewhere, as they are more generic (a streaming sketch
  follows below)
* add KafkaBs4XmlPusher
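A rough sketch of the streaming idea behind an xmlstream-style helper; the real one may differ (e.g. it may be bs4-based, given KafkaBs4XmlPusher):

```python
import xml.etree.ElementTree as ET

def xmlstream(path, tag="PubmedArticle"):
    """Yield one serialized record at a time from a large XML file,
    without loading the whole document into memory."""
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            yield ET.tostring(elem)
            elem.clear()  # release the subtree we just emitted
```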
The bracket syntax is inclusive. See also:
https://www.elastic.co/guide/en/elasticsearch/reference/7.5/query-dsl-query-string-query.html#_ranges
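For illustration, inclusive brackets vs exclusive braces in a query string (field name and dates are made up):

```python
# Square brackets include both endpoints; curly braces exclude them.
inclusive = "release_date:[2020-01-01 TO 2020-01-02]"
exclusive = "release_date:{2020-01-01 TO 2020-01-02}"
query = {"query": {"query_string": {"query": inclusive}}}
```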
As a first iteration, just mark the daily batch complete and continue.
The occasional HTTP 400 issue has been reported as
https://github.com/datacite/datacite/issues/897.
A possible improvement would be to shrink the window, so losses will be
smaller.
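A purely illustrative sketch of the window-shrinking idea, splitting a day into hourly slices:

```python
import datetime

def hour_windows(day):
    """Yield 24 hourly (start, end) slices covering one day, so a
    single failed slice loses at most an hour of updates."""
    start = datetime.datetime.combine(day, datetime.time.min)
    for hour in range(24):
        lo = start + datetime.timedelta(hours=hour)
        yield lo, lo + datetime.timedelta(hours=1)
```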
Update the update parameter for Datacite API v2. Works fine, but there
are occasional HTTP 400 responses when using the cursor API (daily
updates can exceed the 10000-record limit for search queries).
The HTTP 400 issue is not solved yet, but it has been reported to
Datacite as https://github.com/datacite/datacite/issues/897.
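A sketch of the cursor flow against the Datacite v2 API, assuming the documented pattern of starting with page[cursor]=1 and following links.next (the query value is illustrative):

```python
import requests

params = {
    "query": "updated:[2020-01-01T00:00:00Z TO 2020-01-02T00:00:00Z]",
    "page[size]": 100,
    "page[cursor]": 1,
}
resp = requests.get("https://api.datacite.org/dois", params=params).json()
while True:
    for doc in resp.get("data", []):
        pass  # process one DOI record
    next_url = resp.get("links", {}).get("next")
    if not next_url:
        break
    resp = requests.get(next_url).json()
```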
Producer creation/configuration should happen at __init__() time, not
in the 'daily' call.
This specific refactor was motivated by mocking out the producer in
unit tests.
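A minimal sketch of the refactor and the test it enables (class and topic names are hypothetical):

```python
from unittest import mock

class Harvester:
    def __init__(self, producer=None):
        # Build the producer once, up front; tests can inject a fake.
        self.producer = producer if producer is not None else self._make_producer()

    def _make_producer(self):
        raise NotImplementedError("real code would configure Kafka here")

    def daily(self, records):
        for record in records:
            self.producer.produce("harvest-topic", record)

# Unit test can now swap the producer wholesale:
h = Harvester(producer=mock.Mock())
h.daily([b"{}"])
h.producer.produce.assert_called_once()
```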
I thought this would filter for metadata updates to an existing DOI,
but actually "updates" are a type of DOI (e.g., a retraction).
TODO: handle the 'updates' field. Should both do a lookup and set
work_ident appropriately, and store it in crossref-specific metadata.
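A sketch of what handling the field might look like; Crossref exposes these as "update-to" entries, and the lookup helper here is hypothetical:

```python
def handle_updates(record):
    """For an 'update' DOI (e.g. a retraction), find the DOI it amends."""
    for upd in record.get("update-to", []):
        target_doi = upd.get("DOI")
        work_ident = lookup_work_by_doi(target_doi)  # hypothetical lookup
        record.setdefault("extra", {})["updated_work"] = work_ident
```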
- decrease default changelog pipeline to 5.0 sec
- fix missing KafkaException harvester imports
- more confluent-kafka tweaks
- updates to kafka consumer configs
- bump elastic updates consumergroup (again); see the config sketch below
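The kind of confluent-kafka consumer configuration these tweaks touch; the values are illustrative, not the deployed ones:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fatcat-elastic-updates-v2",  # bumping the group resets offsets
    "auto.offset.reset": "latest",
    "max.poll.interval.ms": 300000,
})
```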
As the code comment mentions, not sure why pylint throws this error.
requests and urllib3 are recent, this code runs fine in tests and QA,
and pylint runs (in CI) within pipenv.
Also, arXivRaw, not arXiv (though see WIP on more-importers branch)