path: root/python/fatcat_tools/harvest
Commit message | Author | Age | Files | Lines
* python: isort everything | Bryan Newbold | 2021-11-02 | 3 | -9/+12
* lint: simple, safe inline lint fixes | Bryan Newbold | 2021-11-02 | 2 | -2/+2
    '==' vs 'is'; 'not a in b' vs 'a not in b'; etc.
* re-fmt all the fatcat_tools __init__ files for readability | Bryan Newbold | 2021-11-02 | 1 | -4/+2
* pubmed: switch default http site to retrieve update files | Martin Czygan | 2021-10-15 | 1 | -2/+4
    The proxy started to throw: "dial tcp: lookup ftp.ncbi.nlm.nih.gov on [::1]:53: read udp [::1]:45178->[::1]:53: read: connection refused". NIH has an HTTP version of its own; try to use that.
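    A minimal sketch of fetching a daily update file over HTTPS instead of FTP, assuming the HTTPS mirror at https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/ (the function name and filename handling are illustrative, not the actual harvester code):

        import requests

        BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles"

        def fetch_update_file(filename, dest_path):
            """Stream one PubMed update file to disk over HTTPS."""
            resp = requests.get("{}/{}".format(BASE_URL, filename), stream=True, timeout=60)
            resp.raise_for_status()
            with open(dest_path, "wb") as f:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
            return dest_path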
* pubmed: workaround a networking issue | Martin Czygan | 2021-09-09 | 1 | -24/+21
    Use an http proxy (https://github.com/miku/ftpup) to fetch files from FTP and keep some retry logic; also, hardcode the proxy path, as this should be a temporary workaround.
* pubmed: add option to ftp download with lftp | Martin Czygan | 2021-09-08 | 1 | -2/+31
    lftp is a classic command line ftp client, and we hope that its retry capabilities are enough of a workaround for the current networking issue.
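    A minimal sketch of shelling out to lftp and leaning on its retry settings; the settings and values shown are assumptions, not copied from the harvester:

        import subprocess

        def fetch_with_lftp(host, remote_path, local_path):
            """Download a single file via lftp, relying on its built-in retries."""
            # 'net:max-retries' and 'net:timeout' are standard lftp settings;
            # the values here are illustrative.
            script = (
                "set net:max-retries 5; set net:timeout 30; "
                "get {} -o {}; bye"
            ).format(remote_path, local_path)
            subprocess.run(["lftp", "-e", script, host], check=True)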
* pubmed harvester: add basic retry logic | Martin Czygan | 2021-08-20 | 1 | -8/+21
    Related to a previous issue with seemingly random EOFError from FTP connections, this patch wraps the "ftpretr" helper function with a basic retry.
    Refs: fatcat-workers/issues/92151, fatcat-workers/issues/91102
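    A minimal sketch of the kind of bounded retry wrapper described here; the helper name, exception types and sleep values are placeholders:

        import sys
        import time

        def with_retries(func, max_attempts=3, backoff=5.0):
            """Call func(); on transient failure, sleep and try again."""
            for attempt in range(1, max_attempts + 1):
                try:
                    return func()
                except (EOFError, OSError) as exc:
                    if attempt == max_attempts:
                        raise
                    print("attempt {} failed ({}); retrying".format(attempt, exc), file=sys.stderr)
                    time.sleep(backoff * attempt)

        # hypothetical usage: with_retries(lambda: ftpretr("pubmed/updatefiles/pubmed21n1328.xml.gz"))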
* pubmed: update docs | Martin Czygan | 2021-07-17 | 1 | -2/+3
* pubmed: do not fail when accessing missing file | Martin Czygan | 2021-07-17 | 1 | -2/+8
    After a sync gap (e.g. 2021-06/07) the harvester wanted to fetch a file that was not on the server (any more); do not fail in this case. We'll need to backfill missing records via a full data dump.
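    A sketch of skipping a file that has disappeared from the server instead of raising; ftplib reports a missing file as error_perm (550). Function and variable names are illustrative:

        import ftplib
        import sys

        def fetch_or_skip(ftp, filename, local_path):
            """RETR one update file; if it is gone from the server, log and skip it."""
            try:
                with open(local_path, "wb") as f:
                    ftp.retrbinary("RETR " + filename, f.write)
            except ftplib.error_perm as exc:
                # 550 "file unavailable"; such records get backfilled from a full dump
                print("skipping missing file {}: {}".format(filename, exc), file=sys.stderr)
                return None
            return local_path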
* pubmed: reconnect on error | Martin Czygan | 2021-07-16 | 1 | -4/+30
    FTP retrieval would run but fail with EOFError on /pubmed/updatefiles/pubmed21n1328_stats.html; not able to find the root cause. Using a fresh client, the exact same file would work just fine. So when we retry, we reconnect on failure.
    Refs: sentry #91102.
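    A sketch of the "reconnect on failure" idea: open a brand-new FTP connection per attempt rather than reusing one that has already hit EOFError. The host, path handling and attempt count are illustrative:

        import ftplib

        def retrieve_with_reconnect(path, attempts=3):
            """Fetch one file, building a fresh FTP client for every attempt."""
            last_exc = None
            for _ in range(attempts):
                ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
                ftp.login()
                try:
                    chunks = []
                    ftp.retrbinary("RETR " + path, chunks.append)
                    return b"".join(chunks)
                except EOFError as exc:
                    last_exc = exc  # stale connection; drop it and start over
                finally:
                    try:
                        ftp.quit()
                    except Exception:
                        ftp.close()
            raise last_exc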
* small python lint fixes (no behavior change) | Bryan Newbold | 2021-05-25 | 1 | -1/+1
* harvest: datacite API yields HTTP 200 with broken JSON | Martin Czygan | 2020-08-10 | 1 | -1/+8
    As a first step: log response body for debugging.
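    A sketch of that first step, assuming a requests response object; the truncation length is arbitrary:

        import sys

        def parse_datacite_page(resp):
            """Datacite sometimes returns HTTP 200 with a broken JSON body."""
            try:
                return resp.json()
            except ValueError:  # covers json/requests JSONDecodeError
                print("broken JSON from datacite: {!r}".format(resp.text[:500]), file=sys.stderr)
                raise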
* arxiv: retry five times on HTTP 503 | Martin Czygan | 2020-07-10 | 1 | -1/+1
* lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 4 | -19/+6
* harvest: fail on HTTP 400 | Martin Czygan | 2020-05-29 | 1 | -4/+0
    In the past, harvesting datacite resulted in occasional HTTP 400 responses. Meanwhile, various API bugs have been fixed (most recently: https://github.com/datacite/lupo/pull/537, https://github.com/datacite/datacite/issues/1038). The downside of ignoring this error was that state lives in Kafka, which has limited support for deletion of arbitrary messages from a topic.
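    A sketch of the stricter behavior: treat any non-200 response (including the occasional 400) as fatal instead of silently skipping a window. Names are illustrative:

        def fetch_page(http_session, url, params):
            """Fail loudly on unexpected status codes rather than dropping a window."""
            resp = http_session.get(url, params=params)
            if resp.status_code != 200:
                raise Exception("unexpected HTTP {} from {}".format(resp.status_code, url))
            return resp.json()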
* rename HarvestState.next() to HarvestState.next_span() | Bryan Newbold | 2020-05-26 | 4 | -5/+5
    "span" is short for "timespan" to harvest; there may be a better name to use. The motivation for this is to work around a pylint error that .next() was not callable. This might be a bug with pylint, but .next() is also a very generic name.
* HACK: skip pylint errors on lines that seem to be fine | Bryan Newbold | 2020-05-22 | 3 | -3/+3
    It seems to be an inadvertently upgraded version of pylint saying that these lines are not-callable.
* crossref: switch from index-date to update-date | Bryan Newbold | 2020-03-30 | 1 | -1/+1
    This goes against what the API docs recommend, but we are currently far behind on updates and need to catch up. Other than what the docs say, this seems to be consistent with the behavior we want.
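    A sketch of the filter change against the Crossref /works API (surrounding parameters such as rows and cursor are illustrative):

        import requests

        def crossref_day_params(date_str, rows=100, cursor="*"):
            """Build /works query params covering a single day of updates."""
            return {
                # previously: "from-index-date:{d},until-index-date:{d}"
                "filter": "from-update-date:{d},until-update-date:{d}".format(d=date_str),
                "rows": rows,
                "cursor": cursor,
            }

        resp = requests.get("https://api.crossref.org/works", params=crossref_day_params("2020-03-30"))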
* crossref: longer comment about crossref API date fields | Bryan Newbold | 2020-03-30 | 1 | -2/+22
* Merge pull request #53 from EdwardBetts/spelling | bnewbold | 2020-03-27 | 1 | -2/+2
    Correct spelling mistakes
* Correct spelling mistakes | Edward Betts | 2020-03-27 | 1 | -2/+2
* pubmed: log to stderr | Martin Czygan | 2020-03-10 | 1 | -1/+1
* pubmed: move mapping generation out of fetch_date | Martin Czygan | 2020-03-10 | 1 | -7/+8
    - fetch_date will fail on missing mapping
    - adjust tests (test will require access to pubmed ftp)
* harvest: fix imports from HarvestPubmedWorker cleanup | Martin Czygan | 2020-03-10 | 1 | -2/+2
* pubmed: "citations" is a bit more precise | Martin Czygan | 2020-03-09 | 1 | -1/+1
    > Each day, NLM produces update files that include new, revised and deleted citations.
    -- ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt
* pubmed: we sync from FTP | Martin Czygan | 2020-03-09 | 1 | -1/+1
* oaipmh: HarvestPubmedWorker obsoleted by PubmedFTPWorker | Martin Czygan | 2020-03-09 | 1 | -34/+0
* more pubmed adjustments | Martin Czygan | 2020-02-22 | 2 | -70/+118
    - regenerate map in continuous mode
    - add tests
* pubmed ftp: fix url | Martin Czygan | 2020-02-19 | 1 | -4/+6
* pubmed ftp harvest and KafkaBs4XmlPusher | Martin Czygan | 2020-02-19 | 2 | -0/+214
    - add PubmedFTPWorker
    - utils are currently stored alongside pubmed (e.g. ftpretr, xmlstream) but may live elsewhere, as they are more generic
    - add KafkaBs4XmlPusher
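    A sketch of what an "xmlstream"-style helper could look like: stream individual PubmedArticle records out of a large update file without loading the whole document. This is an illustration using lxml, not the actual utility:

        from lxml import etree

        def xml_elements(path, tag="PubmedArticle"):
            """Yield serialized <PubmedArticle> records from a (possibly huge) XML file."""
            for _, elem in etree.iterparse(path, tag=tag):
                yield etree.tostring(elem)
                elem.clear()  # free memory held by already-yielded subtrees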
* harvest: log state on startup and use stderr for diagnostics | Martin Czygan | 2020-02-14 | 3 | -17/+22
* datacite: extend range search query | Martin Czygan | 2019-12-27 | 1 | -1/+1
    The bracket syntax is inclusive. See also: https://www.elastic.co/guide/en/elasticsearch/reference/7.5/query-dsl-query-string-query.html#_ranges
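    A sketch of the inclusive bracket range in the query string sent to the datacite search API; the field name and timestamps are illustrative:

        def datacite_updated_query(start_iso, end_iso):
            """Square brackets make both endpoints inclusive in query_string ranges."""
            return "updated:[{} TO {}]".format(start_iso, end_iso)

        # e.g. datacite_updated_query("2019-12-26T00:00:00Z", "2019-12-26T23:59:59Z")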
* avoid usage of short links | Martin Czygan | 2019-12-27 | 1 | -2/+2
* Datacite API v2 throws 400s we currently cannot recover from | Martin Czygan | 2019-12-27 | 1 | -0/+4
    As a first iteration, just mark the daily batch complete and continue. The occasional HTTP 400 issue has been reported as https://github.com/datacite/datacite/issues/897. A possible improvement would be to shrink the window, so losses will be smaller.
* datacite: update documentation, add links to issues | Martin Czygan | 2019-12-27 | 1 | -10/+5
* datacite: use v2 of the API (flaky) | Martin Czygan | 2019-12-27 | 1 | -5/+28
    Parameter updates for datacite API v2. Works fine, but there are occasional HTTP 400 responses when using the cursor API (daily updates can exceed the 10000 record limit for search queries). The HTTP 400 issue is not solved yet, but has been reported to datacite as https://github.com/datacite/datacite/issues/897.
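    A sketch of cursor-style paging against the datacite REST API v2; the endpoint and page[...] parameter names follow the public docs as understood here and should be treated as assumptions:

        import requests

        def harvest_datacite(query, page_size=100):
            """Walk all result pages via cursor paging; an HTTP 400 mid-walk aborts the batch."""
            url = "https://api.datacite.org/dois"
            params = {"query": query, "page[size]": page_size, "page[cursor]": 1}
            while url:
                resp = requests.get(url, params=params)
                resp.raise_for_status()  # the occasional HTTP 400 surfaces here
                body = resp.json()
                for doi in body.get("data", []):
                    yield doi
                # later requests just follow the pre-built "next" link
                url, params = body.get("links", {}).get("next"), None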
* refactor kafka producer in crossref harvester | Bryan Newbold | 2019-12-06 | 1 | -21/+26
    Producer creation/configuration should happen at __init__() time, not in the 'daily' call. This specific refactor was motivated by mocking out the producer in unit tests.
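    A sketch of the refactor's shape: build the confluent-kafka Producer once in __init__() so tests can swap in a mock; config values and class/method names are illustrative:

        from confluent_kafka import Producer

        class CrossrefHarvestWorker:
            def __init__(self, kafka_hosts, produce_topic):
                self.produce_topic = produce_topic
                # created once up front (and easy to replace in unit tests),
                # instead of inside every daily() call
                self.producer = Producer({"bootstrap.servers": kafka_hosts})

            def publish(self, key, record_json):
                self.producer.produce(self.produce_topic, record_json, key=key)
                self.producer.poll(0)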
* crossref is_update isn't what I thought | Bryan Newbold | 2019-12-03 | 1 | -6/+2
    I thought this would filter for metadata updates to an existing DOI, but actually "updates" are a type of DOI (e.g., a retraction). TODO: handle the 'updates' field. Should both do a lookup and set work_ident appropriately, and store in crossref-specific metadata.
* review/fix all confluent-kafka produce code | Bryan Newbold | 2019-09-20 | 3 | -14/+49
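    A sketch of the produce pattern such a review tends to converge on with confluent-kafka: a delivery callback that surfaces errors, poll() while producing, flush() at the end. Broker address and topic name are illustrative:

        from confluent_kafka import KafkaException, Producer

        def fail_fast(err, msg):
            """Delivery report callback: raise async produce errors instead of dropping them."""
            if err is not None:
                raise KafkaException(err)

        producer = Producer({"bootstrap.servers": "localhost:9092"})
        for key, record in [("k1", b"{}"), ("k2", b"{}")]:
            producer.produce("api-crossref", record, key=key, on_delivery=fail_fast)
            producer.poll(0)  # serve delivery callbacks as we go
        producer.flush()      # block until everything is delivered (or has failed)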
* small fixes to confluent-kafka importers/workers | Bryan Newbold | 2019-09-20 | 2 | -2/+2
    - decrease default changelog pipeline to 5.0 sec
    - fix missing KafkaException harvester imports
    - more confluent-kafka tweaks
    - updates to kafka consumer configs
    - bump elastic updates consumergroup (again)
* small kafka tweaks for robustness | Bryan Newbold | 2019-09-20 | 1 | -0/+2
* bump max message size to ~20 MBytes | Bryan Newbold | 2019-09-20 | 2 | -0/+2
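    A sketch of the producer-side config knob for large messages; the broker- and topic-level limits have to be raised to match, and the values shown are illustrative:

        from confluent_kafka import Producer

        producer = Producer({
            "bootstrap.servers": "localhost:9092",
            # allow individual messages up to ~20 MBytes (librdkafka default is 1 MB)
            "message.max.bytes": 20 * 1024 * 1024,
        })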
* fixes to confluent-kafka harvesters | Bryan Newbold | 2019-09-20 | 3 | -20/+21
* first draft harvesters using confluent-kafka | Bryan Newbold | 2019-09-20 | 3 | -48/+104
* increase default harvest window to 14 days | Bryan Newbold | 2019-04-01 | 1 | -2/+2
* HACK: force pylint to ignore urllib3 Retry import | Bryan Newbold | 2019-03-15 | 1 | -1/+3
    As the code comment mentions, not sure why pylint throws this error. requests and urllib3 are recent, this code runs fine in tests and QA, and pylint is running (in CI) within pipenv.
* MEDLINE/Pubmed note | Bryan Newbold | 2019-03-15 | 1 | -2/+6
    Also, arXivRaw, not arXiv (though see WIP on the more-importers branch).
* fix harvester session.get() params | Bryan Newbold | 2019-03-06 | 1 | -5/+8
* retry/backoff for Crossref harvester | Bryan Newbold | 2019-03-06 | 2 | -2/+24
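    A sketch of requests-level retry/backoff using urllib3's Retry mounted on a Session; the retry count, backoff factor and status list are illustrative (this is likely the Retry import the pylint HACK above refers to):

        import requests
        from requests.adapters import HTTPAdapter
        from urllib3.util.retry import Retry

        def requests_retry_session(retries=5, backoff_factor=1.0,
                                   statuses=(500, 502, 503, 504)):
            """Session that transparently retries idempotent requests with backoff."""
            session = requests.Session()
            retry = Retry(total=retries, backoff_factor=backoff_factor,
                          status_forcelist=list(statuses))
            adapter = HTTPAdapter(max_retries=retry)
            session.mount("http://", adapter)
            session.mount("https://", adapter)
            return session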
* bunch of lint/whitespace cleanups | Bryan Newbold | 2019-02-22 | 3 | -9/+6