path: root/python/fatcat_tools/harvest/pubmed.py
Commit message | Author | Age | Files | Lines
* pubmed: switch default http site to retrieve update files | Martin Czygan | 2021-10-15 | 1 | -2/+4
    Proxy started to throw: "dial tcp: lookup ftp.ncbi.nlm.nih.gov on [::1]:53: read udp [::1]:45178->[::1]:53: read: connection refused". NIH has an HTTP version of its own; try to use that.
* pubmed: workaround a networking issue | Martin Czygan | 2021-09-09 | 1 | -24/+21
    Use an HTTP proxy (https://github.com/miku/ftpup) to fetch files from FTP and keep some retry logic. The proxy path is hardcoded, as this should be a temporary workaround.
* pubmed: add option to ftp download with lftp | Martin Czygan | 2021-09-08 | 1 | -2/+31
    lftp is a classic command-line FTP client, and we hope that its retry capabilities are enough of a workaround for the current networking issue.
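    A minimal sketch of how such an lftp-based download could be driven from Python. This is not the code from the commit: the function names `build_lftp_command` and `download_with_lftp`, the retry count, and the example paths are assumptions for illustration. It relies on lftp's `-c` option, its `net:max-retries` setting, and `get ... -o`.

```python
import shutil
import subprocess


def build_lftp_command(url, dest, max_retries=5):
    """Build an lftp invocation that downloads `url` to `dest`.

    Assumption: `url` and `dest` contain no whitespace or quote characters,
    so they can be embedded in the lftp script without extra quoting.
    """
    script = "set net:max-retries {}; get {} -o {}".format(max_retries, url, dest)
    return ["lftp", "-c", script]


def download_with_lftp(url, dest, max_retries=5):
    """Run lftp if it is installed; raise if it is missing or exits non-zero."""
    if shutil.which("lftp") is None:
        raise RuntimeError("lftp not found on PATH")
    subprocess.run(build_lftp_command(url, dest, max_retries), check=True)
    return dest
```

    Keeping the command construction separate from the subprocess call makes the retry setting easy to inspect without touching the network.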
* pubmed harvester: add basic retry logic | Martin Czygan | 2021-08-20 | 1 | -8/+21
    Related to a previous issue with seemingly random EOFError from FTP connections, this patch wraps the "ftpretr" helper function with a basic retry. Refs: fatcat-workers/issues/92151, fatcat-workers/issues/91102
* pubmed: update docs | Martin Czygan | 2021-07-17 | 1 | -2/+3
* pubmed: do not fail when accessing missing file | Martin Czygan | 2021-07-17 | 1 | -2/+8
    After a sync gap (e.g. 06/07 2021), the harvester wanted to fetch a file that was no longer on the server. Do not fail in this case; we'll need to backfill missing records via a full data dump.
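    One way to tolerate files that have disappeared from the server, sketched under assumptions: `fetch_available` and its `fetch` callback are hypothetical names, not the harvester's actual API. The sketch relies on the fact that `ftplib` raises `ftplib.error_perm` for permanent 5xx replies, which is how FTP servers typically report a missing file (550).

```python
import ftplib
import sys


def fetch_available(paths, fetch):
    """Fetch every path that still exists; collect the ones that do not.

    `fetch` is any callable taking a path and returning its content
    (hypothetical; in a harvester it would wrap the FTP retrieval).
    """
    fetched = {}
    missing = []
    for path in paths:
        try:
            fetched[path] = fetch(path)
        except ftplib.error_perm as exc:
            # Permanent error (e.g. "550 No such file"): log and move on,
            # so one missing file does not abort the whole harvest.
            print("skipping missing file {}: {}".format(path, exc), file=sys.stderr)
            missing.append(path)
    return fetched, missing
```

    The returned `missing` list gives a record of what would need to be backfilled later.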
* pubmed: reconnect on error | Martin Czygan | 2021-07-16 | 1 | -4/+30
    FTP retrieval would run but fail with EOFError on /pubmed/updatefiles/pubmed21n1328_stats.html. We were not able to find the root cause; using a fresh client, the exact same file would work just fine. So when we retry, we reconnect on failure. Refs: sentry #91102.
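    The reconnect-on-failure idea can be sketched roughly as follows. This is an illustration, not the commit's code: `ftp_connect`, `retrieve_with_reconnect`, and their parameters are assumed names. The point shown is that every attempt builds a fresh client instead of reusing the one that just failed.

```python
import ftplib
import sys
import time


def ftp_connect(host="ftp.ncbi.nlm.nih.gov"):
    """Open a fresh anonymous FTP connection (hypothetical helper)."""
    client = ftplib.FTP(host)
    client.login()
    return client


def retrieve_with_reconnect(path, connect=ftp_connect, max_retries=3, delay=0.0):
    """Fetch `path` as bytes, using a new client for every attempt."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(1, max_retries + 1):
        client = None
        chunks = []
        try:
            client = connect()  # fresh connection, never a reused one
            client.retrbinary("RETR " + path, chunks.append)
            return b"".join(chunks)
        except ftplib.all_errors as exc:  # includes EOFError and OSError
            last_error = exc
            print("attempt {} failed for {}: {!r}".format(attempt, path, exc), file=sys.stderr)
            if attempt < max_retries:
                time.sleep(delay)
        finally:
            if client is not None:
                client.close()
    raise last_error
```

    Passing the connection factory as a parameter keeps the retry loop independent of the real server, which also makes it possible to exercise the loop with a stand-in client.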
* lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 1 | -1/+1
* rename HarvestState.next() to HarvestState.next_span() | Bryan Newbold | 2020-05-26 | 1 | -1/+1
    "span" is short for "timespan" to harvest; there may be a better name to use. Motivation for this is to work around a pylint error that .next() was not callable. This might be a bug with pylint, but .next() is also a very generic name.
* HACK: skip pylint errors on lines that seem to be fine | Bryan Newbold | 2020-05-22 | 1 | -1/+1
    It seems to be an inadvertently upgraded version of pylint saying that these lines are not-callable.
* pubmed: log to stderr | Martin Czygan | 2020-03-10 | 1 | -1/+1
* pubmed: move mapping generation out of fetch_date | Martin Czygan | 2020-03-10 | 1 | -7/+8
    * fetch_date will fail on missing mapping
    * adjust tests (test will require access to pubmed ftp)
* pubmed: citations is a bit more precise | Martin Czygan | 2020-03-09 | 1 | -1/+1
    > Each day, NLM produces update files that include new, revised and deleted citations.
    -- ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt
* pubmed: we sync from FTP | Martin Czygan | 2020-03-09 | 1 | -1/+1
* more pubmed adjustments | Martin Czygan | 2020-02-22 | 1 | -70/+117
    * regenerate map in continuous mode
    * add tests
* pubmed ftp: fix url | Martin Czygan | 2020-02-19 | 1 | -4/+6
* pubmed ftp harvest and KafkaBs4XmlPusher | Martin Czygan | 2020-02-19 | 1 | -0/+199
    * add PubmedFTPWorker
    * utils are currently stored alongside pubmed (e.g. ftpretr, xmlstream) but may live elsewhere, as they are more generic
    * add KafkaBs4XmlPusher