path: root/python/fatcat_tools
Commit message  Author  Age  Files  Lines
* Merge branch 'bnewbold-import-fileset'  Bryan Newbold  2021-11-02  5  -4/+350
|\
| * WIP: more fileset ingest  Bryan Newbold  2021-10-18  1  -13/+21
| |
| * WIP: rel fixes  Bryan Newbold  2021-10-14  1  -6/+6
| |
| * fileset ingest small tweaks  Bryan Newbold  2021-10-14  1  -21/+36
| |
| * initial implementation of fileset ingest importers  Bryan Newbold  2021-10-14  2  -3/+224
| |
| * ingest: handle datasets, components, other ingest types  Bryan Newbold  2021-10-14  1  -1/+15
| |
| * generic fileset importer class, with test coverage  Bryan Newbold  2021-10-14  3  -0/+88
| |
* | Merge branch 'bnewbold-match-get'  Bryan Newbold  2021-11-02  1  -3/+9
|\ \
| * | access: populate thumbnail_url for PDFs  Bryan Newbold  2021-10-18  1  -3/+9
| |/
* / pubmed: switch default http site to retrieve update files  Martin Czygan  2021-10-15  1  -2/+4
|/
|     Proxy started to throw: "dial tcp: lookup ftp.ncbi.nlm.nih.gov on [::1]:53: read udp [::1]:45178->[::1]:53: read: connection refused"
|     NIH has an HTTP version of its own; try to use that.
* dblp import: basic support for handles as identifiers  Bryan Newbold  2021-10-13  1  -1/+5
|
* python: normalization/validation support for handle identifiers (hdl)  Bryan Newbold  2021-10-13  1  -0/+33
|
* dblp import: fix typos in identifier parsing  Bryan Newbold  2021-10-13  1  -2/+1
|
* python: partial importer utilization of new schema changes  Bryan Newbold  2021-10-13  3  -6/+18
|
* python: implement ES schema changes  Bryan Newbold  2021-10-13  1  -4/+17
|
* Merge branch 'bnewbold-ingest-tweaks' into 'master'  bnewbold  2021-10-02  3  -39/+106
|\
| |    ingest importer behavior tweaks
| |    See merge request webgroup/fatcat!120
| * kafka import: optional 'force-flush' mode for some importers  Bryan Newbold  2021-10-01  1  -0/+13
| |
| |    Behavior and motivation described in the kafka json import comment.
| * new SPN web (html) importer  Bryan Newbold  2021-10-01  2  -27/+81
| |
| * ingest importer behavior tweaks  Bryan Newbold  2021-10-01  1  -8/+8
| |
| |    - change order of 'want()' checks, so that result counts are clearer
| |    - don't require GROBID success for file imports with SPN
| * importer common: more verbose logging (with counts)  Bryan Newbold  2021-10-01  1  -4/+4
| |
* | datacite: skip empty abstracts  Martin Czygan  2021-10-01  1  -1/+4
|/
|     Do not add abstracts where `clean` results in the empty string; adding one violates the constraint `either abstract_sha1 or content is required`.
* pubmed: workaround a networking issue  Martin Czygan  2021-09-09  1  -24/+21
|
|     Use an HTTP proxy (https://github.com/miku/ftpup) to fetch files from FTP and keep some retry logic. The proxy path is hardcoded, as this should be a temporary workaround.
* pubmed: add option to ftp download with lftp  Martin Czygan  2021-09-08  1  -2/+31
|
|     lftp is a classic command line ftp client, and we hope that its retry capabilities are enough of a workaround for the current networking issue
* pubmed harvester: add basic retry logic  Martin Czygan  2021-08-20  1  -8/+21
|
|     Related to a previous issue with seemingly random EOFError from FTP connections, this patch wraps the "ftpretr" helper function with a basic retry.
|     Refs: fatcat-workers/issues/92151, fatcat-workers/issues/91102
* refs: default to *not* consolidating works  Bryan Newbold  2021-08-06  1  -1/+1
|
|     We don't handle counts for consolidated refs yet, so just don't consolidate. This should fix, eg, "Showing 1-18 of 19" type UX confusion, with the trade-off that some works will be duplicated in inbound ref tables.
* refs: lint fixes  Bryan Newbold  2021-07-27  1  -0/+1
|
* refs: support for wikipedia outbound refs, and display in tables  Bryan Newbold  2021-07-27  1  -2/+2
|
* refs: generalize web endpoints; JSON content negotiation; openlibrary inbound view; etc  Bryan Newbold  2021-07-23  2  -22/+57
|
* refs: small refactors/tweaks  Bryan Newbold  2021-07-23  1  -11/+17
|
* remove unused imports (lint)  Bryan Newbold  2021-07-23  2  -3/+2
|
* pylint: skip pydantic import check (dynamic/extensions)  Bryan Newbold  2021-07-23  1  -8/+2
|
* refs: refactor web paths; enrich refs as generic; remove old refs link  Bryan Newbold  2021-07-23  1  -50/+35
|
* refs fetch: add some hacks; sort hits  Bryan Newbold  2021-07-23  1  -6/+16
|
* fixes for newer ref index  Bryan Newbold  2021-07-23  1  -1/+1
|
* references: refactor to point to access_options transform; comment out CSL fields  Bryan Newbold  2021-07-23  1  -57/+8
|
* partial access options transform for releases  Bryan Newbold  2021-07-23  1  -0/+58
|
* initial inbound/outbound reference query helpers  Bryan Newbold  2021-07-23  1  -0/+450
|
* pubmed: update docs  Martin Czygan  2021-07-17  1  -2/+3
|
* pubmed: do not fail when accessing missing file  Martin Czygan  2021-07-17  1  -2/+8
|
|     After a sync gap (e.g. 06/07 2021), the harvester wanted to fetch a file that was no longer on the server. Do not fail in this case; we'll need to backfill missing records via a full data dump.
* pubmed: reconnect on error  Martin Czygan  2021-07-16  1  -4/+30
|
|     ftp retrieval would run but fail with EOFError on /pubmed/updatefiles/pubmed21n1328_stats.html - not able to find the root cause; using a fresh client, the exact same file would work just fine. So when we retry, we reconnect on failure.
|     Refs: sentry #91102.
* more consistent and defensive lower-casing of DOIs  Bryan Newbold  2021-06-23  3  -3/+8
|
|     After noticing more upper/lower ambiguity in production. In particular, we have some old ingest requests in sandcrawler DB, which get re-submitted/re-tried, which have capitalized DOIs in the link source id field.
* datacite: more careful title string access; fixes sentry #88350  Martin Czygan  2021-06-11  1  -1/+1
|
|     Caused by a partial "title entry without title" coming *first* (e.g. one holding just a language, like: {'lang': 'da'})
* clean_doi() should lower-case returned DOI  Bryan Newbold  2021-06-07  1  -1/+4
|
|     Code in a number of places (including Pubmed importer) assumed that this was already lower-casing DOIs, resulting in some broken metadata getting created.
|     See also: https://github.com/internetarchive/fatcat/issues/83
|     This is just the first step of mitigation.
* ingest: swap ingest and file checks, to result in clearer stats/counts of skipping  Bryan Newbold  2021-06-03  1  -2/+2
|
* ingest: don't accept mag and s2 URLs  Bryan Newbold  2021-06-03  1  -4/+4
|
* changelog worker: fix file/fileset typo, caught by lint  Bryan Newbold  2021-05-25  1  -1/+1
|
|     This would have been resulting in some releases not getting re-indexed into search.
* small python lint fixes (no behavior change)  Bryan Newbold  2021-05-25  3  -4/+2
|
* ingest: add per-container ingest type overrides  Bryan Newbold  2021-05-21  1  -1/+17
|
* arabesque importer: ensure full 14-digit timestamps  Bryan Newbold  2021-05-21  1  -1/+3
|
* transforms: fix 'display_ame' typo  Bryan Newbold  2021-04-19  1  -2/+2
|
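
The "pubmed: reconnect on error" entry above describes retrying an FTP retrieval with a fresh client after an EOFError. A minimal sketch of that idea follows. It is not the code from fatcat_tools: the names `fetch_with_reconnect`, `connect`, and `fetch` are hypothetical, and the retry count and delay are assumed defaults.

```python
import time
from typing import Any, Callable


def fetch_with_reconnect(
    connect: Callable[[], Any],
    fetch: Callable[[Any], Any],
    retries: int = 3,
    delay: float = 0.0,
) -> Any:
    """Call fetch(client), building a fresh client for every attempt.

    Hypothetical helper for illustration only. `connect` returns a new
    client (for FTP this could wrap ftplib.FTP(host) plus login());
    `fetch` does the actual retrieval with that client.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_err = None
    for _ in range(retries):
        client = connect()  # reconnect: never reuse a client that failed
        try:
            return fetch(client)
        except EOFError as err:
            # Only EOFError is retried here, matching the error named in
            # the commit message; other exceptions propagate immediately.
            last_err = err
            time.sleep(delay)
    raise last_err
```

Passing the connection step in as a callable keeps the retry loop free of any network code, so it can be exercised with stand-in functions before being pointed at a real server.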