path: root/python/fatcat_tools/importers
Commit message [Author, Age, Files, Lines]
* imports: generic file cleanup removes exact duplicate URLs [Bryan Newbold, 2021-11-09, 1 file, -0/+9]
|
* datacite importer: remove unused 'year_only' variable [Bryan Newbold, 2021-11-03, 1 file, -2/+3]
|
* datacite: add comment about potential date parsing bug [Bryan Newbold, 2021-11-03, 1 file, -0/+1]
|
* datacite importer: dateparser.date.DateDataParser() [Bryan Newbold, 2021-11-03, 1 file, -1/+1]
|
|   Perhaps this was a change when upgrading 'dateparser'?
|
* more involved type wrangling and fixes for importers [Bryan Newbold, 2021-11-03, 3 files, -12/+14]
|
* typing: relatively simple type check fixes [Bryan Newbold, 2021-11-03, 14 files, -87/+82]
|
|   These mostly add new variable names so that existing variables aren't
|   overwritten with a new type; delay coercing '{}' or '[]' to 'None' until
|   the last minute; add is-not-None checks to conditional clauses; and
|   make similar small changes.
|
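The three patterns named in this commit message can be sketched as follows. This is a hypothetical illustration, not code from the repository: `parse_pages` and `clean_extra` are made-up names chosen only to show the shape of the fixes.

```python
from typing import Any, Dict, Optional


def parse_pages(raw: Optional[str]) -> Optional[int]:
    """Hypothetical helper: parse a page count from an optional string."""
    # is-not-None check before touching the value
    if raw is None:
        return None
    stripped = raw.strip()
    if not stripped.isdecimal():
        return None
    # new variable name, so 'raw' (a str) is never rebound to an int
    pages: int = int(stripped)
    return pages


def clean_extra(fields: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: drop unset keys from an 'extra' metadata dict."""
    # keep a real dict for as long as the code works with it ...
    extra: Dict[str, Any] = {k: v for k, v in fields.items() if v is not None}
    # ... and coerce the empty '{}' to None only at the very end
    return extra or None
```

Keeping each name bound to a single type lets a checker such as mypy infer types without casts, and late coercion means intermediate code never has to handle `None` in place of a container.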
* typing: initial annotations on importers [Bryan Newbold, 2021-11-03, 22 files, -274/+443]
|
|   This commit only adds the type annotations; it doesn't fix code to
|   make type checking pass.
|
* importers: remove unused __main__ routine [Bryan Newbold, 2021-11-03, 4 files, -19/+0]
|
|   These perhaps were used in initial development or testing?
|   fatcat_import.py is the correct way to do these imports, even for
|   testing/development.
|
* lint: resolve existing mypy type errors [Bryan Newbold, 2021-11-02, 3 files, -22/+27]
|
|   Adds annotations and re-works dataflow to satisfy existing mypy issues,
|   without adding any additional type annotations to, e.g., function
|   signatures. There will probably be many more type errors when
|   annotations are all added.
|
* re-fix some lint issues after big 'fmt' [Bryan Newbold, 2021-11-02, 1 file, -2/+2]
|
* fmt (black): fatcat_tools/ [Bryan Newbold, 2021-11-02, 22 files, -2115/+2578]
|
* python: isort everything [Bryan Newbold, 2021-11-02, 17 files, -41/+70]
|
* arabesque import 'hit' field is 1/0, not true/false [Bryan Newbold, 2021-11-02, 1 file, -2/+2]
|
* lint: simple, safe inline lint fixes [Bryan Newbold, 2021-11-02, 12 files, -22/+21]
|
|   '==' vs 'is'; 'not a in b' vs 'a not in b'; etc
|
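For readers unfamiliar with these lint rules, here is a minimal sketch of the two idioms. It assumes the usual flake8-style conventions (identity checks against `None`, and the `not in` operator); the function `describe` is hypothetical and does not appear in the importers.

```python
from typing import Collection, Optional


def describe(value: Optional[str], allowed: Collection[str]) -> str:
    """Hypothetical example of the lint-preferred comparison idioms."""
    # identity check: 'value is None' rather than 'value == None'
    if value is None:
        return "missing"
    # membership: 'value not in allowed' rather than 'not value in allowed'
    if value not in allowed:
        return "rejected"
    return "ok"
```

Both spellings in each pair normally behave the same, which is why such fixes are considered safe; the preferred forms read more clearly and do not depend on a custom `__eq__`.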
* lint/fmt: remove all 'import *' [Bryan Newbold, 2021-11-02, 5 files, -21/+41]
|
* re-fmt all the fatcat_tools __init__ files for readability [Bryan Newbold, 2021-11-02, 1 file, -17/+39]
|
* small python tweaks for annotations, imports [Bryan Newbold, 2021-11-02, 2 files, -2/+6]
|
* try some type annotations [Bryan Newbold, 2021-11-02, 2 files, -55/+63]
|
* fix missing variable in fileset ingest [Bryan Newbold, 2021-11-02, 1 file, -2/+1]
|
* WIP: more fileset ingest [Bryan Newbold, 2021-10-18, 1 file, -13/+21]
|
* WIP: rel fixes [Bryan Newbold, 2021-10-14, 1 file, -6/+6]
|
* fileset ingest small tweaks [Bryan Newbold, 2021-10-14, 1 file, -21/+36]
|
* initial implementation of fileset ingest importers [Bryan Newbold, 2021-10-14, 2 files, -3/+224]
|
* generic fileset importer class, with test coverage [Bryan Newbold, 2021-10-14, 3 files, -0/+88]
|
* dblp import: basic support for handles as identifiers [Bryan Newbold, 2021-10-13, 1 file, -1/+5]
|
* dblp import: fix typos in identifier parsing [Bryan Newbold, 2021-10-13, 1 file, -2/+1]
|
* python: partial importer utilization of new schema changes [Bryan Newbold, 2021-10-13, 3 files, -6/+18]
|
* Merge branch 'bnewbold-ingest-tweaks' into 'master' [bnewbold, 2021-10-02, 3 files, -39/+106]
|\
| |   ingest importer behavior tweaks
| |
| |   See merge request webgroup/fatcat!120
| |
| * kafka import: optional 'force-flush' mode for some importers [Bryan Newbold, 2021-10-01, 1 file, -0/+13]
| |
| |   Behavior and motivation described in the kafka json import comment.
| |
| * new SPN web (html) importer [Bryan Newbold, 2021-10-01, 2 files, -27/+81]
| |
| * ingest importer behavior tweaks [Bryan Newbold, 2021-10-01, 1 file, -8/+8]
| |
| |   - change order of 'want()' checks, so that result counts are clearer
| |   - don't require GROBID success for file imports with SPN
| |
| * importer common: more verbose logging (with counts) [Bryan Newbold, 2021-10-01, 1 file, -4/+4]
| |
* | datacite: skip empty abstracts [Martin Czygan, 2021-10-01, 1 file, -1/+4]
|/
|   Do not add abstracts where `clean` results in the empty string - this
|   violates a constraint: `either abstract_sha1 or content is required`
|
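A minimal sketch of the guard this commit describes, under stated assumptions: the `clean` function below is a hypothetical stand-in that only collapses whitespace (the real helper may do more), and `collect_abstracts` is an invented name.

```python
from typing import Dict, List, Optional


def clean(text: Optional[str]) -> str:
    """Hypothetical stand-in for the importer's 'clean' helper."""
    return " ".join((text or "").split())


def collect_abstracts(descriptions: List[Optional[str]]) -> List[Dict[str, str]]:
    """Keep only abstracts that still have content after cleaning."""
    abstracts: List[Dict[str, str]] = []
    for raw in descriptions:
        content = clean(raw)
        # an empty string would violate the "either abstract_sha1 or
        # content is required" constraint, so skip it
        if not content:
            continue
        abstracts.append({"content": content})
    return abstracts
```

For example, `collect_abstracts(["  An abstract. ", "   ", None])` keeps only the first entry.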
* more consistent and defensive lower-casing of DOIs [Bryan Newbold, 2021-06-23, 2 files, -1/+6]
|
|   After noticing more upper/lower ambiguity in production. In particular,
|   we have some old ingest requests in sandcrawler DB, which get
|   re-submitted/re-tried, which have capitalized DOIs in the link source
|   id field.
|
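Defensive lower-casing of this kind might look like the sketch below. The helper name `normalize_doi` is hypothetical; the actual change may apply `.lower()` inline at each lookup site instead.

```python
from typing import Optional


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: lower-case a DOI before lookup or comparison."""
    if not raw:
        return None
    doi = raw.strip().lower()
    # treat whitespace-only input as a missing DOI
    return doi or None
```

Applying the same normalization both when storing and when looking up a DOI means a capitalized value from an old ingest request still matches the existing lower-case record.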
* datacite: more careful title string access; fixes sentry #88350 [Martin Czygan, 2021-06-11, 1 file, -1/+1]
|
|   Caused by a partial title entry without a title coming *first*
|   (e.g. one holding only a language, like: {'lang': 'da'})
|
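One way to access titles defensively is sketched below. This is an assumption about the general approach, not the actual one-line fix; `first_title` is a hypothetical name, and the entry shape follows the `{'lang': 'da'}` example in the commit message.

```python
from typing import Any, Dict, List, Optional


def first_title(titles: Optional[List[Dict[str, Any]]]) -> Optional[str]:
    """Return the first non-empty 'title' value, skipping partial entries."""
    for entry in titles or []:
        # .get() avoids a KeyError on entries such as {'lang': 'da'}
        title = entry.get("title")
        if title:
            return title
    return None
```

With this shape, a language-only entry at the head of the list is skipped instead of raising an exception.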
* ingest: swap ingest and file checks, to result in clearer stats/counts of skipping [Bryan Newbold, 2021-06-03, 1 file, -2/+2]
|
* ingest: don't accept mag and s2 URLs [Bryan Newbold, 2021-06-03, 1 file, -4/+4]
|
* small python lint fixes (no behavior change) [Bryan Newbold, 2021-05-25, 1 file, -2/+0]
|
* arabesque importer: ensure full 14-digit timestamps [Bryan Newbold, 2021-05-21, 1 file, -1/+3]
|
* datacite: a missing surname should be None, not the empty string [Martin Czygan, 2021-04-02, 1 file, -2/+1]
|
|   refs sentry #77700
|
* web ingest: terminal URL mismatch as skip, not assert [Bryan Newbold, 2020-12-30, 1 file, -1/+3]
|
* dblp release import: skip arxiv_id releases [Bryan Newbold, 2020-12-24, 1 file, -0/+9]
|
* dblp import: fix arxiv_id typo [Bryan Newbold, 2020-12-23, 1 file, -1/+1]
|
|   Would have been caught by mypy!
|
* ingest: allow dblp imports [Bryan Newbold, 2020-12-23, 1 file, -1/+1]
|
* fuzzy: set 120 second timeout on ES lookups [Bryan Newbold, 2020-12-23, 1 file, -1/+1]
|
* dblp: polish HTML scrape/extract pipeline [Bryan Newbold, 2020-12-17, 1 file, -0/+14]
|
* dblp: flesh out update code path (especially to add container_id linkage) [Bryan Newbold, 2020-12-17, 1 file, -2/+6]
|
* dblp: run fuzzy matching at try_update time (same as DOAJ) [Bryan Newbold, 2020-12-17, 1 file, -1/+8]
|
* improve dblp release import [Bryan Newbold, 2020-12-17, 1 file, -1/+2]
|
* very simple dblp container importer [Bryan Newbold, 2020-12-17, 2 files, -0/+145]
|