path: root/python/fatcat_tools
Commit message | Author | Age | Files | Lines
* pubmed: allow updates if PMCID does not exist yetBryan Newbold2021-11-101-1/+6
| The intent of this change is to start updating Pubmed metadata records when a PMCID has been assigned, but that ext_id hasn't been recorded in fatcat yet. It is likely that this change will result in some additional duplicate PMCIDs in the catalog. But the principle is that the PMID is the primary pubmed identifier, and all records with a PMID should have the PMCID that pubmed indicates, even if there exists another incorrect record.
* cleanups: create a separate JsonLinePusher for cleanup workers (distinct base class)Bryan Newbold2021-11-032-2/+19
|
* datacite importer: remove unused 'year_only' variableBryan Newbold2021-11-031-2/+3
|
* pubmed harvester: remove unused variablesBryan Newbold2021-11-031-2/+2
|
* pubmed harvester: explicit assertions to mark unreachable code pathsBryan Newbold2021-11-031-0/+2
|
* typing: add assertions to fatcat_tools code to make type assumptions explicitBryan Newbold2021-11-033-0/+3
|
* typing: add annotations to remaining fatcat_tools codeBryan Newbold2021-11-039-122/+186
| Again, these are just annotations, no changes made to get type checks to pass
* datacite: add comment about potential date parsing bugBryan Newbold2021-11-031-0/+1
|
* datacite importer: dateparser.date.DateDataParser()Bryan Newbold2021-11-031-1/+1
| Perhaps this was a change when upgrading 'dateparser'?
* more involved type wrangling and fixes for importersBryan Newbold2021-11-033-12/+14
|
* typing: relatively simple type check fixesBryan Newbold2021-11-0314-87/+82
| These mostly add new variable names so that existing variables aren't overwritten with a new type; delay coercing '{}' or '[]' to 'None' until the last minute; add is-not-None checks to conditional clauses; and make similar small changes.
* typing: initial annotations on importersBryan Newbold2021-11-0322-274/+443
| This commit just adds the type annotations; it doesn't fix code to make type checking pass.
* typing: first batch of python bulk type annotationsBryan Newbold2021-11-039-69/+129
| While these changes are more delicate than simple lint changes, this specific batch of edits and annotations was *relatively* simple, and resulted in few code changes other than function signature additions.
* importers: remove unused __main__ routineBryan Newbold2021-11-034-19/+0
| These perhaps were used in initial development or testing? fatcat_import.py is the correct way to do these imports, even for testing/development.
* lint: resolve existing mypy type errorsBryan Newbold2021-11-028-50/+86
| Adds annotations and reworks dataflow to satisfy existing mypy issues, without adding any additional type annotations to, e.g., function signatures. There will probably be many more type errors when annotations are all added.
* re-fix some lint issues after big 'fmt'Bryan Newbold2021-11-022-4/+5
|
* fmt (black): fatcat_tools/Bryan Newbold2021-11-0243-3194/+4020
|
* python: isort everythingBryan Newbold2021-11-0232-71/+116
|
* arabesque import 'hit' field is 1/0, not true/falseBryan Newbold2021-11-021-2/+2
|
* lint: simple, safe inline lint fixesBryan Newbold2021-11-0218-83/+82
| '==' vs 'is'; 'not a in b' vs 'a not in b'; etc
* lint/fmt: remove all 'import *'Bryan Newbold2021-11-025-21/+41
|
* entity transforms: add basic type annotationsBryan Newbold2021-11-021-7/+19
|
* ftfy 'fix_entities' argument has been renamedBryan Newbold2021-11-021-4/+4
|
* hacks to work around new pylint false positivesBryan Newbold2021-11-021-2/+3
|
* cleanup imports after fatcat_tools.transforms changeBryan Newbold2021-11-021-5/+8
|
* re-fmt all the fatcat_tools __init__ files for readabilityBryan Newbold2021-11-025-30/+62
|
* remove 'import *' from fatcat_tools (for transforms)Bryan Newbold2021-11-021-2/+2
|
* small python tweaks for annotations, importsBryan Newbold2021-11-023-3/+7
|
* try some type annotationsBryan Newbold2021-11-024-70/+79
|
* reviewer: add annotations required by mypyBryan Newbold2021-11-021-2/+3
|
* fix missing variable in fileset ingestBryan Newbold2021-11-021-2/+1
|
* Merge branch 'bnewbold-import-fileset'Bryan Newbold2021-11-025-4/+350
|\
| * WIP: more fileset ingestBryan Newbold2021-10-181-13/+21
| |
| * WIP: rel fixesBryan Newbold2021-10-141-6/+6
| |
| * fileset ingest small tweaksBryan Newbold2021-10-141-21/+36
| |
| * initial implementation of fileset ingest importersBryan Newbold2021-10-142-3/+224
| |
| * ingest: handle datasets, components, other ingest typesBryan Newbold2021-10-141-1/+15
| |
| * generic fileset importer class, with test coverageBryan Newbold2021-10-143-0/+88
| |
* | Merge branch 'bnewbold-match-get'Bryan Newbold2021-11-021-3/+9
|\ \
| * | access: populate thumbnail_url for PDFsBryan Newbold2021-10-181-3/+9
| |/
* / pubmed: switch default http site to retrieve update filesMartin Czygan2021-10-151-2/+4
|/ Proxy started to throw: "dial tcp: lookup ftp.ncbi.nlm.nih.gov on [::1]:53: read udp [::1]:45178->[::1]:53: read: connection refused". NIH has an HTTP version of its own; try to use that.
* dblp import: basic support for handles as identifiersBryan Newbold2021-10-131-1/+5
|
* python: normalization/validation support for handle identifiers (hdl)Bryan Newbold2021-10-131-0/+33
|
* dblp import: fix typos in identifier parsingBryan Newbold2021-10-131-2/+1
|
* python: partial importer utilization of new schema changesBryan Newbold2021-10-133-6/+18
|
* python: implement ES schema changesBryan Newbold2021-10-131-4/+17
|
* Merge branch 'bnewbold-ingest-tweaks' into 'master'bnewbold2021-10-023-39/+106
|\  ingest importer behavior tweaks. See merge request webgroup/fatcat!120
| * kafka import: optional 'force-flush' mode for some importersBryan Newbold2021-10-011-0/+13
| | Behavior and motivation described in the kafka json import comment.
| * new SPN web (html) importerBryan Newbold2021-10-012-27/+81
| |
| * ingest importer behavior tweaksBryan Newbold2021-10-011-8/+8
| | - change order of 'want()' checks, so that result counts are clearer
| | - don't require GROBID success for file imports with SPN