path: root/python/fatcat_tools/importers
Commit message  Author  Age  Files  Lines
* Merge branch 'bnewbold-import-refactors' into 'master'  bnewbold  2021-11-11  16  -1380/+146
|\
| * refactor importer metadata tables into separate file; move some helpers around  Bryan Newbold  2021-11-10  8  -621/+25
| * importers: refactor imports of clean() and other normalization helpers  Bryan Newbold  2021-11-10  12  -95/+104
| * remove cdl_dash_dat and wayback_static importers  Bryan Newbold  2021-11-10  3  -510/+0
| * datacite import: store less subject metadata  Bryan Newbold  2021-11-10  1  -1/+7
| * importers: use clean_doi() in many more (all?) importers  Bryan Newbold  2021-11-09  6  -12/+29
| * remove deprecated extid sqlite3 lookup table feature from importers  Bryan Newbold  2021-11-09  3  -160/+0
* | Merge branch 'bnewbold-cleanups-nov2021' into 'master'  bnewbold  2021-11-11  1  -0/+9
|\ \
| * | imports: generic file cleanup removes exact duplicate URLs  Bryan Newbold  2021-11-09  1  -0/+9
| |/
* / pubmed: allow updates if PMCID does not exist yet  Bryan Newbold  2021-11-10  1  -1/+6
|/
* datacite importer: remove unused 'year_only' variable  Bryan Newbold  2021-11-03  1  -2/+3
* datacite: add comment about potential date parsing bug  Bryan Newbold  2021-11-03  1  -0/+1
* datacite importer: dateparser.date.DateDataParser()  Bryan Newbold  2021-11-03  1  -1/+1
* more involved type wrangling and fixes for importers  Bryan Newbold  2021-11-03  3  -12/+14
* typing: relatively simple type check fixes  Bryan Newbold  2021-11-03  14  -87/+82
* typing: initial annotations on importers  Bryan Newbold  2021-11-03  22  -274/+443
* importers: remove unused __main__ routine  Bryan Newbold  2021-11-03  4  -19/+0
* lint: resolve existing mypy type errors  Bryan Newbold  2021-11-02  3  -22/+27
* re-fix some lint issues after big 'fmt'  Bryan Newbold  2021-11-02  1  -2/+2
* fmt (black): fatcat_tools/  Bryan Newbold  2021-11-02  22  -2115/+2578
* python: isort everything  Bryan Newbold  2021-11-02  17  -41/+70
* arabesque import 'hit' field is 1/0, not true/false  Bryan Newbold  2021-11-02  1  -2/+2
* lint: simple, safe inline lint fixes  Bryan Newbold  2021-11-02  12  -22/+21
* lint/fmt: remove all 'import *'  Bryan Newbold  2021-11-02  5  -21/+41
* re-fmt all the fatcat_tools __init__ files for readability  Bryan Newbold  2021-11-02  1  -17/+39
* small python tweaks for annotations, imports  Bryan Newbold  2021-11-02  2  -2/+6
* try some type annotations  Bryan Newbold  2021-11-02  2  -55/+63
* fix missing variable in fileset ingest  Bryan Newbold  2021-11-02  1  -2/+1
* WIP: more fileset ingest  Bryan Newbold  2021-10-18  1  -13/+21
* WIP: rel fixes  Bryan Newbold  2021-10-14  1  -6/+6
* fileset ingest small tweaks  Bryan Newbold  2021-10-14  1  -21/+36
* initial implementation of fileset ingest importers  Bryan Newbold  2021-10-14  2  -3/+224
* generic fileset importer class, with test coverage  Bryan Newbold  2021-10-14  3  -0/+88
* dblp import: basic support for handles as identifiers  Bryan Newbold  2021-10-13  1  -1/+5
* dblp import: fix typos in identifier parsing  Bryan Newbold  2021-10-13  1  -2/+1
* python: partial importer utilization of new schema changes  Bryan Newbold  2021-10-13  3  -6/+18
* Merge branch 'bnewbold-ingest-tweaks' into 'master'  bnewbold  2021-10-02  3  -39/+106
|\
| * kafka import: optional 'force-flush' mode for some importers  Bryan Newbold  2021-10-01  1  -0/+13
| * new SPN web (html) importer  Bryan Newbold  2021-10-01  2  -27/+81
| * ingest importer behavior tweaks  Bryan Newbold  2021-10-01  1  -8/+8
| * importer common: more verbose logging (with counts)  Bryan Newbold  2021-10-01  1  -4/+4
* | datacite: skip empty abstracts  Martin Czygan  2021-10-01  1  -1/+4
|/
* more consistent and defensive lower-casing of DOIs  Bryan Newbold  2021-06-23  2  -1/+6
* datacite: more careful title string access; fixes sentry #88350  Martin Czygan  2021-06-11  1  -1/+1
* ingest: swap ingest and file checks, to result in clearer stats/counts of ski...  Bryan Newbold  2021-06-03  1  -2/+2
* ingest: don't accept mag and s2 URLs  Bryan Newbold  2021-06-03  1  -4/+4
* small python lint fixes (no behavior change)  Bryan Newbold  2021-05-25  1  -2/+0
* arabesque importer: ensure full 14-digit timestamps  Bryan Newbold  2021-05-21  1  -1/+3
* datacite: a missing surname should be None, not the empty string  Martin Czygan  2021-04-02  1  -2/+1
* web ingest: terminal URL mismatch as skip, not assert  Bryan Newbold  2020-12-30  1  -1/+3