path: root/python/fatcat_tools/importers
Commit message  Author  Date  Files  Lines
* WIP: more fileset ingest  Bryan Newbold  2021-10-18  1  -13/+21
* WIP: rel fixes  Bryan Newbold  2021-10-14  1  -6/+6
* fileset ingest small tweaks  Bryan Newbold  2021-10-14  1  -21/+36
* initial implementation of fileset ingest importers  Bryan Newbold  2021-10-14  2  -3/+224
* generic fileset importer class, with test coverage  Bryan Newbold  2021-10-14  3  -0/+88
* dblp import: basic support for handles as identifiers  Bryan Newbold  2021-10-13  1  -1/+5
* dblp import: fix typos in identifier parsing  Bryan Newbold  2021-10-13  1  -2/+1
* python: partial importer utilization of new schema changes  Bryan Newbold  2021-10-13  3  -6/+18
* Merge branch 'bnewbold-ingest-tweaks' into 'master'  bnewbold  2021-10-02  3  -39/+106
|\
| * kafka import: optional 'force-flush' mode for some importers  Bryan Newbold  2021-10-01  1  -0/+13
| * new SPN web (html) importer  Bryan Newbold  2021-10-01  2  -27/+81
| * ingest importer behavior tweaks  Bryan Newbold  2021-10-01  1  -8/+8
| * importer common: more verbose logging (with counts)  Bryan Newbold  2021-10-01  1  -4/+4
* | datacite: skip empty abstracts  Martin Czygan  2021-10-01  1  -1/+4
|/
* more consistent and defensive lower-casing of DOIs  Bryan Newbold  2021-06-23  2  -1/+6
* datacite: more careful title string access; fixes sentry #88350  Martin Czygan  2021-06-11  1  -1/+1
* ingest: swap ingest and file checks, to result in clearer stats/counts of ski...  Bryan Newbold  2021-06-03  1  -2/+2
* ingest: don't accept mag and s2 URLs  Bryan Newbold  2021-06-03  1  -4/+4
* small python lint fixes (no behavior change)  Bryan Newbold  2021-05-25  1  -2/+0
* arabesque importer: ensure full 14-digit timestamps  Bryan Newbold  2021-05-21  1  -1/+3
* datacite: a missing surname should be None, not the empty string  Martin Czygan  2021-04-02  1  -2/+1
* web ingest: terminal URL mismatch as skip, not assert  Bryan Newbold  2020-12-30  1  -1/+3
* dblp release import: skip arxiv_id releases  Bryan Newbold  2020-12-24  1  -0/+9
* dblp import: fix arxiv_id typo  Bryan Newbold  2020-12-23  1  -1/+1
* ingest: allow dblp imports  Bryan Newbold  2020-12-23  1  -1/+1
* fuzzy: set 120 second timeout on ES lookups  Bryan Newbold  2020-12-23  1  -1/+1
* dblp: polish HTML scrape/extract pipeline  Bryan Newbold  2020-12-17  1  -0/+14
* dblp: flesh out update code path (especially to add container_id linkage)  Bryan Newbold  2020-12-17  1  -2/+6
* dblp: run fuzzy matching at try_update time (same as DOAJ)  Bryan Newbold  2020-12-17  1  -1/+8
* improve dblp release import  Bryan Newbold  2020-12-17  1  -1/+2
* very simple dblp container importer  Bryan Newbold  2020-12-17  2  -0/+145
* dblp release importer: container_id lookup TSV, and dump JSON mode  Bryan Newbold  2020-12-17  1  -10/+66
* initial implementation of dblp release importer (in progress)  Bryan Newbold  2020-12-17  2  -0/+445
* add 'lxml' mode for large XML file import, and multi-tags  Bryan Newbold  2020-12-17  1  -15/+28
* add dblp as an ingest source and identifier  Bryan Newbold  2020-12-17  1  -1/+2
* ingest: allow doaj ingest responses  Bryan Newbold  2020-12-17  1  -1/+2
* update fuzzy helper to pass 'reason' through to import code  Bryan Newbold  2020-12-17  1  -3/+3
* add fuzzy match filtering to DOAJ importer  Bryan Newbold  2020-12-16  1  -2/+9
* add fuzzy matching helper to importer base class  Bryan Newbold  2020-12-16  1  -2/+62
* html ingest: small fixes to try_update() code path  Bryan Newbold  2020-12-15  1  -5/+5
* crossref+datacite: remove confusing early update bail  Bryan Newbold  2020-11-20  2  -4/+0
* doaj: fix update code path (getattr not __dict__)  Bryan Newbold  2020-11-20  1  -4/+3
* DOAJ: handle empty identifier 'id' case  Bryan Newbold  2020-11-20  1  -0/+2
* tweak DOAJ importer class args and default for do_updates  Bryan Newbold  2020-11-19  1  -2/+2
* implement remainder of DOAJ article importer  Bryan Newbold  2020-11-19  1  -57/+125
* more python normalizers, and move from importer common  Bryan Newbold  2020-11-19  1  -154/+4
* initial implementation of DOAJ importer  Bryan Newbold  2020-11-19  2  -0/+290
* html ingest: actual xhtml mimetype  Bryan Newbold  2020-11-16  1  -2/+2
* html ingest: remaining implementation  Bryan Newbold  2020-11-06  1  -22/+19
* ingest: progress on HTML ingest  Bryan Newbold  2020-11-05  1  -14/+30