path: root/python/fatcat_tools/importers
Commit message (Author, Age; Files, Lines)
* Merge branch 'bnewbold-ingest-tweaks' into 'master' (bnewbold, 2021-10-02; 3 files, -39/+106)
|\
| |   ingest importer behavior tweaks
| |
| |   See merge request webgroup/fatcat!120
| * kafka import: optional 'force-flush' mode for some importers (Bryan Newbold, 2021-10-01; 1 file, -0/+13)
| |
| |   Behavior and motivation described in the kafka json import comment.
| * new SPN web (html) importer (Bryan Newbold, 2021-10-01; 2 files, -27/+81)
| |
| * ingest importer behavior tweaks (Bryan Newbold, 2021-10-01; 1 file, -8/+8)
| |
| |   - change order of 'want()' checks, so that result counts are clearer
| |   - don't require GROBID success for file imports with SPN
| * importer common: more verbose logging (with counts) (Bryan Newbold, 2021-10-01; 1 file, -4/+4)
| |
* | datacite: skip empty abstracts (Martin Czygan, 2021-10-01; 1 file, -1/+4)
|/
|     Do not add abstracts where `clean` results in the empty string - this
|     violates a constraint: `either abstract_sha1 or content is required`
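
A minimal sketch of the skip described in the commit above, not the importer's actual code. `clean_text` is a hypothetical stand-in for the `clean` helper the message mentions, and the `description` and `mimetype` field names are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def clean_text(raw: Optional[str]) -> str:
    """Hypothetical stand-in for `clean`: collapse runs of whitespace."""
    if not raw:
        return ""
    return " ".join(raw.split())


def build_abstracts(descriptions: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only abstracts whose cleaned content is non-empty."""
    abstracts = []
    for desc in descriptions:
        text = clean_text(desc.get("description"))
        if not text:
            # An abstract with empty content (and no sha1) would violate the
            # "either abstract_sha1 or content is required" constraint.
            continue
        abstracts.append({"content": text, "mimetype": "text/plain"})
    return abstracts
```

A whitespace-only description is dropped instead of being submitted as an abstract with empty content.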
* more consistent and defensive lower-casing of DOIs (Bryan Newbold, 2021-06-23; 2 files, -1/+6)
|
|     After noticing more upper/lower ambiguity in production. In particular, we
|     have some old ingest requests in sandcrawler DB, which get
|     re-submitted/re-tried, which have capitalized DOIs in the link source id
|     field.
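
As an illustration of the defensive lower-casing described in the commit above, here is a small sketch. The function name `normalize_doi` and the basic shape check are assumptions for this example, not the project's own helper.

```python
from typing import Optional


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Return a trimmed, lower-cased DOI, or None if it does not look like one."""
    if not raw:
        return None
    doi = raw.strip().lower()
    # Assumed sanity check: DOIs start with "10." and contain a "/" separator.
    if not doi.startswith("10.") or "/" not in doi:
        return None
    return doi
```

Applying the same normalization both when storing and when looking up a DOI means that a capitalized value in an old ingest request still matches the lower-cased identifier.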
* datacite: more careful title string access; fixes sentry #88350 (Martin Czygan, 2021-06-11; 1 file, -1/+1)
|
|     Caused by a partial "title entry without title" coming *first* (e.g. one
|     holding just a language, like {'lang': 'da'}).
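
One way to guard against the failure described in the commit above is to look for the first entry that actually carries a title rather than indexing the first entry directly. This is a sketch only; `first_title` is a hypothetical name and the actual one-line fix may differ.

```python
from typing import Any, List, Optional


def first_title(titles: Optional[List[Any]]) -> Optional[str]:
    """Return the first non-empty 'title' value, skipping partial entries."""
    for entry in titles or []:
        if not isinstance(entry, dict):
            continue
        title = entry.get("title")
        if title:
            return title
    return None
```

With this shape, a leading entry such as `{'lang': 'da'}` is skipped instead of raising a `KeyError`.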
* ingest: swap ingest and file checks, to result in clearer stats/counts of skipping (Bryan Newbold, 2021-06-03; 1 file, -2/+2)
|
* ingest: don't accept mag and s2 URLs (Bryan Newbold, 2021-06-03; 1 file, -4/+4)
|
* small python lint fixes (no behavior change) (Bryan Newbold, 2021-05-25; 1 file, -2/+0)
|
* arabesque importer: ensure full 14-digit timestamps (Bryan Newbold, 2021-05-21; 1 file, -1/+3)
|
* datacite: a missing surname should be None, not the empty string (Martin Czygan, 2021-04-02; 1 file, -2/+1)
|
|     refs sentry #77700
* web ingest: terminal URL mismatch as skip, not assert (Bryan Newbold, 2020-12-30; 1 file, -1/+3)
|
* dblp release import: skip arxiv_id releases (Bryan Newbold, 2020-12-24; 1 file, -0/+9)
|
* dblp import: fix arxiv_id typo (Bryan Newbold, 2020-12-23; 1 file, -1/+1)
|
|     Would have been caught by mypy!
* ingest: allow dblp imports (Bryan Newbold, 2020-12-23; 1 file, -1/+1)
|
* fuzzy: set 120 second timeout on ES lookups (Bryan Newbold, 2020-12-23; 1 file, -1/+1)
|
* dblp: polish HTML scrape/extract pipeline (Bryan Newbold, 2020-12-17; 1 file, -0/+14)
|
* dblp: flesh out update code path (especially to add container_id linkage) (Bryan Newbold, 2020-12-17; 1 file, -2/+6)
|
* dblp: run fuzzy matching at try_update time (same as DOAJ) (Bryan Newbold, 2020-12-17; 1 file, -1/+8)
|
* improve dblp release import (Bryan Newbold, 2020-12-17; 1 file, -1/+2)
|
* very simple dblp container importer (Bryan Newbold, 2020-12-17; 2 files, -0/+145)
|
* dblp release importer: container_id lookup TSV, and dump JSON mode (Bryan Newbold, 2020-12-17; 1 file, -10/+66)
|
* initial implementation of dblp release importer (in progress) (Bryan Newbold, 2020-12-17; 2 files, -0/+445)
|
* add 'lxml' mode for large XML file import, and multi-tags (Bryan Newbold, 2020-12-17; 1 file, -15/+28)
|
* add dblp as an ingest source and identifier (Bryan Newbold, 2020-12-17; 1 file, -1/+2)
|
* ingest: allow doaj ingest responses (Bryan Newbold, 2020-12-17; 1 file, -1/+2)
|
* update fuzzy helper to pass 'reason' through to import code (Bryan Newbold, 2020-12-17; 1 file, -3/+3)
|
|     The motivation for this change is to enable passing the 'reason' through to
|     edit extra metadata, in cases where we merge or cluster releases.
* add fuzzy match filtering to DOAJ importer (Bryan Newbold, 2020-12-16; 1 file, -2/+9)
|
|     In this default configuration, any entities with a fuzzy match (even
|     "ambiguous") will be skipped at import time, to prevent creating
|     duplicates. This is conservative towards not creating new/duplicate
|     entities.
|
|     In the future, as we get more confidence in fuzzy match/verification, we
|     can start to ignore AMBIGUOUS, handle EXACT as same release, and merge
|     STRONG (and WEAK?) matches under the same work entity.
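
The conservative default described in the commit above can be sketched as a small policy function. The four status names come from the commit message; the `MatchStatus` enum, `should_skip_import`, and its `skip_statuses` parameter are hypothetical and do not reflect the fuzzycat or fatcat_tools API.

```python
from enum import Enum
from typing import FrozenSet, Iterable


class MatchStatus(Enum):
    # Status names taken from the commit message; the enum is illustrative.
    EXACT = "exact"
    STRONG = "strong"
    WEAK = "weak"
    AMBIGUOUS = "ambiguous"


# Default policy: any fuzzy match at all, even AMBIGUOUS, blocks the import.
ALL_STATUSES: FrozenSet[MatchStatus] = frozenset(MatchStatus)


def should_skip_import(
    matches: Iterable[MatchStatus],
    skip_statuses: FrozenSet[MatchStatus] = ALL_STATUSES,
) -> bool:
    """Return True if any match has a status that blocks creating a new entity."""
    return any(status in skip_statuses for status in matches)
```

Under this sketch, a later relaxation such as ignoring AMBIGUOUS matches would amount to passing a smaller `skip_statuses` set rather than changing the check itself.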
* add fuzzy matching helper to importer base class (Bryan Newbold, 2020-12-16; 1 file, -2/+62)
|
|     Using fuzzycat. Add basic test coverage.
* html ingest: small fixes to try_update() code path (Bryan Newbold, 2020-12-15; 1 file, -5/+5)
|
|     Don't currently have test coverage for most try_update() code; run the
|     inserts manually in testing.
* crossref+datacite: remove confusing early update bail (Bryan Newbold, 2020-11-20; 2 files, -4/+0)
|
|     Easy to miss that we skip updates *twice*, and with this early bailout were
|     not updating counts correctly.
* doaj: fix update code path (getattr not __dict__) (Bryan Newbold, 2020-11-20; 1 file, -4/+3)
|
|     Also add missing code coverage for update path (disabled by default).
* DOAJ: handle empty identifier 'id' case (Bryan Newbold, 2020-11-20; 1 file, -0/+2)
|
* tweak DOAJ importer class args and default for do_updates (Bryan Newbold, 2020-11-19; 1 file, -2/+2)
|
* implement remainder of DOAJ article importer (Bryan Newbold, 2020-11-19; 1 file, -57/+125)
|
* more python normalizers, and move from importer common (Bryan Newbold, 2020-11-19; 1 file, -154/+4)
|
|     Moved several normalizer helpers out of fatcat_tools.importers.common to
|     fatcat_tools.normal. Copied language name and country name parser helpers
|     from chocula repository (built on existing pycountry helper library).
|
|     Have not gone through and refactored other importers to point to these
|     helpers yet; that should be a separate PR when this branch is merged.
|
|     Current changes are backwards compatible via re-imports.
* initial implementation of DOAJ importer (Bryan Newbold, 2020-11-19; 2 files, -0/+290)
|
|     Several things to finish implementing and polish.
* html ingest: actual xhtml mimetype (Bryan Newbold, 2020-11-16; 1 file, -2/+2)
|
* html ingest: remaining implementation (Bryan Newbold, 2020-11-06; 1 file, -22/+19)
|
* ingest: progress on HTML ingest (Bryan Newbold, 2020-11-05; 1 file, -14/+30)
|
* ingest: initial 'web' worker implementation (Bryan Newbold, 2020-11-05; 2 files, -67/+259)
|
* refactor: white/black -> allow/block (Bryan Newbold, 2020-11-05; 1 file, -4/+4)
|
* ingest: whitelist -> allowlist (Bryan Newbold, 2020-11-05; 1 file, -3/+3)
|
* ingest: basic checks for ingest_type (Bryan Newbold, 2020-11-05; 1 file, -3/+29)
|
* chocula importer: small tweaks to update behavior (Bryan Newbold, 2020-10-08; 1 file, -8/+6)
|
* address spammy datacite titles (Martin Czygan, 2020-09-23; 1 file, -0/+19)
|
|     seemingly from zenodo:
|
|     * https://fatcat.wiki/release/rzcpjwukobd4pj36ipla22cnoi
|     * https://doi.org/10.5281/zenodo.4041777
|
|     About 3400 records with "FULL MOVIE" in title, currently.
* datacite: handle case of empty-string version (Bryan Newbold, 2020-09-10; 1 file, -1/+1)
|
|     Includes a tiny tweak to the datacite import sample file to test this code
|     path.
* remove spurious print statement (Bryan Newbold, 2020-09-03; 1 file, -1/+0)
|