path: root/python/fatcat_tools/importers/ingest.py
Commit message | Author | Age | Files | Lines
* fileset ingest: handle missing/partial file-level metadata | Bryan Newbold | 2022-04-05 | 1 | -3/+3
|
* ingest importer: improved extra/edit_extra code flow | Bryan Newbold | 2022-04-05 | 1 | -20/+13
|
* fileset ingest: remove a TODO | Bryan Newbold | 2022-04-04 | 1 | -1/+0
|
* filesets: typo bugfix, and test 'mimetype' on entity, not extra | Bryan Newbold | 2022-04-04 | 1 | -1/+1
|
* fileset ingest: fix mimetype handling | Bryan Newbold | 2022-03-31 | 1 | -4/+5
|
* bugfix: logic flow in fileset release checking | Bryan Newbold | 2022-03-23 | 1 | -3/+6
|
* single-file variant of fileset importer for dataset attempts | Bryan Newbold | 2022-03-23 | 1 | -0/+201
|
* ingest fileset fixes, and some test coverage | Bryan Newbold | 2022-03-23 | 1 | -13/+19
|
* dataset ingest: JSON object fixes | Bryan Newbold | 2022-03-22 | 1 | -5/+5
|
* typing: relatively simple type check fixes | Bryan Newbold | 2021-11-03 | 1 | -3/+4
|   These mostly add new variable names so that existing variables aren't overwritten with a new type; delay coercing '{}' or '[]' to 'None' until the last minute; add is-not-None checks to conditional clauses; and make similar small changes.
* typing: initial annotations on importers | Bryan Newbold | 2021-11-03 | 1 | -35/+46
|   This commit just adds the type annotations; it doesn't fix the code to make type checking pass.
* fmt (black): fatcat_tools/ | Bryan Newbold | 2021-11-02 | 1 | -319/+374
|
* python: isort everything | Bryan Newbold | 2021-11-02 | 1 | -0/+1
|
* lint: simple, safe inline lint fixes | Bryan Newbold | 2021-11-02 | 1 | -6/+6
|   '==' vs 'is'; 'not a in b' vs 'a not in b'; etc
* fix missing variable in fileset ingest | Bryan Newbold | 2021-11-02 | 1 | -2/+1
|
* WIP: more fileset ingest | Bryan Newbold | 2021-10-18 | 1 | -13/+21
|
* WIP: rel fixes | Bryan Newbold | 2021-10-14 | 1 | -6/+6
|
* fileset ingest small tweaks | Bryan Newbold | 2021-10-14 | 1 | -21/+36
|
* initial implementation of fileset ingest importers | Bryan Newbold | 2021-10-14 | 1 | -2/+223
|
* new SPN web (html) importer | Bryan Newbold | 2021-10-01 | 1 | -26/+80
|
* ingest importer behavior tweaks | Bryan Newbold | 2021-10-01 | 1 | -8/+8
|   - change order of 'want()' checks, so that result counts are clearer
|   - don't require GROBID success for file imports with SPN
* more consistent and defensive lower-casing of DOIs | Bryan Newbold | 2021-06-23 | 1 | -0/+4
|   After noticing more upper/lower ambiguity in production. In particular, we have some old ingest requests in sandcrawler DB, which get re-submitted/re-tried, which have capitalized DOIs in the link source id field.
* ingest: swap ingest and file checks, to result in clearer stats/counts of skipping | Bryan Newbold | 2021-06-03 | 1 | -2/+2
|
* ingest: don't accept mag and s2 URLs | Bryan Newbold | 2021-06-03 | 1 | -4/+4
|
* web ingest: terminal URL mismatch as skip, not assert | Bryan Newbold | 2020-12-30 | 1 | -1/+3
|
* ingest: allow dblp imports | Bryan Newbold | 2020-12-23 | 1 | -1/+1
|
* add dblp as an ingest source and identifier | Bryan Newbold | 2020-12-17 | 1 | -1/+2
|
* ingest: allow doaj ingest responses | Bryan Newbold | 2020-12-17 | 1 | -1/+2
|
* html ingest: small fixes to try_update() code path | Bryan Newbold | 2020-12-15 | 1 | -5/+5
|   Don't currently have test coverage for most try_update() code; run the inserts manually in testing.
* html ingest: actual xhtml mimetype | Bryan Newbold | 2020-11-16 | 1 | -2/+2
|
* html ingest: remaining implementation | Bryan Newbold | 2020-11-06 | 1 | -22/+19
|
* ingest: progress on HTML ingest | Bryan Newbold | 2020-11-05 | 1 | -14/+30
|
* ingest: initial 'web' worker implementation | Bryan Newbold | 2020-11-05 | 1 | -66/+258
|
* ingest: whitelist -> allowlist | Bryan Newbold | 2020-11-05 | 1 | -3/+3
|
* ingest: basic checks for ingest_type | Bryan Newbold | 2020-11-05 | 1 | -3/+29
|
* lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 1 | -6/+1
|
* ingest importer: check that stage is consistent with release | Bryan Newbold | 2020-05-26 | 1 | -0/+5
|
* importers: clarify handling of ApiException | Bryan Newbold | 2020-05-22 | 1 | -0/+1
|   One of these (in the ingest importer pipeline) is an actual bug; the others just change the syntax to be more explicit/conservative. The ingest importer bug seems to have resulted in some bad file match imports; scale of impact is unknown.
* ingest importer: don't use glutton matches | Bryan Newbold | 2020-05-22 | 1 | -3/+3
|   Until reviewing I didn't realize we were even doing this currently. Hopefully this has not impacted too many imports: almost all ingests use an external identifier, so only those with identifiers not in fatcat, for whatever reason, would be affected.
* ingest import: fix edit_extra path | Bryan Newbold | 2020-02-18 | 1 | -1/+1
|
* ingest importer: edit_extra is a top-level key | Bryan Newbold | 2020-02-18 | 1 | -1/+1
|
* ingest import: allow short version of corpus names | Bryan Newbold | 2020-02-18 | 1 | -0/+3
|
* ingest importer: pass through link rel | Bryan Newbold | 2020-02-18 | 1 | -1/+6
|
* check ingest_request_source existence for SPN as well as ingest | Bryan Newbold | 2020-02-06 | 1 | -0/+3
|
* additional trusted link sources | Bryan Newbold | 2020-02-06 | 1 | -0/+3
|
* add mag and s2 as trusted link sources | Bryan Newbold | 2020-02-06 | 1 | -1/+1
|
* ingest worker: handle missing ingest_request_source | Bryan Newbold | 2020-02-06 | 1 | -0/+3
|   Seeing a bunch of these due to re-ingests not including this field because of an earlier persist bug.
* fix trivial typo in file importer | Bryan Newbold | 2020-01-20 | 1 | -1/+1
|
* ingest: improve tests, support old ingest results | Bryan Newbold | 2020-01-15 | 1 | -3/+12
|
* update ingest worker for schema tweaks | Bryan Newbold | 2020-01-15 | 1 | -8/+15
|   Should be backwards compatible with old ingest results. Fixed a bug with glutton ident detection.