path: root/python/fatcat_tools/importers/ingest.py
Commit message | Author | Age | Files | Lines
* new SPN web (html) importer | Bryan Newbold | 2021-10-01 | 1 | -26/+80
|
* ingest importer behavior tweaks | Bryan Newbold | 2021-10-01 | 1 | -8/+8
|     - change order of 'want()' checks, so that result counts are clearer
|     - don't require GROBID success for file imports with SPN
* more consistent and defensive lower-casing of DOIs | Bryan Newbold | 2021-06-23 | 1 | -0/+4
|     After noticing more upper/lower ambiguity in production. In particular, we have some old ingest requests in sandcrawler DB, which get re-submitted/re-tried, which have capitalized DOIs in the link source id field.
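The defensive lower-casing described in this commit message could look roughly like the sketch below. This is not the code from `ingest.py`: the function name `normalize_link_source_id` is hypothetical, and the assumption that only the `doi` link source is lower-cased is made for illustration.

```python
from typing import Optional


def normalize_link_source_id(link_source: Optional[str],
                             link_source_id: Optional[str]) -> Optional[str]:
    """Hypothetical helper: lower-case DOIs found in the link source id field.

    Assumption: only requests whose link source is 'doi' carry a DOI here;
    identifiers from other sources are passed through unchanged.
    """
    if link_source == "doi" and link_source_id:
        # DOIs are case-insensitive, so one canonical lower-case form
        # avoids upper/lower mismatches when looking up releases.
        return link_source_id.strip().lower()
    return link_source_id


print(normalize_link_source_id("doi", "10.1234/ABC.Def"))
```

Normalizing at the point where a request is read, rather than at lookup time, means re-submitted old requests are treated the same way as new ones.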
* ingest: swap ingest and file checks, to result in clearer stats/counts of skipping | Bryan Newbold | 2021-06-03 | 1 | -2/+2
|
* ingest: don't accept mag and s2 URLs | Bryan Newbold | 2021-06-03 | 1 | -4/+4
|
* web ingest: terminal URL mismatch as skip, not assert | Bryan Newbold | 2020-12-30 | 1 | -1/+3
|
* ingest: allow dblp imports | Bryan Newbold | 2020-12-23 | 1 | -1/+1
|
* add dblp as an ingest source and identifier | Bryan Newbold | 2020-12-17 | 1 | -1/+2
|
* ingest: allow doaj ingest responses | Bryan Newbold | 2020-12-17 | 1 | -1/+2
|
* html ingest: small fixes to try_update() code path | Bryan Newbold | 2020-12-15 | 1 | -5/+5
|     Don't currently have test coverage for most try_update() code; run the inserts manually in testing.
* html ingest: actual xhtml mimetype | Bryan Newbold | 2020-11-16 | 1 | -2/+2
|
* html ingest: remaining implementation | Bryan Newbold | 2020-11-06 | 1 | -22/+19
|
* ingest: progress on HTML ingest | Bryan Newbold | 2020-11-05 | 1 | -14/+30
|
* ingest: initial 'web' worker implementation | Bryan Newbold | 2020-11-05 | 1 | -66/+258
|
* ingest: whitelist -> allowlist | Bryan Newbold | 2020-11-05 | 1 | -3/+3
|
* ingest: basic checks for ingest_type | Bryan Newbold | 2020-11-05 | 1 | -3/+29
|
* lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 1 | -6/+1
|
* ingest importer: check that stage is consistent with release | Bryan Newbold | 2020-05-26 | 1 | -0/+5
|
* importers: clarify handling of ApiException | Bryan Newbold | 2020-05-22 | 1 | -0/+1
|     One of these (in ingest importer pipeline) is an actual bug, the others are just changing the syntax to be more explicit/conservative. The ingest importer bug seems to have resulted in some bad file match imports; scale of impact is unknown.
* ingest importer: don't use glutton matches | Bryan Newbold | 2020-05-22 | 1 | -3/+3
|     Until reviewing I didn't realize we were even doing this currently. Hopefully this has not impacted too many imports: almost all ingests use an external identifier, so only those with identifiers not in fatcat, for whatever reason, would be affected.
* ingest import: fix edit_extra path | Bryan Newbold | 2020-02-18 | 1 | -1/+1
|
* ingest importer: edit_extra is a top-level key | Bryan Newbold | 2020-02-18 | 1 | -1/+1
|
* ingest import: allow short version of corpus names | Bryan Newbold | 2020-02-18 | 1 | -0/+3
|
* ingest importer: pass through link rel | Bryan Newbold | 2020-02-18 | 1 | -1/+6
|
* check ingest_request_source existence for SPN as well as ingest | Bryan Newbold | 2020-02-06 | 1 | -0/+3
|
* additional trusted link sources | Bryan Newbold | 2020-02-06 | 1 | -0/+3
|
* add mag and s2 as trusted link sources | Bryan Newbold | 2020-02-06 | 1 | -1/+1
|
* ingest worker: handle missing ingest_request_source | Bryan Newbold | 2020-02-06 | 1 | -0/+3
|     Seeing a bunch of these due to re-ingests not including this field because of an earlier persist bug.
* fix trivial typo in file importer | Bryan Newbold | 2020-01-20 | 1 | -1/+1
|
* ingest: improve tests, support old ingest results | Bryan Newbold | 2020-01-15 | 1 | -3/+12
|
* update ingest worker for schema tweaks | Bryan Newbold | 2020-01-15 | 1 | -8/+15
|     Should be backwards compatible with old ingest results. Fixed a bug with glutton ident detection.
* ingest: allow more sources to auto-import | Bryan Newbold | 2020-01-15 | 1 | -1/+2
|
* importers: control update behavior with more-standard flag | Bryan Newbold | 2020-01-06 | 1 | -1/+1
|
* allow arabesque backfill ingests for some source types | Bryan Newbold | 2019-12-24 | 1 | -0/+5
|
* fix spn/ingest importer duplication check | Bryan Newbold | 2019-12-22 | 1 | -6/+8
|     Check was happening after the `return True` by mistake, allowing duplicates in SPN editgroups, and potentially in ingest request editgroups as well.
* add ingest import file collision protection | Bryan Newbold | 2019-12-13 | 1 | -0/+6
|     The common case is the same URL being submitted repeatedly during testing. This is only within-editgroup, and per importer (eg, won't work across spn importer "submitted" editgroups), but is better than nothing.
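A within-editgroup collision check of the kind this message describes might be sketched as follows. The class and method names are hypothetical, and keying on a file hash string is an assumption; the actual importer may track a different key.

```python
class EditgroupCollisionGuard:
    """Hypothetical sketch: reject files already queued in the current editgroup.

    Assumption: each file is identified by a hash string (for example a SHA-1
    hex digest). State is per importer instance, so it cannot catch
    collisions across importers or across editgroups.
    """

    def __init__(self) -> None:
        self._seen_keys = set()

    def want(self, file_key: str) -> bool:
        # The duplicate check has to run before any path that returns True;
        # otherwise duplicates slip into the editgroup.
        if file_key in self._seen_keys:
            return False
        self._seen_keys.add(file_key)
        return True

    def start_new_editgroup(self) -> None:
        # Protection is only within-editgroup, so forget everything here.
        self._seen_keys.clear()


guard = EditgroupCollisionGuard()
first = guard.want("abc123")   # first submission of this file
second = guard.want("abc123")  # same file submitted again
```

In this sketch `first` is accepted and `second` is rejected; after `start_new_editgroup()` the same key would be accepted again, which matches the "better than nothing" scope stated above.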
* update ingest request schema | Bryan Newbold | 2019-12-13 | 1 | -2/+7
|     This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups.
* remove default mimetype from ingest-file importer | Bryan Newbold | 2019-12-13 | 1 | -2/+1
|     We really should just use file_meta result or nothing.
* savepapernow result importer | Bryan Newbold | 2019-12-12 | 1 | -3/+64
|     Based on ingest-file-results importer
* add another ingest request source to whitelist | Bryan Newbold | 2019-12-10 | 1 | -2/+5
|
* tweaks to file ingest importer | Bryan Newbold | 2019-12-03 | 1 | -3/+4
|     - allow overriding source filter whitelist (common case for CLI use)
|     - fix editgroup description env variable pass-through
* re-order ingest want() for better stats | Bryan Newbold | 2019-11-15 | 1 | -7/+10
|
* project -> ingest_request_source | Bryan Newbold | 2019-11-15 | 1 | -6/+6
|
* ingest importer fixes | Bryan Newbold | 2019-11-15 | 1 | -3/+4
|
* more ingest importer comments and counts | Bryan Newbold | 2019-11-15 | 1 | -1/+28
|
* ingest file result importer | Bryan Newbold | 2019-11-15 | 1 | -0/+134