| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | WIP: rel fixes | Bryan Newbold | 2021-10-14 | 1 | -6/+6 |
| | |||||
* | fileset ingest small tweaks | Bryan Newbold | 2021-10-14 | 1 | -21/+36 |
| | |||||
* | initial implementation of fileset ingest importers | Bryan Newbold | 2021-10-14 | 1 | -2/+223 |
| | |||||
* | new SPN web (html) importer | Bryan Newbold | 2021-10-01 | 1 | -26/+80 |
| | |||||
* | ingest importer behavior tweaks | Bryan Newbold | 2021-10-01 | 1 | -8/+8 |
| | change order of 'want()' checks, so that result counts are clearer; don't require GROBID success for file imports with SPN |||||
* | more consistent and defensive lower-casing of DOIs | Bryan Newbold | 2021-06-23 | 1 | -0/+4 |
| | After noticing more upper/lower-case ambiguity in production. In particular, some old ingest requests in the sandcrawler DB, which get re-submitted/re-tried, have capitalized DOIs in the link source id field. |||||
* | ingest: swap ingest and file checks, to result in clearer stats/counts of skipping | Bryan Newbold | 2021-06-03 | 1 | -2/+2 |
| | |||||
* | ingest: don't accept mag and s2 URLs | Bryan Newbold | 2021-06-03 | 1 | -4/+4 |
| | |||||
* | web ingest: terminal URL mismatch as skip, not assert | Bryan Newbold | 2020-12-30 | 1 | -1/+3 |
| | |||||
* | ingest: allow dblp imports | Bryan Newbold | 2020-12-23 | 1 | -1/+1 |
| | |||||
* | add dblp as an ingest source and identifier | Bryan Newbold | 2020-12-17 | 1 | -1/+2 |
| | |||||
* | ingest: allow doaj ingest responses | Bryan Newbold | 2020-12-17 | 1 | -1/+2 |
| | |||||
* | html ingest: small fixes to try_update() code path | Bryan Newbold | 2020-12-15 | 1 | -5/+5 |
| | Don't currently have test coverage for most try_update() code; run the inserts manually in testing. |||||
* | html ingest: actual xhtml mimetype | Bryan Newbold | 2020-11-16 | 1 | -2/+2 |
| | |||||
* | html ingest: remaining implementation | Bryan Newbold | 2020-11-06 | 1 | -22/+19 |
| | |||||
* | ingest: progress on HTML ingest | Bryan Newbold | 2020-11-05 | 1 | -14/+30 |
| | |||||
* | ingest: initial 'web' worker implementation | Bryan Newbold | 2020-11-05 | 1 | -66/+258 |
| | |||||
* | ingest: whitelist -> allowlist | Bryan Newbold | 2020-11-05 | 1 | -3/+3 |
| | |||||
* | ingest: basic checks for ingest_type | Bryan Newbold | 2020-11-05 | 1 | -3/+29 |
| | |||||
* | lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 1 | -6/+1 |
| | |||||
* | ingest importer: check that stage is consistent with release | Bryan Newbold | 2020-05-26 | 1 | -0/+5 |
| | |||||
* | importers: clarify handling of ApiException | Bryan Newbold | 2020-05-22 | 1 | -0/+1 |
| | One of these (in the ingest importer pipeline) is an actual bug; the others just change the syntax to be more explicit/conservative. The ingest importer bug seems to have resulted in some bad file match imports; scale of impact is unknown. |||||
* | ingest importer: don't use glutton matches | Bryan Newbold | 2020-05-22 | 1 | -3/+3 |
| | Until reviewing I didn't realize we were even doing this currently. Hopefully this has not impacted too many imports, as almost all ingests use an external identifier, so only those with identifiers not in fatcat, for whatever reason, would be affected. |||||
* | ingest import: fix edit_extra path | Bryan Newbold | 2020-02-18 | 1 | -1/+1 |
| | |||||
* | ingest importer: edit_extra is a top-level key | Bryan Newbold | 2020-02-18 | 1 | -1/+1 |
| | |||||
* | ingest import: allow short version of corpus names | Bryan Newbold | 2020-02-18 | 1 | -0/+3 |
| | |||||
* | ingest importer: pass through link rel | Bryan Newbold | 2020-02-18 | 1 | -1/+6 |
| | |||||
* | check ingest_request_source existence for SPN as well as ingest | Bryan Newbold | 2020-02-06 | 1 | -0/+3 |
| | |||||
* | additional trusted link sources | Bryan Newbold | 2020-02-06 | 1 | -0/+3 |
| | |||||
* | add mag and s2 as trusted link sources | Bryan Newbold | 2020-02-06 | 1 | -1/+1 |
| | |||||
* | ingest worker: handle missing ingest_request_source | Bryan Newbold | 2020-02-06 | 1 | -0/+3 |
| | Seeing a bunch of these due to re-ingests not including this field because of an earlier persist bug. |||||
* | fix trivial typo in file importer | Bryan Newbold | 2020-01-20 | 1 | -1/+1 |
| | |||||
* | ingest: improve tests, support old ingest results | Bryan Newbold | 2020-01-15 | 1 | -3/+12 |
| | |||||
* | update ingest worker for schema tweaks | Bryan Newbold | 2020-01-15 | 1 | -8/+15 |
| | Should be backwards compatible with old ingest results. Fixed a bug with glutton ident detection. |||||
* | ingest: allow more sources to auto-import | Bryan Newbold | 2020-01-15 | 1 | -1/+2 |
| | |||||
* | importers: control update behavior with more-standard flag | Bryan Newbold | 2020-01-06 | 1 | -1/+1 |
| | |||||
* | allow arabesque backfill ingests for some source types | Bryan Newbold | 2019-12-24 | 1 | -0/+5 |
| | |||||
* | fix spn/ingest importer duplication check | Bryan Newbold | 2019-12-22 | 1 | -6/+8 |
| | Check was happening after the `return True` by mistake, allowing duplicates in SPN editgroups, and potentially in ingest request editgroups as well. |||||
* | add ingest import file collision protection | Bryan Newbold | 2019-12-13 | 1 | -0/+6 |
| | The common case is the same URL being submitted repeatedly during testing. This protection is only within-editgroup, and per importer (e.g., it won't work across spn importer "submitted" editgroups), but is better than nothing. |||||
* | update ingest request schema | Bryan Newbold | 2019-12-13 | 1 | -2/+7 |
| | This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups. |||||
* | remove default mimetype from ingest-file importer | Bryan Newbold | 2019-12-13 | 1 | -2/+1 |
| | We really should just use file_meta result or nothing. |||||
* | savepapernow result importer | Bryan Newbold | 2019-12-12 | 1 | -3/+64 |
| | Based on ingest-file-results importer |||||
* | add another ingest request source to whitelist | Bryan Newbold | 2019-12-10 | 1 | -2/+5 |
| | |||||
* | tweaks to file ingest importer | Bryan Newbold | 2019-12-03 | 1 | -3/+4 |
| | allow overriding source filter whitelist (common case for CLI use); fix editgroup description env variable pass-through |||||
* | re-order ingest want() for better stats | Bryan Newbold | 2019-11-15 | 1 | -7/+10 |
| | |||||
* | project -> ingest_request_source | Bryan Newbold | 2019-11-15 | 1 | -6/+6 |
| | |||||
* | ingest importer fixes | Bryan Newbold | 2019-11-15 | 1 | -3/+4 |
| | |||||
* | more ingest importer comments and counts | Bryan Newbold | 2019-11-15 | 1 | -1/+28 |
| | |||||
* | ingest file result importer | Bryan Newbold | 2019-11-15 | 1 | -0/+134 |
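
Several of the commits above describe the same pattern: normalize identifiers defensively, count each kind of skip, and run the within-editgroup duplicate check before accepting a row (the 2019-12-22 fix notes the check had been placed after `return True`). The following is a minimal sketch of that pattern, not the actual importer code. The class name `DedupeFilter`, the `file_key` field, the counter labels, and `start_new_editgroup()` are all hypothetical names chosen for illustration.

```python
from collections import Counter


def normalize_doi(raw):
    """Trim and lower-case a DOI so differently-cased copies compare equal."""
    return raw.strip().lower()


class DedupeFilter:
    """Hypothetical stand-in for an importer's want() logic (illustration only)."""

    def __init__(self):
        # One counter per outcome, so skip statistics stay readable.
        self.counts = Counter()
        # Keys already accepted within the current editgroup.
        self._seen_in_editgroup = set()

    def want(self, row):
        key = row.get("file_key")  # assumed field name, not from the source
        if not key:
            self.counts["skip-no-key"] += 1
            return False
        # The duplicate check has to run before the row is accepted;
        # placing it after `return True` would make it unreachable.
        if key in self._seen_in_editgroup:
            self.counts["skip-duplicate"] += 1
            return False
        self._seen_in_editgroup.add(key)
        self.counts["accepted"] += 1
        return True

    def start_new_editgroup(self):
        # Protection is only within-editgroup, so the memory is reset here.
        self._seen_in_editgroup.clear()
```

Under these assumptions, submitting the same key twice in one editgroup accepts the first row and skips the second, while the same key is accepted again after `start_new_editgroup()`, which mirrors the "within-editgroup, per importer" limitation noted in the 2019-12-13 commit.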