Commit message | Author | Date | Files | Lines |
---|---|---|---|---|
* | python: isort everything | Bryan Newbold | 2021-11-02 | 17 | -41/+70 |
* | arabesque import 'hit' field is 1/0, not true/false | Bryan Newbold | 2021-11-02 | 1 | -2/+2 |
* | lint: simple, safe inline lint fixes | Bryan Newbold | 2021-11-02 | 12 | -22/+21 |
    '==' vs 'is'; 'not a in b' vs 'a not in b'; etc
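For readers unfamiliar with these lint rules, the generic before/after looks like this (toy variables, not fatcat code):

```python
value = None
items = ["a", "b"]

# before: 'if value == None:' and 'if not "c" in items:'
# after: identity check for None, and the idiomatic membership negation
if value is None:
    print("value is unset")
if "c" not in items:
    print("'c' is missing")
```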
* | lint/fmt: remove all 'import *' | Bryan Newbold | 2021-11-02 | 5 | -21/+41 |
* | re-fmt all the fatcat_tools __init__ files for readability | Bryan Newbold | 2021-11-02 | 1 | -17/+39 |
* | small python tweaks for annotations, imports | Bryan Newbold | 2021-11-02 | 2 | -2/+6 |
* | try some type annotations | Bryan Newbold | 2021-11-02 | 2 | -55/+63 |
* | fix missing variable in fileset ingest | Bryan Newbold | 2021-11-02 | 1 | -2/+1 |
* | WIP: more fileset ingest | Bryan Newbold | 2021-10-18 | 1 | -13/+21 |
* | WIP: rel fixes | Bryan Newbold | 2021-10-14 | 1 | -6/+6 |
* | fileset ingest small tweaks | Bryan Newbold | 2021-10-14 | 1 | -21/+36 |
* | initial implementation of fileset ingest importers | Bryan Newbold | 2021-10-14 | 2 | -3/+224 |
* | generic fileset importer class, with test coverage | Bryan Newbold | 2021-10-14 | 3 | -0/+88 |
* | dblp import: basic support for handles as identifiers | Bryan Newbold | 2021-10-13 | 1 | -1/+5 |
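A handle identifier usually shows up in dblp `ee` fields as a hdl.handle.net URL; a minimal sketch of extracting it (the helper and the accepted prefixes are assumptions, not the importer's exact parsing):

```python
def parse_handle_from_url(url: str):
    # handle.net resolver URLs carry the raw handle after the host part
    for prefix in ("https://hdl.handle.net/", "http://hdl.handle.net/"):
        if url.startswith(prefix):
            return url[len(prefix):].strip("/").lower() or None
    return None

print(parse_handle_from_url("https://hdl.handle.net/10012/9999"))  # 10012/9999
```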
* | dblp import: fix typos in identifier parsing | Bryan Newbold | 2021-10-13 | 1 | -2/+1 |
* | python: partial importer utilization of new schema changes | Bryan Newbold | 2021-10-13 | 3 | -6/+18 |
* | Merge branch 'bnewbold-ingest-tweaks' into 'master' | bnewbold | 2021-10-02 | 3 | -39/+106 |
    ingest importer behavior tweaks
    See merge request webgroup/fatcat!120
| * | kafka import: optional 'force-flush' mode for some importers | Bryan Newbold | 2021-10-01 | 1 | -0/+13 |
    Behavior and motivation described in the kafka json import comment.
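The general idea of a force-flush mode is a time-based flush alongside the usual size-based one, so slow topics still get their partial batches pushed through; this is a generic sketch, not the actual fatcat Kafka pusher:

```python
import time

def push_batches(messages, push_batch, batch_size=100, force_flush_sec=300):
    """Flush on batch size as usual, but also flush partial batches after a timeout."""
    batch = []
    last_flush = time.time()
    for msg in messages:
        batch.append(msg)
        if len(batch) >= batch_size or time.time() - last_flush > force_flush_sec:
            push_batch(batch)
            batch = []
            last_flush = time.time()
    if batch:
        push_batch(batch)

# toy usage: messages arrive quickly, so only size-based flushes trigger here
push_batches(range(250), lambda b: print(f"flushed {len(b)} messages"))
```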
| * | new SPN web (html) importer | Bryan Newbold | 2021-10-01 | 2 | -27/+81 |
| * | ingest importer behavior tweaks | Bryan Newbold | 2021-10-01 | 1 | -8/+8 |
    - change order of 'want()' checks, so that result counts are clearer
    - don't require GROBID success for file imports with SPN
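The first bullet is about making skip counters trustworthy: broad checks need to run before narrow ones so each counter means what its name says. A schematic sketch, not the importer's actual want() method (field and counter names are illustrative):

```python
from collections import Counter

counts = Counter()

def want(row: dict) -> bool:
    # broad request-level filters run first, so 'skip-ingest-type' counts
    # every non-file row rather than only the ones that also failed a later check
    if row.get("ingest_type") not in ("pdf", "xml", "html"):
        counts["skip-ingest-type"] += 1
        return False
    if not row.get("hit"):
        counts["skip-hit"] += 1
        return False
    return True

rows = [{"ingest_type": "pdf", "hit": True}, {"ingest_type": "dataset"}, {"ingest_type": "pdf"}]
print([want(r) for r in rows], dict(counts))
```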
| * | importer common: more verbose logging (with counts) | Bryan Newbold | 2021-10-01 | 1 | -4/+4 |
* | | datacite: skip empty abstracts | Martin Czygan | 2021-10-01 | 1 | -1/+4 |
    Do not add abstracts where `clean` results in the empty string; this
    violates a constraint: `either abstract_sha1 or content is required`.
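Since the API rejects an abstract entity whose content cleans down to nothing, the guard has to come before the entity is built. A self-contained sketch, with clean() standing in for the importer's string-cleaning helper:

```python
def clean(s):
    # stand-in for the importer's string-cleaning helper
    return s.strip() if s else ""

abstracts = []
for text in ["   ", None, "An actual abstract."]:
    text = clean(text)
    if not text:
        # an empty string would violate "either abstract_sha1 or content is required"
        continue
    abstracts.append({"mimetype": "text/plain", "content": text})

print(abstracts)  # only the real abstract survives
```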
* | more consistent and defensive lower-casing of DOIs | Bryan Newbold | 2021-06-23 | 2 | -1/+6 |
    After noticing more upper/lower ambiguity in production. In particular,
    some old ingest requests in the sandcrawler DB, which get
    re-submitted/re-tried, have capitalized DOIs in the link source id field.
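Defensive lower-casing usually means normalizing at every entry point rather than trusting upstream; a sketch of that kind of helper (clean_doi here is illustrative, not the exact fatcat function):

```python
from typing import Optional

def clean_doi(raw: Optional[str]) -> Optional[str]:
    # illustrative normalizer, not the exact fatcat helper
    if not raw:
        return None
    doi = raw.strip().lower()
    # all DOIs start with the "10." directory indicator and contain a slash
    if not doi.startswith("10.") or "/" not in doi:
        return None
    return doi

assert clean_doi("10.1234/ABC.Def") == "10.1234/abc.def"
assert clean_doi("not-a-doi") is None
```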
* | datacite: more careful title string access; fixes sentry #88350 | Martin Czygan | 2021-06-11 | 1 | -1/+1 |
    Caused by a partial title entry (one without a 'title' key) coming
    *first*, e.g. an entry holding only a language, like {'lang': 'da'}.
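The defensive version of this access picks the first titles entry that actually has a title, rather than indexing entry zero. Illustrative only:

```python
titles = [{"lang": "da"}, {"title": "  An Actual Title "}]

title = None
for entry in titles:
    # skip partial entries like {'lang': 'da'} that carry no 'title' key
    if entry.get("title"):
        title = entry["title"].strip()
        break

print(title)  # "An Actual Title"
```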
* | ingest: swap ingest and file checks, to result in clearer stats/counts of skipping | Bryan Newbold | 2021-06-03 | 1 | -2/+2 |
* | ingest: don't accept mag and s2 URLs | Bryan Newbold | 2021-06-03 | 1 | -4/+4 |
* | small python lint fixes (no behavior change) | Bryan Newbold | 2021-05-25 | 1 | -2/+0 |
* | arabesque importer: ensure full 14-digit timestamps | Bryan Newbold | 2021-05-21 | 1 | -1/+3 |
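Capture timestamps in these rows follow the 14-digit YYYYMMDDHHMMSS form, so shorter values need padding out; the padding rule below is one plausible default, not necessarily what the importer does:

```python
# default digits for the missing month/day/time components (Jan 1, 00:00:00)
TIMESTAMP_TEMPLATE = "00000101000000"

def pad_timestamp(ts) -> str:
    ts = str(ts)
    if len(ts) >= 14:
        return ts[:14]
    return ts + TIMESTAMP_TEMPLATE[len(ts):]

assert pad_timestamp("2019") == "20190101000000"
assert pad_timestamp(20190523120000) == "20190523120000"
```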
* | datacite: a missing surname should be None, not the empty string | Martin Czygan | 2021-04-02 | 1 | -2/+1 |
    refs sentry #77700
* | web ingest: terminal URL mismatch as skip, not assert | Bryan Newbold | 2020-12-30 | 1 | -1/+3 |
* | dblp release import: skip arxiv_id releases | Bryan Newbold | 2020-12-24 | 1 | -0/+9 |
* | dblp import: fix arxiv_id typo | Bryan Newbold | 2020-12-23 | 1 | -1/+1 |
    Would have been caught by mypy!
* | ingest: allow dblp imports | Bryan Newbold | 2020-12-23 | 1 | -1/+1 |
* | fuzzy: set 120 second timeout on ES lookups | Bryan Newbold | 2020-12-23 | 1 | -1/+1 |
* | dblp: polish HTML scrape/extract pipeline | Bryan Newbold | 2020-12-17 | 1 | -0/+14 |
* | dblp: flesh out update code path (especially to add container_id linkage) | Bryan Newbold | 2020-12-17 | 1 | -2/+6 |
* | dblp: run fuzzy matching at try_update time (same as DOAJ) | Bryan Newbold | 2020-12-17 | 1 | -1/+8 |
* | improve dblp release import | Bryan Newbold | 2020-12-17 | 1 | -1/+2 |
* | very simple dblp container importer | Bryan Newbold | 2020-12-17 | 2 | -0/+145 |
* | dblp release importer: container_id lookup TSV, and dump JSON mode | Bryan Newbold | 2020-12-17 | 1 | -10/+66 |
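Loading a lookup TSV like the one this commit adds could look roughly like the sketch below; the two-column prefix-to-ident layout is an assumption, not a documented format:

```python
import csv

def load_dblp_container_map(path: str) -> dict:
    # assumed layout: two tab-separated columns, dblp key prefix then
    # container ident (e.g. "journals/cacm<TAB>aaaaaaaaaaaa")
    mapping = {}
    with open(path, "r") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2 and row[0] and not row[0].startswith("#"):
                mapping[row[0]] = row[1]
    return mapping
```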
* | initial implementation of dblp release importer (in progress) | Bryan Newbold | 2020-12-17 | 2 | -0/+445 |
* | add 'lxml' mode for large XML file import, and multi-tags | Bryan Newbold | 2020-12-17 | 1 | -15/+28 |
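One standard way to stream a very large XML dump such as dblp.xml, selecting several record tags at once, is lxml's iterparse; this shows the general pattern rather than the pusher's exact implementation (the tag names are illustrative):

```python
from lxml import etree

def stream_records(path, tags=("article", "inproceedings", "phdthesis")):
    # iterparse yields elements as they complete; passing several tag names
    # covers the "multi-tags" case, and load_dtd resolves the dump's entities
    for _, elem in etree.iterparse(path, tag=tags, load_dtd=True):
        yield elem
        # clear handled elements so memory stays flat on multi-GB files
        elem.clear()
```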
* | add dblp as an ingest source and identifier | Bryan Newbold | 2020-12-17 | 1 | -1/+2 |
* | ingest: allow doaj ingest responses | Bryan Newbold | 2020-12-17 | 1 | -1/+2 |
* | update fuzzy helper to pass 'reason' through to import code | Bryan Newbold | 2020-12-17 | 1 | -3/+3 |
    The motivation for this change is to enable passing the 'reason' through
    to edit extra metadata, in cases where we merge or cluster releases.
* | add fuzzy match filtering to DOAJ importer | Bryan Newbold | 2020-12-16 | 1 | -2/+9 |
    In this default configuration, any entities with a fuzzy match (even
    "ambiguous") will be skipped at import time, to prevent creating
    duplicates. This is conservative towards not creating new/duplicate
    entities.
    In the future, as we get more confidence in fuzzy match/verification, we
    can start to ignore AMBIGUOUS, handle EXACT as the same release, and merge
    STRONG (and WEAK?) matches under the same work entity.
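The decision rule described above is deliberately blunt: any candidate at all, whatever its verification status, blocks the insert. A standalone sketch of that decision, with the matcher stubbed out (the real helper is backed by fuzzycat and the release search index):

```python
from collections import Counter

counts = Counter()

def fuzzy_match_stub(release: dict):
    # stand-in for the base-class helper; here everything with a title
    # gets one AMBIGUOUS candidate
    if release.get("title"):
        return [("AMBIGUOUS", "title/year heuristic", {"ident": "aaaaaaaaaaaa"})]
    return []

release = {"title": "Some Open Access Article", "year": 2020}
if fuzzy_match_stub(release):
    # even AMBIGUOUS counts as a match in this conservative configuration
    counts["skip-fuzzy-match"] += 1
else:
    counts["insert"] += 1

print(dict(counts))  # {'skip-fuzzy-match': 1}
```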
* | add fuzzy matching helper to importer base class | Bryan Newbold | 2020-12-16 | 1 | -2/+62 |
    Using fuzzycat. Add basic test coverage.
* | html ingest: small fixes to try_update() code path | Bryan Newbold | 2020-12-15 | 1 | -5/+5 |
    Don't currently have test coverage for most try_update() code; run the
    inserts manually in testing.
* | crossref+datacite: remove confusing early update bail | Bryan Newbold | 2020-11-20 | 2 | -4/+0 |
    Easy to miss that we skip updates *twice*, and with this early bailout
    we're not updating counts correctly.
* | doaj: fix update code path (getattr not __dict__) | Bryan Newbold | 2020-11-20 | 1 | -4/+3 |
    Also add missing code coverage for the update path (disabled by default).