Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | imports: generic file cleanup removes exact duplicate URLs | Bryan Newbold | 2021-11-09 | 1 | -0/+9 |
| | |||||
* | datacite importer: remove unused 'year_only' variable | Bryan Newbold | 2021-11-03 | 1 | -2/+3 |
| | |||||
* | datacite: add comment about potential date parsing bug | Bryan Newbold | 2021-11-03 | 1 | -0/+1 |
| | |||||
* | datacite importer: dateparser.date.DateDataParser() | Bryan Newbold | 2021-11-03 | 1 | -1/+1 |
| | | | | Perhaps this was a change when upgrading 'dateparser'? | ||||
* | more involved type wrangling and fixes for importers | Bryan Newbold | 2021-11-03 | 3 | -12/+14 |
| | |||||
* | typing: relatively simple type check fixes | Bryan Newbold | 2021-11-03 | 14 | -87/+82 |
| | | | | | | | These mostly add new variable names so that existing variables aren't overwritten with a new type; delay coercing '{}' or '[]' to 'None' until the last minute; add is-not-None checks to conditional clauses; and similar small changes. | ||||
* | typing: initial annotations on importers | Bryan Newbold | 2021-11-03 | 22 | -274/+443 |
| | | | | | This commit just adds the type annotations; it doesn't fix code to make type checking pass. | ||||
* | importers: remove unused __main__ routine | Bryan Newbold | 2021-11-03 | 4 | -19/+0 |
| | | | | | | These perhaps were used in initial development or testing? fatcat_import.py is the correct way to do these imports, even for testing/development. | ||||
* | lint: resolve existing mypy type errors | Bryan Newbold | 2021-11-02 | 3 | -22/+27 |
| | | | | | | | | | Adds annotations and reworks dataflow to satisfy existing mypy issues, without adding any additional type annotations to, eg, function signatures. There will probably be many more type errors when annotations are all added. | ||||
* | re-fix some lint issues after big 'fmt' | Bryan Newbold | 2021-11-02 | 1 | -2/+2 |
| | |||||
* | fmt (black): fatcat_tools/ | Bryan Newbold | 2021-11-02 | 22 | -2115/+2578 |
| | |||||
* | python: isort everything | Bryan Newbold | 2021-11-02 | 17 | -41/+70 |
| | |||||
* | arabesque import 'hit' field is 1/0, not true/false | Bryan Newbold | 2021-11-02 | 1 | -2/+2 |
| | |||||
* | lint: simple, safe inline lint fixes | Bryan Newbold | 2021-11-02 | 12 | -22/+21 |
| | | | | '==' vs 'is'; 'not a in b' vs 'a not in b'; etc | ||||
* | lint/fmt: remove all 'import *' | Bryan Newbold | 2021-11-02 | 5 | -21/+41 |
| | |||||
* | re-fmt all the fatcat_tools __init__ files for readability | Bryan Newbold | 2021-11-02 | 1 | -17/+39 |
| | |||||
* | small python tweaks for annotations, imports | Bryan Newbold | 2021-11-02 | 2 | -2/+6 |
| | |||||
* | try some type annotations | Bryan Newbold | 2021-11-02 | 2 | -55/+63 |
| | |||||
* | fix missing variable in fileset ingest | Bryan Newbold | 2021-11-02 | 1 | -2/+1 |
| | |||||
* | WIP: more fileset ingest | Bryan Newbold | 2021-10-18 | 1 | -13/+21 |
| | |||||
* | WIP: rel fixes | Bryan Newbold | 2021-10-14 | 1 | -6/+6 |
| | |||||
* | fileset ingest small tweaks | Bryan Newbold | 2021-10-14 | 1 | -21/+36 |
| | |||||
* | initial implementation of fileset ingest importers | Bryan Newbold | 2021-10-14 | 2 | -3/+224 |
| | |||||
* | generic fileset importer class, with test coverage | Bryan Newbold | 2021-10-14 | 3 | -0/+88 |
| | |||||
* | dblp import: basic support for handles as identifiers | Bryan Newbold | 2021-10-13 | 1 | -1/+5 |
| | |||||
* | dblp import: fix typos in identifier parsing | Bryan Newbold | 2021-10-13 | 1 | -2/+1 |
| | |||||
* | python: partial importer utilization of new schema changes | Bryan Newbold | 2021-10-13 | 3 | -6/+18 |
| | |||||
* | Merge branch 'bnewbold-ingest-tweaks' into 'master' | bnewbold | 2021-10-02 | 3 | -39/+106 |
|\ | | | | | | | | | ingest importer behavior tweaks See merge request webgroup/fatcat!120 | ||||
| * | kafka import: optional 'force-flush' mode for some importers | Bryan Newbold | 2021-10-01 | 1 | -0/+13 |
| | | | | | | | | Behavior and motivation described in the kafka json import comment. | ||||
| * | new SPN web (html) importer | Bryan Newbold | 2021-10-01 | 2 | -27/+81 |
| | | |||||
| * | ingest importer behavior tweaks | Bryan Newbold | 2021-10-01 | 1 | -8/+8 |
| | | | | | | | | | | - change order of 'want()' checks, so that result counts are clearer - don't require GROBID success for file imports with SPN | ||||
| * | importer common: more verbose logging (with counts) | Bryan Newbold | 2021-10-01 | 1 | -4/+4 |
| | | |||||
* | | datacite: skip empty abstracts | Martin Czygan | 2021-10-01 | 1 | -1/+4 |
|/ | | | | | Do not add abstracts where `clean` results in the empty string - this violates a constraint: `either abstract_sha1 or content is required` | ||||
* | more consistent and defensive lower-casing of DOIs | Bryan Newbold | 2021-06-23 | 2 | -1/+6 |
| | | | | | | | After noticing more upper/lower ambiguity in production. In particular, we have some old ingest requests in sandcrawler DB, which get re-submitted/re-tried, which have capitalized DOIs in the link source id field. | ||||
* | datacite: more careful title string access; fixes sentry #88350 | Martin Czygan | 2021-06-11 | 1 | -1/+1 |
| | | | | | Caused by a partial "title entry without title" coming *first* (e.g. one holding just a language, like: {'lang': 'da'}) | ||||
* | ingest: swap ingest and file checks, to result in clearer stats/counts of skipping | Bryan Newbold | 2021-06-03 | 1 | -2/+2 |
| | |||||
* | ingest: don't accept mag and s2 URLs | Bryan Newbold | 2021-06-03 | 1 | -4/+4 |
| | |||||
* | small python lint fixes (no behavior change) | Bryan Newbold | 2021-05-25 | 1 | -2/+0 |
| | |||||
* | arabesque importer: ensure full 14-digit timestamps | Bryan Newbold | 2021-05-21 | 1 | -1/+3 |
| | |||||
* | datacite: a missing surname should be None, not the empty string | Martin Czygan | 2021-04-02 | 1 | -2/+1 |
| | | | | refs sentry #77700 | ||||
* | web ingest: terminal URL mismatch as skip, not assert | Bryan Newbold | 2020-12-30 | 1 | -1/+3 |
| | |||||
* | dblp release import: skip arxiv_id releases | Bryan Newbold | 2020-12-24 | 1 | -0/+9 |
| | |||||
* | dblp import: fix arxiv_id typo | Bryan Newbold | 2020-12-23 | 1 | -1/+1 |
| | | | | Would have been caught by mypy! | ||||
* | ingest: allow dblp imports | Bryan Newbold | 2020-12-23 | 1 | -1/+1 |
| | |||||
* | fuzzy: set 120 second timeout on ES lookups | Bryan Newbold | 2020-12-23 | 1 | -1/+1 |
| | |||||
* | dblp: polish HTML scrape/extract pipeline | Bryan Newbold | 2020-12-17 | 1 | -0/+14 |
| | |||||
* | dblp: flesh out update code path (especially to add container_id linkage) | Bryan Newbold | 2020-12-17 | 1 | -2/+6 |
| | |||||
* | dblp: run fuzzy matching at try_update time (same as DOAJ) | Bryan Newbold | 2020-12-17 | 1 | -1/+8 |
| | |||||
* | improve dblp release import | Bryan Newbold | 2020-12-17 | 1 | -1/+2 |
| | |||||
* | very simple dblp container importer | Bryan Newbold | 2020-12-17 | 2 | -0/+145 |
| |