| | Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|---|
| * | datacite importer: remove unused 'year_only' variable | Bryan Newbold | 2021-11-03 | 1 | -2/+3 | 
| | | |||||
| * | datacite: add comment about potential date parsing bug | Bryan Newbold | 2021-11-03 | 1 | -0/+1 | 
| | | |||||
| * | datacite importer: dateparser.date.DateDataParser() | Bryan Newbold | 2021-11-03 | 1 | -1/+1 | 
| | Perhaps this was a change when upgrading 'dateparser'? | | | | |
| * | more involved type wrangling and fixes for importers | Bryan Newbold | 2021-11-03 | 3 | -12/+14 | 
| | | |||||
| * | typing: relatively simple type check fixes | Bryan Newbold | 2021-11-03 | 14 | -87/+82 | 
| | These mostly add new variable names so that existing variables aren't overwritten with a new type; delay coercing '{}' or '[]' to 'None' until the last minute; add is-not-None checks to conditional clauses; and make similar small changes. | | | | |
| * | typing: initial annotations on importers | Bryan Newbold | 2021-11-03 | 22 | -274/+443 | 
| | This commit just adds the type annotations; it doesn't fix the code to make type checking pass. | | | | |
| * | importers: remove unused __main__ routine | Bryan Newbold | 2021-11-03 | 4 | -19/+0 | 
| | These perhaps were used in initial development or testing? fatcat_import.py is the correct way to do these imports, even for testing/development. | | | | |
| * | lint: resolve existing mypy type errors | Bryan Newbold | 2021-11-02 | 3 | -22/+27 | 
| | Adds annotations and re-works dataflow to satisfy existing mypy issues, without adding any additional type annotations to, e.g., function signatures. There will probably be many more type errors when annotations are all added. | | | | |
| * | re-fix some lint issues after big 'fmt' | Bryan Newbold | 2021-11-02 | 1 | -2/+2 | 
| | | |||||
| * | fmt (black): fatcat_tools/ | Bryan Newbold | 2021-11-02 | 22 | -2115/+2578 | 
| | | |||||
| * | python: isort everything | Bryan Newbold | 2021-11-02 | 17 | -41/+70 | 
| | | |||||
| * | arabesque import 'hit' field is 1/0, not true/false | Bryan Newbold | 2021-11-02 | 1 | -2/+2 | 
| | | |||||
| * | lint: simple, safe inline lint fixes | Bryan Newbold | 2021-11-02 | 12 | -22/+21 | 
| | '==' vs 'is'; 'not a in b' vs 'a not in b'; etc. | | | | |
| * | lint/fmt: remove all 'import *' | Bryan Newbold | 2021-11-02 | 5 | -21/+41 | 
| | | |||||
| * | re-fmt all the fatcat_tools __init__ files for readability | Bryan Newbold | 2021-11-02 | 1 | -17/+39 | 
| | | |||||
| * | small python tweaks for annotations, imports | Bryan Newbold | 2021-11-02 | 2 | -2/+6 | 
| | | |||||
| * | try some type annotations | Bryan Newbold | 2021-11-02 | 2 | -55/+63 | 
| | | |||||
| * | fix missing variable in fileset ingest | Bryan Newbold | 2021-11-02 | 1 | -2/+1 | 
| | | |||||
| * | WIP: more fileset ingest | Bryan Newbold | 2021-10-18 | 1 | -13/+21 | 
| | | |||||
| * | WIP: rel fixes | Bryan Newbold | 2021-10-14 | 1 | -6/+6 | 
| | | |||||
| * | fileset ingest small tweaks | Bryan Newbold | 2021-10-14 | 1 | -21/+36 | 
| | | |||||
| * | initial implementation of fileset ingest importers | Bryan Newbold | 2021-10-14 | 2 | -3/+224 | 
| | | |||||
| * | generic fileset importer class, with test coverage | Bryan Newbold | 2021-10-14 | 3 | -0/+88 | 
| | | |||||
| * | dblp import: basic support for handles as identifiers | Bryan Newbold | 2021-10-13 | 1 | -1/+5 | 
| | | |||||
| * | dblp import: fix typos in identifier parsing | Bryan Newbold | 2021-10-13 | 1 | -2/+1 | 
| | | |||||
| * | python: partial importer utilization of new schema changes | Bryan Newbold | 2021-10-13 | 3 | -6/+18 | 
| | | |||||
| * | Merge branch 'bnewbold-ingest-tweaks' into 'master' | bnewbold | 2021-10-02 | 3 | -39/+106 | 
| |\ | ingest importer behavior tweaks. See merge request webgroup/fatcat!120 | | | | |
| | * | kafka import: optional 'force-flush' mode for some importers | Bryan Newbold | 2021-10-01 | 1 | -0/+13 | 
| | | Behavior and motivation described in the kafka json import comment. | | | | |
| | * | new SPN web (html) importer | Bryan Newbold | 2021-10-01 | 2 | -27/+81 | 
| | | | |||||
| | * | ingest importer behavior tweaks | Bryan Newbold | 2021-10-01 | 1 | -8/+8 | 
| | | Change order of 'want()' checks, so that result counts are clearer; don't require GROBID success for file imports with SPN. | | | | |
| | * | importer common: more verbose logging (with counts) | Bryan Newbold | 2021-10-01 | 1 | -4/+4 | 
| | | | |||||
| * | | datacite: skip empty abstracts | Martin Czygan | 2021-10-01 | 1 | -1/+4 | 
| |/ | Do not add abstracts where `clean` results in the empty string; this violates a constraint: `either abstract_sha1 or content is required`. | | | | |
| * | more consistent and defensive lower-casing of DOIs | Bryan Newbold | 2021-06-23 | 2 | -1/+6 | 
| | After noticing more upper/lower ambiguity in production. In particular, we have some old ingest requests in sandcrawler DB, which get re-submitted/re-tried, which have capitalized DOIs in the link source id field. | | | | |
| * | datacite: more careful title string access; fixes sentry #88350 | Martin Czygan | 2021-06-11 | 1 | -1/+1 | 
| | Caused by a partial "title entry without title" coming *first* (e.g. one holding just a language, like: {'lang': 'da'}). | | | | |
| * | ingest: swap ingest and file checks, to result in clearer stats/counts of skipping | Bryan Newbold | 2021-06-03 | 1 | -2/+2 | 
| | |||||
| * | ingest: don't accept mag and s2 URLs | Bryan Newbold | 2021-06-03 | 1 | -4/+4 | 
| | | |||||
| * | small python lint fixes (no behavior change) | Bryan Newbold | 2021-05-25 | 1 | -2/+0 | 
| | | |||||
| * | arabesque importer: ensure full 14-digit timestamps | Bryan Newbold | 2021-05-21 | 1 | -1/+3 | 
| | | |||||
| * | datacite: a missing surname should be None, not the empty string | Martin Czygan | 2021-04-02 | 1 | -2/+1 | 
| | refs sentry #77700 | | | | |
| * | web ingest: terminal URL mismatch as skip, not assert | Bryan Newbold | 2020-12-30 | 1 | -1/+3 | 
| | | |||||
| * | dblp release import: skip arxiv_id releases | Bryan Newbold | 2020-12-24 | 1 | -0/+9 | 
| | | |||||
| * | dblp import: fix arxiv_id typo | Bryan Newbold | 2020-12-23 | 1 | -1/+1 | 
| | Would have been caught by mypy! | | | | |
| * | ingest: allow dblp imports | Bryan Newbold | 2020-12-23 | 1 | -1/+1 | 
| | | |||||
| * | fuzzy: set 120 second timeout on ES lookups | Bryan Newbold | 2020-12-23 | 1 | -1/+1 | 
| | | |||||
| * | dblp: polish HTML scrape/extract pipeline | Bryan Newbold | 2020-12-17 | 1 | -0/+14 | 
| | | |||||
| * | dblp: flesh out update code path (especially to add container_id linkage) | Bryan Newbold | 2020-12-17 | 1 | -2/+6 | 
| | | |||||
| * | dblp: run fuzzy matching at try_update time (same as DOAJ) | Bryan Newbold | 2020-12-17 | 1 | -1/+8 | 
| | | |||||
| * | improve dblp release import | Bryan Newbold | 2020-12-17 | 1 | -1/+2 | 
| | | |||||
| * | very simple dblp container importer | Bryan Newbold | 2020-12-17 | 2 | -0/+145 | 
| | | |||||
| * | dblp release importer: container_id lookup TSV, and dump JSON mode | Bryan Newbold | 2020-12-17 | 1 | -10/+66 | 
| | | |||||
