fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	python: isort everything	Bryan Newbold	2021-11-02	17	-41/+70
\|
*	arabesque import 'hit' field is 1/0, not true/false	Bryan Newbold	2021-11-02	1	-2/+2
\|
*	lint: simple, safe inline lint fixes	Bryan Newbold	2021-11-02	12	-22/+21
\|	'==' vs 'is'; 'not a in b' vs 'a not in b'; etc.
*	lint/fmt: remove all 'import *'	Bryan Newbold	2021-11-02	5	-21/+41
\|
*	re-fmt all the fatcat_tools __init__ files for readability	Bryan Newbold	2021-11-02	1	-17/+39
\|
*	small python tweaks for annotations, imports	Bryan Newbold	2021-11-02	2	-2/+6
\|
*	try some type annotations	Bryan Newbold	2021-11-02	2	-55/+63
\|
*	fix missing variable in fileset ingest	Bryan Newbold	2021-11-02	1	-2/+1
\|
*	WIP: more fileset ingest	Bryan Newbold	2021-10-18	1	-13/+21
\|
*	WIP: rel fixes	Bryan Newbold	2021-10-14	1	-6/+6
\|
*	fileset ingest small tweaks	Bryan Newbold	2021-10-14	1	-21/+36
\|
*	initial implementation of fileset ingest importers	Bryan Newbold	2021-10-14	2	-3/+224
\|
*	generic fileset importer class, with test coverage	Bryan Newbold	2021-10-14	3	-0/+88
\|
*	dblp import: basic support for handles as identifiers	Bryan Newbold	2021-10-13	1	-1/+5
\|
*	dblp import: fix typos in identifier parsing	Bryan Newbold	2021-10-13	1	-2/+1
\|
*	python: partial importer utilization of new schema changes	Bryan Newbold	2021-10-13	3	-6/+18
\|
*	Merge branch 'bnewbold-ingest-tweaks' into 'master'	bnewbold	2021-10-02	3	-39/+106
\|\	ingest importer behavior tweaks
\| \|	See merge request webgroup/fatcat!120
\| *	kafka import: optional 'force-flush' mode for some importers	Bryan Newbold	2021-10-01	1	-0/+13
\| \|	Behavior and motivation described in the kafka json import comment.
\| *	new SPN web (html) importer	Bryan Newbold	2021-10-01	2	-27/+81
\| \|
\| *	ingest importer behavior tweaks	Bryan Newbold	2021-10-01	1	-8/+8
\| \|	- change order of 'want()' checks, so that result counts are clearer
\| \|	- don't require GROBID success for file imports with SPN
\| *	importer common: more verbose logging (with counts)	Bryan Newbold	2021-10-01	1	-4/+4
\| \|
* \|	datacite: skip empty abstracts	Martin Czygan	2021-10-01	1	-1/+4
\|/	Do not add abstracts where `clean` results in the empty string - this violates a constraint: `either abstract_sha1 or content is required`
*	more consistent and defensive lower-casing of DOIs	Bryan Newbold	2021-06-23	2	-1/+6
\|	Added after noticing more upper/lower-case ambiguity in production. In particular, some old ingest requests in the sandcrawler DB, which get re-submitted/re-tried, have capitalized DOIs in the link source id field.
*	datacite: more careful title string access; fixes sentry #88350	Martin Czygan	2021-06-11	1	-1/+1
\|	Caused by a partial "title entry without title" coming first (e.g. one holding just a language, like: {'lang': 'da'}).
*	ingest: swap ingest and file checks, to result in clearer stats/counts of skipping	Bryan Newbold	2021-06-03	1	-2/+2
\|
*	ingest: don't accept mag and s2 URLs	Bryan Newbold	2021-06-03	1	-4/+4
\|
*	small python lint fixes (no behavior change)	Bryan Newbold	2021-05-25	1	-2/+0
\|
*	arabesque importer: ensure full 14-digit timestamps	Bryan Newbold	2021-05-21	1	-1/+3
\|
*	datacite: a missing surname should be None, not the empty string	Martin Czygan	2021-04-02	1	-2/+1
\|	refs sentry #77700
*	web ingest: terminal URL mismatch as skip, not assert	Bryan Newbold	2020-12-30	1	-1/+3
\|
*	dblp release import: skip arxiv_id releases	Bryan Newbold	2020-12-24	1	-0/+9
\|
*	dblp import: fix arxiv_id typo	Bryan Newbold	2020-12-23	1	-1/+1
\|	Would have been caught by mypy!
*	ingest: allow dblp imports	Bryan Newbold	2020-12-23	1	-1/+1
\|
*	fuzzy: set 120 second timeout on ES lookups	Bryan Newbold	2020-12-23	1	-1/+1
\|
*	dblp: polish HTML scrape/extract pipeline	Bryan Newbold	2020-12-17	1	-0/+14
\|
*	dblp: flesh out update code path (especially to add container_id linkage)	Bryan Newbold	2020-12-17	1	-2/+6
\|
*	dblp: run fuzzy matching at try_update time (same as DOAJ)	Bryan Newbold	2020-12-17	1	-1/+8
\|
*	improve dblp release import	Bryan Newbold	2020-12-17	1	-1/+2
\|
*	very simple dblp container importer	Bryan Newbold	2020-12-17	2	-0/+145
\|
*	dblp release importer: container_id lookup TSV, and dump JSON mode	Bryan Newbold	2020-12-17	1	-10/+66
\|
*	initial implementation of dblp release importer (in progress)	Bryan Newbold	2020-12-17	2	-0/+445
\|
*	add 'lxml' mode for large XML file import, and multi-tags	Bryan Newbold	2020-12-17	1	-15/+28
\|
*	add dblp as an ingest source and identifier	Bryan Newbold	2020-12-17	1	-1/+2
\|
*	ingest: allow doaj ingest responses	Bryan Newbold	2020-12-17	1	-1/+2
\|
*	update fuzzy helper to pass 'reason' through to import code	Bryan Newbold	2020-12-17	1	-3/+3
\|	The motivation for this change is to enable passing the 'reason' through to edit extra metadata, in cases where we merge or cluster releases.
*	add fuzzy match filtering to DOAJ importer	Bryan Newbold	2020-12-16	1	-2/+9
\|	In this default configuration, any entities with a fuzzy match (even "ambiguous") will be skipped at import time, to prevent creating duplicates. This is conservative towards not creating new/duplicate entities. In the future, as we get more confidence in fuzzy match/verification, we can start to ignore AMBIGUOUS, handle EXACT as same release, and merge STRONG (and WEAK?) matches under the same work entity.
*	add fuzzy matching helper to importer base class	Bryan Newbold	2020-12-16	1	-2/+62
\|	Using fuzzycat. Add basic test coverage.
*	html ingest: small fixes to try_update() code path	Bryan Newbold	2020-12-15	1	-5/+5
\|	Don't currently have test coverage for most try_update() code; run the inserts manually in testing.
*	crossref+datacite: remove confusing early update bail	Bryan Newbold	2020-11-20	2	-4/+0
\|	It is easy to miss that we skip updates twice, and with this early bailout we were not updating counts correctly.
*	doaj: fix update code path (getattr not __dict__)	Bryan Newbold	2020-11-20	1	-4/+3
\|	Also add missing code coverage for update path (disabled by default).