fatcat - [no description]

	Commit message (Collapse)	Author	Age	Files	Lines
*	Datacite API v2 throws 400, we cannot recover from, currently.	Martin Czygan	2019-12-27	1	-0/+4
\| \| \| \| \| \| \| \| \| \|	As a first iteration, just mark the daily batch complete and continue. The occasional HTTP 400 issue has been reported as https://github.com/datacite/datacite/issues/897. A possible improvement would be to shrink the window, so losses will be smaller.
*	datacite: update documentation, add links to issues	Martin Czygan	2019-12-27	1	-10/+5
\|
*	datacite: use v2 of the API (flaky)	Martin Czygan	2019-12-27	1	-5/+28
\| \| \| \| \| \| \| \| \|	Update parameters for datacite API v2. Works fine, but there are occasional HTTP 400 responses when using the cursor API (daily updates can exceed the 10000 record limit for search queries). The HTTP 400 issue is not solved yet, but has been reported to datacite as https://github.com/datacite/datacite/issues/897.
*	transform ingests via pmc/pmcid, not pubmed/pmid	Bryan Newbold	2019-12-24	1	-4/+4
\|
*	allow arabesque backfill ingests for some source types	Bryan Newbold	2019-12-24	1	-0/+5
\|
*	make chocula URL updates more conservative	Bryan Newbold	2019-12-24	1	-5/+5
\|
*	pubmed: if doing update, also do subtitle schema update	Bryan Newbold	2019-12-23	1	-1/+9
\|
*	doi parsing fixes	Bryan Newbold	2019-12-23	1	-0/+7
\| \| \| \| \| \| \| \| \| \|	Replace emdash with regular dash. Replace double slash after partner ID with single slash. This conversion seems to be done by crossref automatically on lookup. I tried several examples, using doi.org resolver and Crossref API lookup. Note that there are a number of fatcat entities with '//' in the DOI.
*	pubmed: improve warning and stderr formatting	Bryan Newbold	2019-12-23	1	-5/+6
\|
*	pubmed: use standard identifier cleaners	Bryan Newbold	2019-12-23	1	-17/+14
\|
*	pubmed: remove unused extid mapping code	Bryan Newbold	2019-12-23	1	-29/+0
\|
*	pubmed: do reference lookups by default	Bryan Newbold	2019-12-23	1	-1/+1
\|
*	normalizers: clean_pmid(), and handle nulls in all other cleaners	Bryan Newbold	2019-12-23	1	-0/+31
\|
*	pubmed: null doi parsing check	Bryan Newbold	2019-12-23	1	-1/+1
\|
*	add basic MedlineDate year parsing	Bryan Newbold	2019-12-23	1	-0/+11
\|
*	fix spn/ingest importer duplication check	Bryan Newbold	2019-12-22	1	-6/+8
\| \| \| \| \| \|	Check was happening after the `return True` by mistake, allowing duplicates in SPN editgroups, and potentially in ingest request editgroups as well.
*	write diagnostic messages to stderr	Martin Czygan	2019-12-16	1	-2/+2
\| \| \| \| \|	During debugging, it can be helpful to keep stdout (e.g. processing results) and diagnostic messages separate.
*	Merge branch 'martin-importers-common-doc-fix' into 'master'	Martin Czygan	2019-12-14	1	-13/+10
\|\ \| \| \| \| \| \| \| \|	Update EntityImporter docstring. See merge request webgroup/fatcat!9
\| *	complete parse_record docstring	Martin Czygan	2019-12-14	1	-0/+6
\| \|
\| *	Update EntityImporter docstring.	Martin Czygan	2019-12-13	1	-13/+4
\| \| \| \| \| \| \| \|	I believe the required method is `parse_record`, not `parse`.
* \|	add ingest import file collision protection	Bryan Newbold	2019-12-13	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The common case is the same URL being submitted repeatedly during testing. This is only within-editgroup, and per importer (eg, won't work across spn importer "submitted" editgroups), but is better than nothing.
* \|	update ingest request schema	Bryan Newbold	2019-12-13	3	-8/+30
\| \| \| \| \| \| \| \| \| \|	This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups.
* \|	remove default mimetype from ingest-file importer	Bryan Newbold	2019-12-13	1	-2/+1
\| \| \| \| \| \| \| \|	We really should just use file_meta result or nothing.
* \|	revert accidentally committed test timing	Bryan Newbold	2019-12-13	1	-2/+2
\| \| \| \| \| \| \| \|	Also fix a spurious typo.
* \|	ensure importer description arg isn't clobbered	Bryan Newbold	2019-12-12	3	-5/+5
\| \|
* \|	tweaks to ingest-file transform	Bryan Newbold	2019-12-12	1	-13/+7
\| \|
* \|	savepapernow result importer	Bryan Newbold	2019-12-12	2	-4/+65
\| \| \| \| \| \| \| \|	Based on ingest-file-results importer
* \|	flush importer editgroups every few minutes	Bryan Newbold	2019-12-12	1	-5/+20
\| \|
* \|	EntityImporter: submit (not accept) mode	Bryan Newbold	2019-12-12	1	-2/+14
\|/ \| \| \| \|	For use with bots that don't have admin privileges, or where human follow-up review is desired.
*	factor out some basic kafka helpers	Bryan Newbold	2019-12-10	2	-0/+23
\|
*	add another ingest request source to whitelist	Bryan Newbold	2019-12-10	1	-2/+5
\|
*	refactor kafka producer in crossref harvester	Bryan Newbold	2019-12-06	1	-21/+26
\| \| \| \| \| \| \| \|	Producer creation/configuration should happen at __init__() time, not in the 'daily' call. This specific refactor was motivated by mocking out the producer in unit tests.
*	tweaks to file ingest importer	Bryan Newbold	2019-12-03	1	-3/+4
\| \| \| \| \|	- allow overriding source filter whitelist (common case for CLI use) - fix editgroup description env variable pass-through
*	crossref is_update isn't what I thought	Bryan Newbold	2019-12-03	1	-6/+2
\| \| \| \| \| \| \| \|	I thought this would filter for metadata updates to an existing DOI, but actually "updates" are a type of DOI (eg, a retraction). TODO: handle 'updates' field. Should both do a lookup and set work_ident appropriately, and store in crossref-specific metadata.
*	re-order ingest want() for better stats	Bryan Newbold	2019-11-15	1	-7/+10
\|
*	project -> ingest_request_source	Bryan Newbold	2019-11-15	3	-9/+9
\|
*	fix release.pmcid typo	Bryan Newbold	2019-11-15	1	-2/+2
\|
*	ingest importer fixes	Bryan Newbold	2019-11-15	1	-3/+4
\|
*	more ingest importer comments and counts	Bryan Newbold	2019-11-15	2	-2/+29
\|
*	crude support for 'sandcrawler' kafka topics	Bryan Newbold	2019-11-15	1	-2/+3
\|
*	ingest file result importer	Bryan Newbold	2019-11-15	2	-2/+135
\|
*	add ingest request feature to entity_updates worker	Bryan Newbold	2019-11-15	1	-4/+20
\| \| \| \| \| \| \| \| \| \| \| \| \|	Initially was going to create a new worker to consume from the release update channel, but couldn't get the edit context ("is this a new release, or update to an existing") from that context. Currently there is a flag in source code to control whether we only do OA releases or all releases. Starting with OA only to start slow, but should probably default to all, and make this a config flag. Should probably also have a config flag to control this entire feature. Tested locally in dev.
*	add ingest request transform (and test)	Bryan Newbold	2019-11-15	2	-0/+67
\|
*	crossref: accurate blank title counts	Bryan Newbold	2019-11-05	1	-0/+1
\|
*	crossref: component type	Bryan Newbold	2019-11-04	1	-1/+3
\|
*	crossref: count why skip happened	Bryan Newbold	2019-11-04	1	-1/+7
\| \| \| \| \| \|	Might skip based on release type (eg container, not a paper/release), or missing title, or other reasons. Over 7 million DOIs are getting skipped, curious why.
*	crossref: don't skip on short/null subtitle	Bryan Newbold	2019-11-04	1	-1/+1
\| \| \| \|	This was a bug. Should only set subtitle blank, not skip the import.
*	file cleanup tweaks to actually run	Bryan Newbold	2019-10-08	2	-5/+4
\|
*	refactor duplicated b32_hex function in importers	Bryan Newbold	2019-10-08	3	-21/+11
\|
*	dict wrapper for entity_from_json()	Bryan Newbold	2019-10-08	2	-3/+7
\|
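
Illustration for the "doi parsing fixes" commit (2019-12-23) above. The commit body describes two rewrites: an em dash becomes a regular dash, and a double slash after the partner ID becomes a single slash. The following is only a minimal sketch of those two steps, not the code from the commit. The function name normalize_doi_sketch, the whitespace stripping, and the regular expression for the partner ID (the "10.NNNN" part) are assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical helper, not taken from the fatcat code base.
# Assumption: the "partner ID" is the leading "10.<digits>" part of the DOI.
_PREFIX_DOUBLE_SLASH = re.compile(r"^(10\.\d+)//")


def normalize_doi_sketch(raw: Optional[str]) -> Optional[str]:
    """Apply the two rewrites described in the commit message."""
    if raw is None:
        # Null input is passed through unchanged.
        return None
    doi = raw.strip()
    # Replace em dash (U+2014) with a regular ASCII dash.
    doi = doi.replace("\u2014", "-")
    # Replace a double slash right after the partner ID with a single slash.
    doi = _PREFIX_DOUBLE_SLASH.sub(r"\1/", doi)
    return doi


if __name__ == "__main__":
    print(normalize_doi_sketch("10.1234//example.5678"))
    print(normalize_doi_sketch("10.1234/example\u20145678"))
```

Only a double slash directly after the leading partner ID is rewritten. A '//' that appears later in the DOI is left alone, since the commit message only mentions the position after the partner ID.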