fatcat - [no description]

	Commit message (Collapse)	Author	Age	Files	Lines
*	datacite: limit abstract length	Martin Czygan	2019-12-28	1	-0/+6
\|
*	datacite: use iso 639-1 codes	Martin Czygan	2019-12-28	1	-7/+4
\|
*	address first round of MR14 comments	Martin Czygan	2019-12-28	1	-148/+319
\|	* add missing langdetect
\|	* use entity_to_dict for json debug output
\|	* factor out code for fields in function and add table driven tests
\|	* update citeproc types
\|	* add author as default role
\|	* add raw_affiliation
\|	* include relations from datacite
\|	* remove url (covered by doi already)
\|
\|	Using yapf for python formatting.
*	datacite: move common date patterns out of the loop	Martin Czygan	2019-12-28	1	-3/+4
\|	Additionally, try the unspecific (%Y) pattern last.
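A minimal sketch of the idea in this commit, assuming a strict `datetime.strptime` parser. The pattern list and the `parse_date` helper are hypothetical and are not taken from the importer, which may use a different parser. The patterns are defined once outside any per-record loop, and the bare year pattern comes last.

```python
import datetime

# Hypothetical pattern list, built once at module level instead of on
# every loop iteration. Most specific pattern first, bare year last.
COMMON_DATE_PATTERNS = (
    "%Y-%m-%d",
    "%Y-%m",
    "%Y",
)


def parse_date(value):
    """Return (date, matched_pattern), or (None, None) if nothing matches."""
    for pattern in COMMON_DATE_PATTERNS:
        try:
            parsed = datetime.datetime.strptime(value, pattern)
        except ValueError:
            # This pattern does not fit the value; try the next one.
            continue
        return parsed.date(), pattern
    return None, None
```

Returning the matched pattern alongside the date lets a caller tell a year-only value apart from a full date, since `strptime` fills missing month and day with 1.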
*	improve datacite field mapping and import	Martin Czygan	2019-12-28	1	-41/+139
\|	Current version succeeded in importing a random sample of 100000 records (0.5%) from datacite.
\|
\|	The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporarily added to help debugging.
\|
\|	Add a few unit tests.
\|
\|	Some edge cases:
\|
\|	a) Existing keys without a value require a slightly awkward idiom:
\|
\|	```
\|	titles = attributes.get('titles', []) or []
\|	```
\|
\|	b) There can be 0, 1, or more (first one wins) titles.
\|
\|	c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates.
\|
\|	The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource knows even one more date (2019-06-05 10:14:43 by WIEWS update).
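Edge cases (a) and (b) can be sketched as follows. The helper names are hypothetical, and the sketch assumes each entry in `titles` is a dict with a `title` key; neither is confirmed by the commit itself. The point is that `dict.get(key, default)` only falls back to the default when the key is absent, not when the key is present with a `None` value, hence the trailing `or []`.

```python
def get_list(attributes, key):
    """Return a list for `key`, even if the key is missing or set to None."""
    # .get() alone would return None for {"titles": None}.
    return attributes.get(key, []) or []


def first_title(attributes):
    """Return the first usable title, or None if there is none (first one wins)."""
    for entry in get_list(attributes, "titles"):
        # Assumption: entries are dicts carrying a "title" key.
        title = (entry or {}).get("title")
        if title:
            return title
    return None
```

For example, `first_title({"titles": None})` returns `None` instead of raising a `TypeError`, which is the failure the idiom guards against.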
*	datacite: add missing mappings and notes	Martin Czygan	2019-12-28	1	-266/+175
\|
*	datacite: basic field mappings	Martin Czygan	2019-12-28	1	-41/+181
\|	Currently using two external libraries:
\|
\|	* dateparser
\|	* langcodes
\|
\|	Note: This commit includes lots of WIP docs and field stats in comments, which should be removed.
*	datacite: importer skeleton	Martin Czygan	2019-12-28	2	-0/+459
\|	* contributors, title, date, publisher, container, license
\|
\|	Field and value analysis via https://github.com/miku/indigo.
*	orcid: skip non-person ORCID records	Bryan Newbold	2019-12-26	1	-0/+4
\|
*	allow arabesque backfill ingests for some source types	Bryan Newbold	2019-12-24	1	-0/+5
\|
*	make chocula URL updates more conservative	Bryan Newbold	2019-12-24	1	-5/+5
\|
*	pubmed: if doing update, also do subtitle schema update	Bryan Newbold	2019-12-23	1	-1/+9
\|
*	pubmed: improve warning and stderr formatting	Bryan Newbold	2019-12-23	1	-5/+6
\|
*	pubmed: use standard identifier cleaners	Bryan Newbold	2019-12-23	1	-17/+14
\|
*	pubmed: remove unused extid mapping code	Bryan Newbold	2019-12-23	1	-29/+0
\|
*	pubmed: do reference lookups by default	Bryan Newbold	2019-12-23	1	-1/+1
\|
*	pubmed: null doi parsing check	Bryan Newbold	2019-12-23	1	-1/+1
\|
*	add basic MedlineDate year parsing	Bryan Newbold	2019-12-23	1	-0/+11
\|
*	fix spn/ingest importer duplication check	Bryan Newbold	2019-12-22	1	-6/+8
\|	Check was happening after the `return True` by mistake, allowing duplicates in SPN editgroups, and potentially in ingest request editgroups as well.
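The ordering bug described in this commit can be illustrated with a minimal sketch. The class and method names are hypothetical and are not taken from the fatcat importers; the only point is that the duplicate check has to run before the early `return True`.

```python
class DedupingImporter:
    """Hypothetical importer tracking keys already seen in the current editgroup."""

    def __init__(self):
        self._seen = set()

    def want_buggy(self, key):
        # Anti-pattern: the function returns before the duplicate check,
        # so the check is unreachable and duplicates are always accepted.
        return True
        if key in self._seen:  # never reached
            return False

    def want(self, key):
        # Fixed ordering: reject duplicates first, then record and accept.
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With the fixed ordering, a second call to `want()` with the same key returns `False`, while `want_buggy()` accepts every call.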
*	write diagnostic messages to stderr	Martin Czygan	2019-12-16	1	-2/+2
\|	During debugging, it can be helpful to keep stdout (e.g. processing results) and diagnostic messages separate.
*	Merge branch 'martin-importers-common-doc-fix' into 'master'	Martin Czygan	2019-12-14	1	-13/+10
\|\	Update EntityImporter docstring.
\| \|
\| \|	See merge request webgroup/fatcat!9
\| *	complete parse_record docstring	Martin Czygan	2019-12-14	1	-0/+6
\| \|
\| *	Update EntityImporter docstring.	Martin Czygan	2019-12-13	1	-13/+4
\| \|	I believe the required method is `parse_record`, not `parse`.
* \|	add ingest import file collision protection	Bryan Newbold	2019-12-13	1	-0/+6
\| \|	The common case is the same URL being submitted repeatedly during testing.
\| \|
\| \|	This is only within-editgroup, and per importer (eg, won't work across spn importer "submitted" editgroups), but is better than nothing.
* \|	update ingest request schema	Bryan Newbold	2019-12-13	1	-2/+7
\| \|	This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups.
* \|	remove default mimetype from ingest-file importer	Bryan Newbold	2019-12-13	1	-2/+1
\| \|	We really should just use file_meta result or nothing.
* \|	revert accidentally committed test timing	Bryan Newbold	2019-12-13	1	-2/+2
\| \|	Also fix a spurious typo.
* \|	ensure importer description arg isn't clobbered	Bryan Newbold	2019-12-12	3	-5/+5
\| \|
* \|	savepapernow result importer	Bryan Newbold	2019-12-12	2	-4/+65
\| \|	Based on ingest-file-results importer
* \|	flush importer editgroups every few minutes	Bryan Newbold	2019-12-12	1	-5/+20
\| \|
* \|	EntityImporter: submit (not accept) mode	Bryan Newbold	2019-12-12	1	-2/+14
\|/	For use with bots that don't have admin privileges, or where human follow-up review is desired.
*	add another ingest request source to whitelist	Bryan Newbold	2019-12-10	1	-2/+5
\|
*	tweaks to file ingest importer	Bryan Newbold	2019-12-03	1	-3/+4
\|	- allow overriding source filter whitelist (common case for CLI use)
\|	- fix editgroup description env variable pass-through
*	re-order ingest want() for better stats	Bryan Newbold	2019-11-15	1	-7/+10
\|
*	project -> ingest_request_source	Bryan Newbold	2019-11-15	1	-6/+6
\|
*	ingest importer fixes	Bryan Newbold	2019-11-15	1	-3/+4
\|
*	more ingest importer comments and counts	Bryan Newbold	2019-11-15	1	-1/+28
\|
*	crude support for 'sandcrawler' kafka topics	Bryan Newbold	2019-11-15	1	-2/+3
\|
*	ingest file result importer	Bryan Newbold	2019-11-15	2	-2/+135
\|
*	crossref: accurate blank title counts	Bryan Newbold	2019-11-05	1	-0/+1
\|
*	crossref: component type	Bryan Newbold	2019-11-04	1	-1/+3
\|
*	crossref: count why skip happened	Bryan Newbold	2019-11-04	1	-1/+7
\|	Might skip based on release type (eg container, not a paper/release), or missing title, or other reasons.
\|
\|	Over 7 million DOIs are getting skipped, curious why.
*	crossref: don't skip on short/null subtitle	Bryan Newbold	2019-11-04	1	-1/+1
\|	This was a bug. Should only set subtitle blank, not skip the import.
*	refactor duplicated b32_hex function in importers	Bryan Newbold	2019-10-08	3	-21/+11
\|
*	review/fix all confluent-kafka produce code	Bryan Newbold	2019-09-20	1	-1/+0
\|
*	small fixes to confluent-kafka importers/workers	Bryan Newbold	2019-09-20	1	-10/+24
\|	- decrease default changelog pipeline to 5.0sec
\|	- fix missing KafkaException harvester imports
\|	- more confluent-kafka tweaks
\|	- updates to kafka consumer configs
\|	- bump elastic updates consumergroup (again)
*	small kafka tweaks for robustness	Bryan Newbold	2019-09-20	1	-0/+3
\|
*	convert importers to confluent-kafka library	Bryan Newbold	2019-09-20	1	-19/+71
\|
*	refactor all python source for client lib name	Bryan Newbold	2019-09-05	14	-106/+106
\|
*	fix Importer editgroup_extra pass-through	Bryan Newbold	2019-09-05	1	-2/+1
\|