fatcat - [no description]

	Commit message (Collapse)	Author	Age	Files	Lines
*	datacite: parse_datacite_dates returns month	Martin Czygan	2020-01-03	1	-10/+35
\| \| \| \|	As [...] we will soon add support for release_month field in the release schema.
*	datacite: prepare release_month (stub)	Martin Czygan	2020-01-03	1	-10/+10
\|
*	datacite: lowercase only once	Martin Czygan	2020-01-03	1	-3/+4
\|
*	datacite: remove --lang-detect flag	Martin Czygan	2020-01-03	1	-11/+6
\| \| \| \|	Estimated time for a single call is on the order of 50 ms.
*	datacite: address raw_name index form comment	Martin Czygan	2020-01-02	1	-0/+43
\| \| \| \| \| \| \| \| \|	> The convention for display_name and raw_name is to be how the name would normally be printed, not in index form (surname comma given_name). So we might need to un-encode names like "Tricart, Pierre". Use an additional `index_form_to_display_name` function to heuristically convert index form to display form.
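A minimal sketch of such a heuristic conversion. Only the function name comes from the commit message; the body below is an assumption about how it could work, not the committed implementation:

```python
def index_form_to_display_name(name: str) -> str:
    """Heuristically turn "Surname, Given" into "Given Surname".

    Assumption: only names with exactly one comma are treated as index
    form; anything else is returned unchanged.
    """
    if name.count(",") != 1:
        # Zero or several commas: too ambiguous to reorder safely.
        return name
    surname, given = (part.strip() for part in name.split(","))
    if not surname or not given:
        # A dangling comma is not a usable index form.
        return name
    return "{} {}".format(given, surname)
```

Names that are already in display form pass through untouched, so the function can be applied to every contributor name.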
*	datacite: add two more skipable tokens	Martin Czygan	2020-01-02	1	-1/+1
\|
*	datacite: names can be 'Unav', too	Martin Czygan	2020-01-02	1	-1/+4
\|
*	datacite: avoid more None values	Martin Czygan	2020-01-01	1	-4/+4
\|
*	datacite: address 'Unpublished' publisher	Martin Czygan	2019-12-31	1	-9/+10
\|
*	datacite: ensure name schema is defined	Martin Czygan	2019-12-31	1	-1/+2
\|
*	datacite: fix typo	Martin Czygan	2019-12-31	1	-1/+1
\|
*	datacite: isascii was added in 3.7, only	Martin Czygan	2019-12-31	1	-1/+7
\|
*	datacite: skip non-ascii doi for now	Martin Czygan	2019-12-31	1	-0/+4
\| \| \| \| \| \|	Example of a non-ascii doi: * https://doi.org/10.13125/américacrítica/3017
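`str.isascii()` is only available from Python 3.7 on, so a non-ASCII DOI check needs a fallback on older interpreters. A minimal sketch, assuming a hypothetical helper named `is_ascii` (neither the name nor the approach is taken from the commits):

```python
def is_ascii(s: str) -> bool:
    """Return True if s contains only ASCII characters."""
    if hasattr(s, "isascii"):
        # Python 3.7+: use the built-in method.
        return s.isascii()
    try:
        # Older interpreters: encoding fails on any non-ASCII character.
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True
```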
*	datacite: clean doi	Martin Czygan	2019-12-31	1	-1/+13
\| \| \| \| \| \| \|	address issue with EN DASH DOI. > "external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.25513/1812-3996.2017.1.34–42"
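One plausible way to handle the EN DASH case is to map dash-like Unicode characters to an ASCII hyphen before validation. This is a sketch under that assumption; `clean_doi_sketch` and the set of mapped characters are hypothetical, and the actual commit may handle such DOIs differently:

```python
# Dash-like code points mapped to "-": hyphen, non-breaking hyphen,
# figure dash, en dash, minus sign (an assumed, non-exhaustive set).
_DASH_TABLE = str.maketrans({
    "\u2010": "-",
    "\u2011": "-",
    "\u2012": "-",
    "\u2013": "-",
    "\u2212": "-",
})


def clean_doi_sketch(raw: str) -> str:
    """Strip surrounding whitespace and normalize dash-like characters."""
    return raw.strip().translate(_DASH_TABLE)
```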
*	datacite: update docs	Martin Czygan	2019-12-31	1	-9/+9
\|
*	datacite: perform additional checks on contrib	Martin Czygan	2019-12-30	1	-3/+9
\|
*	datacite: check for empty title after clean	Martin Czygan	2019-12-29	1	-2/+5
\|
*	datacite: update docs with observed values	Martin Czygan	2019-12-29	1	-1/+3
\|
*	datacite: page number misses are too common	Martin Czygan	2019-12-28	1	-1/+2
\| \| \| \| \| \|	Should be logged at debug level, not info. Examples: E675, n/a, 15D.2.1, 15D.2.1, A.1E.1, A.1E.1, ...
*	datacite: suppress debug-like language lookup miss message	Martin Czygan	2019-12-28	1	-1/+3
\|
*	datacite: treat untyped names as people	Martin Czygan	2019-12-28	1	-1/+1
\|
*	datacite: include container_name top level key in extra	Martin Czygan	2019-12-28	1	-7/+21
\|
*	datacite: use clean on field values	Martin Czygan	2019-12-28	1	-2/+28
\|
*	datacite: include doi in error messages	Martin Czygan	2019-12-28	1	-8/+8
\|
*	datacite: limit abstract length	Martin Czygan	2019-12-28	1	-0/+6
\|
*	datacite: use iso 639-1 codes	Martin Czygan	2019-12-28	1	-7/+4
\|
*	address first round of MR14 comments	Martin Czygan	2019-12-28	1	-148/+319
\| \| \| \| \| \| \| \| \| \| \| \| \|	* add missing langdetect * use entity_to_dict for json debug output * factor out code for fields in function and add table driven tests * update citeproc types * add author as default role * add raw_affiliation * include relations from datacite * remove url (covered by doi already) Using yapf for python formatting.
*	datacite: move common date patterns out of the loop	Martin Czygan	2019-12-28	1	-3/+4
\| \| \| \|	Additionally, try the unspecific (%Y) pattern last.
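The idea can be sketched as follows: the pattern list is built once at module level, ordered from most to least specific, with the bare year last. The names `COMMON_DATE_PATTERNS` and `parse_date_sketch`, the pattern set, and the return shape (including the month, as a later commit in this log introduces) are assumptions for illustration:

```python
import datetime
from typing import Optional, Tuple

# Built once, not per record. Most specific first, unspecific %Y last.
COMMON_DATE_PATTERNS = (
    ("%Y-%m-%d", "ymd"),
    ("%Y-%m", "ym"),
    ("%Y", "y"),
)


def parse_date_sketch(
    value: str,
) -> Tuple[Optional[datetime.date], Optional[int], Optional[int]]:
    """Return (release_date, release_month, release_year); None if unknown."""
    for pattern, granularity in COMMON_DATE_PATTERNS:
        try:
            parsed = datetime.datetime.strptime(value, pattern)
        except ValueError:
            continue  # pattern did not match, try the next one
        if granularity == "ymd":
            return parsed.date(), parsed.month, parsed.year
        if granularity == "ym":
            return None, parsed.month, parsed.year
        return None, None, parsed.year
    return None, None, None
```

A value such as `1986` only yields a year, while `1978-06-03` yields a full date, a month, and a year.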
*	improve datacite field mapping and import	Martin Czygan	2019-12-28	1	-41/+139
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The current version successfully imported a random sample of 100000 records (0.5%) from datacite. The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporarily added to help debugging. Add a few unit tests. Some edge cases: a) Existing keys without a value require a slightly awkward idiom: ``` titles = attributes.get('titles', []) or [] ``` b) There can be 0, 1, or more (first one wins) titles. c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates. The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource lists one more date (2019-06-05 10:14:43 by WIEWS update).
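Edge cases a) and b) can be illustrated with a short sketch. The `or []` guard is needed because `dict.get` only falls back to its default when the key is missing, not when the key exists with a `None` value. The helper name `first_title` and the assumption that each entry is a dict with a `title` key are illustrative; this variant also skips empty titles, which goes slightly beyond "first one wins":

```python
from typing import Optional


def first_title(attributes: dict) -> Optional[str]:
    """Return the first non-empty title, or None if there is none."""
    # Key present but set to None: get() returns None, so fall back to [].
    titles = attributes.get("titles", []) or []
    for entry in titles:
        title = (entry.get("title") or "").strip()
        if title:
            return title
    return None
```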
*	datacite: add missing mappings and notes	Martin Czygan	2019-12-28	1	-266/+175
\|
*	datacite: basic field mappings	Martin Czygan	2019-12-28	1	-41/+181
\| \| \| \| \| \| \| \| \| \|	Currently using two external libraries: * dateparser * langcodes Note: This commit includes lots of wip docs and field stat in comment, which should be removed.
*	datacite: importer skeleton	Martin Czygan	2019-12-28	2	-0/+459
\| \| \| \| \| \|	* contributors, title, date, publisher, container, license Field and value analysis via https://github.com/miku/indigo.
*	orcid: skip non-person ORCID records	Bryan Newbold	2019-12-26	1	-0/+4
\|
*	allow arabesque backfill ingests for some source types	Bryan Newbold	2019-12-24	1	-0/+5
\|
*	make chocula URL updates more conservative	Bryan Newbold	2019-12-24	1	-5/+5
\|
*	pubmed: if doing update, also do subtitle schema update	Bryan Newbold	2019-12-23	1	-1/+9
\|
*	pubmed: improve warning and stderr formatting	Bryan Newbold	2019-12-23	1	-5/+6
\|
*	pubmed: use standard identifier cleaners	Bryan Newbold	2019-12-23	1	-17/+14
\|
*	pubmed: remove unused extid mapping code	Bryan Newbold	2019-12-23	1	-29/+0
\|
*	pubmed: do reference lookups by default	Bryan Newbold	2019-12-23	1	-1/+1
\|
*	pubmed: null doi parsing check	Bryan Newbold	2019-12-23	1	-1/+1
\|
*	add basic MedlineDate year parsing	Bryan Newbold	2019-12-23	1	-0/+11
\|
*	fix spn/ingest importer duplication check	Bryan Newbold	2019-12-22	1	-6/+8
\| \| \| \| \| \|	Check was happening after the `return True` by mistake, allowing duplicates in SPN editgroups, and potentially in ingest request editgroups as well.
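The class of bug described here can be shown with a generic sketch: a within-batch duplicate check only works if it runs before the early `return True`. The class and method names below are hypothetical and do not mirror the importer's real API:

```python
class DedupeSketch:
    """Track keys seen in the current batch (hypothetical helper)."""

    def __init__(self) -> None:
        self._seen = set()  # keys already accepted in this batch

    def should_insert(self, key: str) -> bool:
        # The check must come first; placed after `return True` it would
        # be unreachable and every duplicate would be accepted.
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```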
*	write diagnostic messages to stderr	Martin Czygan	2019-12-16	1	-2/+2
\| \| \| \| \|	During debugging, it can be helpful to keep stdout (e.g. processing results) and diagnostic messages separate.
*	Merge branch 'martin-importers-common-doc-fix' into 'master'	Martin Czygan	2019-12-14	1	-13/+10
\|\ \| \| \| \| \| \| \| \|	Update EntityImporter docstring. See merge request webgroup/fatcat!9
\| *	complete parse_record docstring	Martin Czygan	2019-12-14	1	-0/+6
\| \|
\| *	Update EntityImporter docstring.	Martin Czygan	2019-12-13	1	-13/+4
\| \| \| \| \| \| \| \|	I believe the required method is `parse_record`, not `parse`.
* \|	add ingest import file collision protection	Bryan Newbold	2019-12-13	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The common case is the same URL being submitted repeatedly during testing. This is only within-editgroup, and per importer (eg, won't work across spn importer "submitted" editgroups), but is better than nothing.
* \|	update ingest request schema	Bryan Newbold	2019-12-13	1	-2/+7
\| \| \| \| \| \| \| \| \| \|	This is mostly changing ingest_type from 'file' to 'pdf', and adding 'link_source'/'link_source_id', plus some small cleanups.
* \|	remove default mimetype from ingest-file importer	Bryan Newbold	2019-12-13	1	-2/+1
\| \| \| \| \| \| \| \|	We really should just use file_meta result or nothing.