fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	datacite: set release_stage to published by default	Martin Czygan	2020-01-06	1	-4/+5
\| \| \| \| \| \|	Set to `None` only if there is no publisher yet. Docs: https://support.datacite.org/docs/doi-states
*	datacite: month field should be top-level	Martin Czygan	2020-01-06	1	-2/+2
\|
*	datacite: include month in extra	Martin Czygan	2020-01-06	1	-0/+2
\| \| \| \| \|	> include release_month as a top-level extra field [...] to auto-populate the schema field from that
*	datacite: clean abstracts, use unknown value tokens	Martin Czygan	2020-01-06	1	-4/+26
\|	Datacite defines placeholders for unknown values:
\|	* https://support.datacite.org/docs/schema-values-unknown-information-v43
\|	Clean abstracts.
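A minimal sketch of how placeholder tokens could be filtered out of field values. The token set below is an assumption for illustration (a small subset, plus the bare 'Unav' mentioned in a later commit), and `drop_unknown` is a hypothetical helper, not the importer's actual code. The linked Datacite page has the authoritative list.

```python
# Assumed subset of placeholder tokens; check the Datacite docs for the full list.
UNKNOWN_TOKENS = {":unav", ":unkn", ":none", ":null", ":tba", "unav"}


def drop_unknown(value):
    """Return a stripped value, or None if it is empty or a placeholder token."""
    if value is None:
        return None
    cleaned = value.strip()
    # Compare case-insensitively and ignore surrounding parentheses, e.g. "(:unav)".
    if not cleaned or cleaned.lower().strip("()") in UNKNOWN_TOKENS:
        return None
    return cleaned
```

Returning `None` rather than an empty string lets callers treat "unknown" and "missing" the same way.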
*	datacite: clean abstract as well	Martin Czygan	2020-01-06	1	-1/+1
\|
*	datacite: filter out 'Cites' relation as well	Martin Czygan	2020-01-06	1	-1/+1
\|
*	datacite: always include "datacite" key in extra	Martin Czygan	2020-01-04	1	-2/+2
\| \| \| \| \| \|	> always include extra values for the respective DOI registrars (datacite, crossref, jalc), even if they are empty ({}), to be used as a flag so we know which DOI registrar supplied the metadata.
*	datacite: use normal.clean_doi	Martin Czygan	2020-01-03	1	-11/+1
\|
*	datacite: parse_datacite_dates returns month	Martin Czygan	2020-01-03	1	-10/+35
\| \| \| \|	As [...] we will soon add support for release_month field in the release schema.
*	datacite: prepare release_month (stub)	Martin Czygan	2020-01-03	1	-10/+10
\|
*	datacite: lowercase only once	Martin Czygan	2020-01-03	1	-3/+4
\|
*	datacite: remove --lang-detect flag	Martin Czygan	2020-01-03	1	-11/+6
\|	Estimated time for a single call is on the order of 50 ms.
*	datacite: address raw_name index form comment	Martin Czygan	2020-01-02	1	-0/+43
\|	> The convention for display_name and raw_name is to be how the name would normally be printed, not in index form (surname comma given_name). So we might need to un-encode names like "Tricart, Pierre".
\|	Use an additional `index_form_to_display_name` function to convert index form to display form, heuristically.
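The conversion described above could look roughly like the following. Only the function name comes from the commit message. The body is an illustrative guess at one possible heuristic (exactly one comma means "surname, given name"), not the importer's actual implementation.

```python
def index_form_to_display_name(s):
    """Heuristically turn 'Surname, Given' into 'Given Surname'."""
    if s.count(",") != 1:
        # Zero or several commas: too ambiguous, leave the name unchanged.
        return s
    surname, given = (part.strip() for part in s.split(","))
    if not surname or not given:
        # A dangling comma carries no usable split, so keep the input.
        return s
    return "{} {}".format(given, surname)
```

For example, `index_form_to_display_name("Tricart, Pierre")` yields `"Pierre Tricart"`, while a name without a comma passes through untouched.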
*	datacite: add two more skippable tokens	Martin Czygan	2020-01-02	1	-1/+1
\|
*	datacite: names can be 'Unav', too	Martin Czygan	2020-01-02	1	-1/+4
\|
*	datacite: avoid more None values	Martin Czygan	2020-01-01	1	-4/+4
\|
*	datacite: address 'Unpublished' publisher	Martin Czygan	2019-12-31	1	-9/+10
\|
*	datacite: ensure name schema is defined	Martin Czygan	2019-12-31	1	-1/+2
\|
*	datacite: fix typo	Martin Czygan	2019-12-31	1	-1/+1
\|
*	datacite: isascii was added in 3.7, only	Martin Czygan	2019-12-31	1	-1/+7
\|
*	datacite: skip non-ascii doi for now	Martin Czygan	2019-12-31	1	-0/+4
\| \| \| \| \| \|	Example of a non-ascii doi: * https://doi.org/10.13125/américacrítica/3017
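A sketch of an ASCII check that also works on interpreters older than Python 3.7, where `str.isascii` does not exist. `is_ascii` is a hypothetical helper name. The log does not show how the importer itself implements the check.

```python
def is_ascii(s):
    """Return True if every character in s is ASCII."""
    try:
        # Available from Python 3.7 on.
        return s.isascii()
    except AttributeError:
        # Older interpreters: fall back to an encoding round trip.
        try:
            s.encode("ascii")
        except UnicodeEncodeError:
            return False
        return True
```

With this helper, a DOI such as `10.13125/américacrítica/3017` would be flagged as non-ASCII and could be skipped.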
*	datacite: clean doi	Martin Czygan	2019-12-31	1	-1/+13
\|	Address issue with EN DASH DOI.
\|	> "external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.25513/1812-3996.2017.1.34–42"
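The commit message does not say whether the en dash is replaced or the record is rejected. The sketch below assumes replacement with a plain hyphen before validation. `clean_doi_sketch` and `DOI_PATTERN` are hypothetical names, and the pattern is a simplified stand-in for whatever the API actually enforces.

```python
import re

# Simplified, assumed DOI shape: "10.", a registrant code, a slash, a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{3,6}/\S+$")


def clean_doi_sketch(raw):
    """Normalize a raw DOI string; return None if it still looks invalid."""
    if not raw:
        return None
    doi = raw.strip().lower()
    # Assumption: an EN DASH (U+2013) in a DOI is a typographic hyphen.
    doi = doi.replace("\u2013", "-")
    if not DOI_PATTERN.match(doi):
        return None
    return doi
```

Under these assumptions, the DOI from the error message above comes out as `10.25513/1812-3996.2017.1.34-42`.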
*	datacite: update docs	Martin Czygan	2019-12-31	1	-9/+9
\|
*	datacite: perform additional checks on contrib	Martin Czygan	2019-12-30	1	-3/+9
\|
*	datacite: check for empty title after clean	Martin Czygan	2019-12-29	1	-2/+5
\|
*	datacite: update docs with observed values	Martin Czygan	2019-12-29	1	-1/+3
\|
*	datacite: page number misses are too common	Martin Czygan	2019-12-28	1	-1/+2
\|	These misses should be logged at debug level, not info. Examples: E675, n/a, 15D.2.1, 15D.2.1, A.1E.1, A.1E.1, ...
*	datacite: suppress debug-like language lookup miss message	Martin Czygan	2019-12-28	1	-1/+3
\|
*	datacite: treat untyped names as people	Martin Czygan	2019-12-28	1	-1/+1
\|
*	datacite: include container_name top level key in extra	Martin Czygan	2019-12-28	1	-7/+21
\|
*	datacite: use clean on field values	Martin Czygan	2019-12-28	1	-2/+28
\|
*	datacite: include doi in error messages	Martin Czygan	2019-12-28	1	-8/+8
\|
*	datacite: limit abstract length	Martin Czygan	2019-12-28	1	-0/+6
\|
*	datacite: use iso 639-1 codes	Martin Czygan	2019-12-28	1	-7/+4
\|
*	address first round of MR14 comments	Martin Czygan	2019-12-28	1	-148/+319
\|	* add missing langdetect
\|	* use entity_to_dict for json debug output
\|	* factor out code for fields in function and add table driven tests
\|	* update citeproc types
\|	* add author as default role
\|	* add raw_affiliation
\|	* include relations from datacite
\|	* remove url (covered by doi already)
\|
\|	Using yapf for python formatting.
*	datacite: move common date patterns out of the loop	Martin Czygan	2019-12-28	1	-3/+4
\| \| \| \|	Additionally, try the unspecific (%Y) pattern last.
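A sketch of the idea using only the standard library: the pattern tuple is built once at module level, and the least specific pattern (`%Y`) comes last. The pattern list and the name `parse_date_sketch` are assumptions for illustration. The importer's real patterns and parser may differ.

```python
import datetime

# Defined once, outside any per-record loop; most specific pattern first.
COMMON_DATE_PATTERNS = ("%Y-%m-%d", "%Y-%m", "%Y")


def parse_date_sketch(value):
    """Return (date, matched_pattern), or (None, None) if nothing matches."""
    for pattern in COMMON_DATE_PATTERNS:
        try:
            parsed = datetime.datetime.strptime(value, pattern)
        except ValueError:
            continue
        return parsed.date(), pattern
    return None, None
```

With strict `strptime` matching, the order does not change which pattern wins. It matters more for lenient parsers, where an unspecific pattern tried early could mask a more precise one. Returning the matched pattern lets the caller tell a year-only date from a full one.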
*	improve datacite field mapping and import	Martin Czygan	2019-12-28	1	-41/+139
\|	The current version successfully imported a random sample of 100000 records (0.5%) from datacite.
\|
\|	The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporarily added to help debugging.
\|
\|	Add a few unit tests.
\|
\|	Some edge cases:
\|
\|	a) Existing keys without a value require a slightly awkward idiom:
\|
\|	```
\|	titles = attributes.get('titles', []) or []
\|	```
\|
\|	b) There can be 0, 1, or more titles (the first one wins).
\|
\|	c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates.
\|
\|	The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource knows even one more date (2019-06-05 10:14:43 by WIEWS update).
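Edge cases a) and b) can be illustrated with a short sketch. `dict.get` only uses its default when the key is absent, so a key that is present with a null value still returns `None`, hence the trailing `or []`. The helper name `first_title` and the assumed record shape (a list of objects with a `title` key) are illustrative, not taken from the importer.

```python
def first_title(attributes):
    """Return the first non-empty title, or None if there is none."""
    # Handles a missing key and a key that is present with a null value.
    titles = attributes.get("titles", []) or []
    for entry in titles:
        title = (entry or {}).get("title")
        if title:
            return title  # first one wins
    return None


# A present-but-null key bypasses the .get() default:
null_titles = {"titles": None}
assert null_titles.get("titles", []) is None
```

Without the `or []`, iterating over `titles` for such a record would raise a `TypeError`.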
*	datacite: add missing mappings and notes	Martin Czygan	2019-12-28	1	-266/+175
\|
*	datacite: basic field mappings	Martin Czygan	2019-12-28	1	-41/+181
\|	Currently using two external libraries:
\|	* dateparser
\|	* langcodes
\|	Note: This commit includes lots of WIP docs and field stats in comments, which should be removed.
*	datacite: importer skeleton	Martin Czygan	2019-12-28	2	-0/+459
\|	* contributors, title, date, publisher, container, license
\|	Field and value analysis via https://github.com/miku/indigo.
*	orcid: skip non-person ORCID records	Bryan Newbold	2019-12-26	1	-0/+4
\|
*	allow arabesque backfill ingests for some source types	Bryan Newbold	2019-12-24	1	-0/+5
\|
*	make chocula URL updates more conservative	Bryan Newbold	2019-12-24	1	-5/+5
\|
*	pubmed: if doing update, also do subtitle schema update	Bryan Newbold	2019-12-23	1	-1/+9
\|
*	pubmed: improve warning and stderr formatting	Bryan Newbold	2019-12-23	1	-5/+6
\|
*	pubmed: use standard identifier cleaners	Bryan Newbold	2019-12-23	1	-17/+14
\|
*	pubmed: remove unused extid mapping code	Bryan Newbold	2019-12-23	1	-29/+0
\|
*	pubmed: do reference lookups by default	Bryan Newbold	2019-12-23	1	-1/+1
\|
*	pubmed: null doi parsing check	Bryan Newbold	2019-12-23	1	-1/+1
\|
*	add basic MedlineDate year parsing	Bryan Newbold	2019-12-23	1	-0/+11
\|
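As an illustration of what basic MedlineDate year parsing might involve: MedlineDate is a free-text date field, with values such as "1998 Dec-1999 Jan" or "2000 Spring". The sketch below simply takes a leading four-digit year. `medline_year` is a hypothetical name, and the actual commit may handle more or fewer cases.

```python
import re

# A four-digit year at the start of the string, not followed by more digits.
LEADING_YEAR = re.compile(r"\s*(\d{4})\b")


def medline_year(medline_date):
    """Extract the leading year from a free-text MedlineDate, or None."""
    match = LEADING_YEAR.match(medline_date or "")
    return int(match.group(1)) if match else None
```

For a range like "1998 Dec-1999 Jan" this returns the first year and ignores the second.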