fatcat - [no description]

	Commit message (Collapse)	Author	Age	Files	Lines
...
* \|	datacite: improve date handling and minor tweak	Martin Czygan	2020-01-30	1	-19/+42
\|/	Records from https://www.micropublication.org/ did not have a date in FC, although raw data contained date strings - they were not using the finer-grained "attributes.date" but "attributes.published" and/or "attributes.publicationYear". Support for those fields has been added, including a test case. During this test (#30) a processing gap for names became clear (author may have "given_name" and "surname", but no "name"). This bug has been fixed, too.
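A minimal sketch of the date fallback described in this commit message. It assumes "published" holds a string that starts with a four-digit year and "publicationYear" holds a year as a number or string; the function name and these field shapes are illustrative assumptions, not the importer's actual code.

```python
import re
from typing import Optional


def fallback_release_year(attributes: dict) -> Optional[int]:
    """Return a year from the coarser fields, or None if neither is usable.

    Hypothetical helper: only the field names come from the commit message.
    """
    for key in ("published", "publicationYear"):
        value = attributes.get(key)
        if value is None:
            continue
        # Accept values such as "2019-05-17", "2019" or 2019.
        match = re.match(r"\s*(\d{4})", str(value))
        if match:
            return int(match.group(1))
    return None
```

A real importer would presumably try the finer-grained "attributes.date" first and only then fall back to these fields.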
*	fix KafkaError worker reporting for partition errors	Bryan Newbold	2020-01-29	3	-3/+3
\|
*	additional DOI prefix filters	Bryan Newbold	2020-01-28	1	-0/+8
\|	From martin, thanks.
*	apply ingest request filtering in entity worker	Bryan Newbold	2020-01-28	1	-3/+34
\|	`ingest_oa_only` behavior, and other filters, now handled in the entity update worker, instead of in the transform function. Also add a DOI prefix blocklist feature.
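A DOI prefix blocklist of the kind mentioned in this commit could be sketched as follows. The prefixes and the function name are hypothetical placeholders chosen for illustration; they are not the project's real list or API.

```python
from typing import Iterable

# Hypothetical prefixes for illustration only; not the project's real list.
EXAMPLE_DOI_PREFIX_BLOCKLIST = ("10.9999/", "10.8888/example.")


def doi_prefix_blocked(
    doi: str, blocklist: Iterable[str] = EXAMPLE_DOI_PREFIX_BLOCKLIST
) -> bool:
    """Return True if the DOI starts with any blocked prefix."""
    # DOIs are case-insensitive, so compare in lowercase.
    normalized = doi.strip().lower()
    return any(normalized.startswith(prefix) for prefix in blocklist)
```

A worker would skip emitting an ingest request whenever this check returns True.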
*	remove 'oa_only' feature from ingest transform	Bryan Newbold	2020-01-28	1	-14/+1
\|	Refactoring to move this filter elsewhere
*	fix trivial typo in file importer	Bryan Newbold	2020-01-20	1	-1/+1
\|
*	normal: DOI corner-case from pubmed import	Bryan Newbold	2020-01-19	1	-0/+9
\|
*	do not normalize "en dash" in DOI	Martin Czygan	2020-01-17	1	-2/+5
\|	Technically, [...] DOI names may incorporate any printable characters from the Universal Character Set (UCS-2), of ISO/IEC 10646, which is the character set defined by Unicode (https://www.doi.org/doi_handbook/2_Numbering.html#2.5.1). For mostly QA reasons, we currently treat a DOI with an "en dash" as invalid.
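The policy stated here (reject a DOI containing an en dash rather than rewriting it) can be illustrated with a small sketch. The function name is an assumption made for this example and is not the project's `clean_doi` implementation.

```python
from typing import Optional

EN_DASH = "\u2013"  # U+2013, distinct from the ASCII hyphen-minus "-"


def reject_en_dash(doi: str) -> Optional[str]:
    """Return the DOI unchanged, or None if it contains an en dash."""
    if EN_DASH in doi:
        # Treated as invalid instead of being normalized to "-".
        return None
    return doi
```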
*	ingest: improve tests, support old ingest results	Bryan Newbold	2020-01-15	1	-3/+12
\|
*	update ingest worker for schema tweaks	Bryan Newbold	2020-01-15	1	-8/+15
\|	Should be backwards compatible with old ingest results. Fixed a bug with glutton ident detection.
*	ingest: allow more sources to auto-import	Bryan Newbold	2020-01-15	1	-1/+2
\|
*	datacite: skip records without a doi	Martin Czygan	2020-01-13	1	-0/+4
\|
*	datacite: add entry to license slug map	Martin Czygan	2020-01-09	1	-0/+1
\|
*	datacite: ignore known unknown values in resourceType*	Martin Czygan	2020-01-09	1	-2/+2
\|
*	datacite: abstracts may be strings or list of strings	Martin Czygan	2020-01-09	1	-2/+15
\|
*	datacite: improve license_slug handling	Martin Czygan	2020-01-09	1	-60/+101
\|
*	datacite: add 'Unknown' to blacklist	Martin Czygan	2020-01-09	1	-1/+5
\|
*	datacite: get rid of schemaVersion	Martin Czygan	2020-01-09	1	-3/+0
\|
*	Merge branch 'martin-datacite-import'	Martin Czygan	2020-01-08	2	-0/+1024
\|\	Pipfile.lock is broken.
\| \|	* martin-datacite-import: (68 commits)
\| \|	  datacite: pass in doi into factored out method
\| \|	  datacite: reformat test cases and use jq . --sort-keys
\| \|	  datacite: factor out contributor handling
\| \|	  datacite: catch type mismatch in language detection
\| \|	  datacite: adjust tests for release_month
\| \|	  datacite: name extra.month, extra.release_month
\| \|	  datacite: mark additional files as stub
\| \|	  datacite: CCDC are entries, mostly
\| \|	  datacite: use more specific release_type, if possible
\| \|	  datacite: ignore certain names
\| \|	  datacite: over 3% records have the same title: stub
\| \|	  datacite: fill a few more release_type gaps
\| \|	  datacite: adding datacite-specific extra metadata
\| \|	  datacite: apply pylint suggestions
\| \|	  datacite: fix typos
\| \|	  datacite: set release_stage to published by default
\| \|	  datacite: month field should be top-level
\| \|	  datacite: include month in extra
\| \|	  datacite: indicate mismatched file in test
\| \|	  datacite: clean abstracts, use unknown value tokens
\| \|	  ...
\| *	datacite: pass in doi into factored out method	Martin Czygan	2020-01-08	1	-2/+3
\| \|
\| *	datacite: factor out contributor handling	Martin Czygan	2020-01-08	1	-80/+103
\| \|	Use values from:
\| \|	* attributes.creators[]
\| \|	* attributes.contributors[]
\| *	datacite: catch type mismatch in language detection	Martin Czygan	2020-01-08	1	-3/+2
\| \|
\| *	datacite: name extra.month, extra.release_month	Martin Czygan	2020-01-08	1	-1/+3
\| \|
\| *	datacite: mark additional files as stub	Martin Czygan	2020-01-08	1	-0/+4
\| \|
\| *	datacite: CCDC are entries, mostly	Martin Czygan	2020-01-08	1	-0/+4
\| \|
\| *	datacite: use more specific release_type, if possible	Martin Czygan	2020-01-08	1	-0/+6
\| \|
\| *	datacite: ignore certain names	Martin Czygan	2020-01-08	1	-0/+6
\| \|
\| *	datacite: over 3% records have the same title: stub	Martin Czygan	2020-01-08	1	-0/+7
\| \|	The GBIF (https://www.gbif.org/) deposits most records under the titles:
\| \|	* 599243 GBIF Occurrence Download
\| \|	* 41176 Occurrence Download
\| \|	Mark them as "stub" for the moment (https://guide.fatcat.wiki/entity_release.html#release_type-vocabulary).
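The title-based rule in this commit message can be sketched as a simple lookup. The two titles come from the message itself; the function name, the `default` parameter and the whitespace handling are assumptions made for illustration.

```python
from typing import Optional

# Titles named in the commit message as accounting for over 3% of records.
STUB_TITLES = {"GBIF Occurrence Download", "Occurrence Download"}


def release_type_for_title(
    title: Optional[str], default: Optional[str] = None
) -> Optional[str]:
    """Return "stub" for the known boilerplate titles, else the default."""
    if title and title.strip() in STUB_TITLES:
        return "stub"
    return default
```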
\| *	datacite: fill a few more release_type gaps	Martin Czygan	2020-01-08	1	-17/+18
\| \|	* citeproc: http://docs.citationstyles.org/en/stable/specification.html#appendix-iii-types
\| \|	* resourceTypeGeneral: https://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_v4.0.pdf#page=32
\| \|	* resourceType: uncontrolled, over 170000 distinct values, frequent: null, Dataset, JournalArticle, PGRFA Material, Journal Article, Dataset/UNITE Species Hypothesis, ...
\| \|	General frequency:
\| \|	* "attributes.types": 18210075,
\| \|	* "attributes.types.ris": 18058890,
\| \|	* "attributes.types.bibtex": 18058888,
\| \|	* "attributes.types.citeproc": 18058890,
\| \|	* "attributes.types.schemaOrg": 18058929,
\| \|	* "attributes.types.resourceType": 12737988,
\| \|	* "attributes.types.resourceTypeGeneral": 16576139,
\| *	datacite: adding datacite-specific extra metadata	Martin Czygan	2020-01-07	1	-0/+28
\| \|	* attributes.metadataVersion
\| \|	* attributes.schemaVersion
\| \|	* attributes.version (source dependent values, follows suggestions in https://schema.datacite.org/meta/kernel-4.3/doc/DataCite-MetadataKernel_v4.3.pdf#page=26, but values vary)
\| \|	Furthermore:
\| \|	* attributes.types.resourceTypeGeneral
\| \|	* attributes.types.resourceType
\| *	datacite: apply pylint suggestions	Martin Czygan	2020-01-07	1	-8/+10
\| \|
\| *	datacite: fix typos	Martin Czygan	2020-01-07	1	-1/+1
\| \|
\| *	datacite: set release_stage to published by default	Martin Czygan	2020-01-06	1	-4/+5
\| \|	Set to `None` only if there is no publisher yet. Docs: https://support.datacite.org/docs/doi-states
\| *	datacite: month field should be top-level	Martin Czygan	2020-01-06	1	-2/+2
\| \|
\| *	datacite: include month in extra	Martin Czygan	2020-01-06	1	-0/+2
\| \|	> include release_month as a top-level extra field [...] to auto-populate the schema field from that
\| *	datacite: clean abstracts, use unknown value tokens	Martin Czygan	2020-01-06	1	-4/+26
\| \|	Datacite defines placeholders for unknown values:
\| \|	* https://support.datacite.org/docs/schema-values-unknown-information-v43
\| \|	Clean abstracts.
\| *	datacite: clean abstract as well	Martin Czygan	2020-01-06	1	-1/+1
\| \|
\| *	datacite: filter out 'Cites' relation as well	Martin Czygan	2020-01-06	1	-1/+1
\| \|
\| *	datacite: always include "datacite" key in extra	Martin Czygan	2020-01-04	1	-2/+2
\| \|	> always include extra values for the respective DOI registrars (datacite, crossref, jalc), even if they are empty ({}), to be used as a flag so we know which DOI registrar supplied the metadata.
\| *	datacite: use normal.clean_doi	Martin Czygan	2020-01-03	1	-11/+1
\| \|
\| *	datacite: parse_datacite_dates returns month	Martin Czygan	2020-01-03	1	-10/+35
\| \|	As [...] we will soon add support for release_month field in the release schema.
\| *	datacite: prepare release_month (stub)	Martin Czygan	2020-01-03	1	-10/+10
\| \|
\| *	datacite: lowercase only once	Martin Czygan	2020-01-03	1	-3/+4
\| \|
\| *	datacite: remove --lang-detect flag	Martin Czygan	2020-01-03	1	-11/+6
\| \|	Estimated time for a single call is in the order of 50ms.
\| *	datacite: address raw_name index form comment	Martin Czygan	2020-01-02	1	-0/+43
\| \|	> The convention for display_name and raw_name is to be how the name would normally be printed, not in index form (surname comma given_name). So we might need to un-encode names like "Tricart, Pierre".
\| \|	Use an additional `index_form_to_display_name` function to convert index form to display form, heuristically.
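One possible heuristic for this conversion is sketched below. It is not the project's `index_form_to_display_name`; the function name and the "exactly one comma" rule are assumptions chosen to keep the example small.

```python
def display_name_from_index_form(name: str) -> str:
    """Heuristically turn "Surname, Given" into "Given Surname".

    Names without exactly one comma are returned stripped but otherwise
    unchanged, since they may be organizations or lists of names.
    """
    if name.count(",") != 1:
        return name.strip()
    surname, given = (part.strip() for part in name.split(","))
    if not surname or not given:
        # A dangling comma gives nothing to reorder.
        return name.strip()
    return f"{given} {surname}"
```

With this rule, "Tricart, Pierre" becomes "Pierre Tricart", while a name already in display form passes through untouched.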
\| *	datacite: add two more skipable tokens	Martin Czygan	2020-01-02	1	-1/+1
\| \|
\| *	datacite: names can be 'Unav', too	Martin Czygan	2020-01-02	1	-1/+4
\| \|
\| *	datacite: avoid more None values	Martin Czygan	2020-01-01	1	-4/+4
\| \|
\| *	datacite: address 'Unpublished' publisher	Martin Czygan	2019-12-31	1	-9/+10
\| \|
\| *	datacite: ensure name schema is defined	Martin Czygan	2019-12-31	1	-1/+2
\| \|