fatcat - [no description]

	Commit message (Collapse)	Author	Age	Files	Lines
*	datacite: improve date handling and minor tweak	Martin Czygan	2020-01-30	2	-0/+110
	Records from https://www.micropublication.org/ did not have a date in FC, although the raw data contained date strings: they did not use the finer-grained "attributes.date", but "attributes.published" and/or "attributes.publicationYear". Support for those fields has been added, including a test case.
	During this test (#30) a processing gap for names became clear (an author may have "given_name" and "surname", but no "name"). This bug has been fixed, too.
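A minimal sketch of the two fallbacks described in this commit message, not the importer's actual code. The helper names `fallback_year` and `display_name` are hypothetical, and the key names are taken from the message as written:

```python
def fallback_year(attributes):
    """Pick a year from coarse date fields (hypothetical helper).

    Assumes 'published' and 'publicationYear' hold either an int or a
    string that starts with a four-digit year.
    """
    for key in ("published", "publicationYear"):
        value = attributes.get(key)
        if value:
            text = str(value)[:4]
            if text.isdigit():
                return int(text)
    return None


def display_name(author):
    """Build a name when only 'given_name' and 'surname' are present."""
    if author.get("name"):
        return author["name"]
    parts = [author.get("given_name"), author.get("surname")]
    joined = " ".join(p.strip() for p in parts if p and p.strip())
    return joined or None
```

For example, `fallback_year({"publicationYear": 2018})` yields the integer 2018, and an author with only "given_name" and "surname" still gets a printable name.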
*	do not normalize "en dash" in DOI	Martin Czygan	2020-01-17	1	-1/+1
	Technically, [...] DOI names may incorporate any printable characters from the Universal Character Set (UCS-2), of ISO/IEC 10646, which is the character set defined by Unicode (https://www.doi.org/doi_handbook/2_Numbering.html#2.5.1).
	For mostly QA reasons, we currently treat a DOI with an "en dash" as invalid.
*	ingest: improve tests, support old ingest results	Bryan Newbold	2020-01-15	2	-1/+2
\|
*	datacite: ignore known unknown values in resourceType*	Martin Czygan	2020-01-09	2	-0/+94
\|
*	datacite: abstracts may be strings or list of strings	Martin Czygan	2020-01-09	4	-0/+186
\|
*	datacite: improve license_slug handling	Martin Czygan	2020-01-09	2	-1/+3
\|
*	datacite: add 'Unknown' to blacklist	Martin Czygan	2020-01-09	1	-7/+1
\|
*	datacite: get rid of schemaVersion	Martin Czygan	2020-01-09	17	-32/+14
\|
*	datacite: reformat test cases and use jq . --sort-keys	Martin Czygan	2020-01-08	54	-2299/+2301
\|
*	datacite: factor out contributor handling	Martin Czygan	2020-01-08	4	-0/+105
	Use values from:
	* attributes.creators[]
	* attributes.contributors[]
*	datacite: adjust tests for release_month	Martin Czygan	2020-01-08	12	-12/+12
\|
*	datacite: mark additional files as stub	Martin Czygan	2020-01-08	2	-0/+72
\|
*	datacite: CCDC are entries, mostly	Martin Czygan	2020-01-08	1	-1/+1
\|
*	datacite: adding datacite-specific extra metadata	Martin Czygan	2020-01-07	30	-1468/+1570
	* attributes.metadataVersion
	* attributes.schemaVersion
	* attributes.version (source dependent values, follows suggestions in https://schema.datacite.org/meta/kernel-4.3/doc/DataCite-MetadataKernel_v4.3.pdf#page=26, but values vary)
	Furthermore:
	* attributes.types.resourceTypeGeneral
	* attributes.types.resourceType
*	datacite: month field should be top-level	Martin Czygan	2020-01-06	11	-14/+14
\|
*	datacite: include month in extra	Martin Czygan	2020-01-06	11	-11/+13
	> include release_month as a top-level extra field [...] to auto-populate the schema field from that
*	datacite: clean abstracts, use unknown value tokens	Martin Czygan	2020-01-06	3	-3/+3
	Datacite defines placeholders for unknown values:
	* https://support.datacite.org/docs/schema-values-unknown-information-v43
	Clean abstracts.
*	datacite: always include "datacite" key in extra	Martin Czygan	2020-01-04	14	-26/+26
	> always include extra values for the respective DOI registrars (datacite, crossref, jalc), even if they are empty ({}), to be used as a flag so we know which DOI registrar supplied the metadata.
*	datacite: remove --lang-detect flag	Martin Czygan	2020-01-03	5	-10/+15
	Estimated time for a single call is on the order of 50 ms.
*	datacite: add another test case	Martin Czygan	2020-01-02	2	-0/+70
\|
*	datacite: open case for editing after creation	Martin Czygan	2020-01-02	1	-0/+2
\|
*	datacite: add helper script to create new test case	Martin Czygan	2020-01-02	1	-0/+14
\|
*	datacite: address raw_name index form comment	Martin Czygan	2020-01-02	19	-111/+111
	> The convention for display_name and raw_name is to be how the name would normally be printed, not in index form (surname comma given_name). So we might need to un-encode names like "Tricart, Pierre".
	Use an additional `index_form_to_display_name` function to convert index form to display form, heuristically.
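A minimal sketch of such a heuristic, reusing the function name from the message. The body is an assumption and not the committed implementation: it only handles the single-comma case and leaves everything else untouched.

```python
def index_form_to_display_name(name):
    """Convert 'Surname, Given' to 'Given Surname' (heuristic sketch).

    Names without exactly one comma, or with an empty part on either
    side of it, are returned unchanged.
    """
    if name.count(",") != 1:
        return name
    surname, given = (part.strip() for part in name.split(","))
    if not surname or not given:
        return name
    return f"{given} {surname}"
```

With this sketch, "Tricart, Pierre" becomes "Pierre Tricart", while a name that is already in display form passes through as-is.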
*	datacite: add conversion fixtures	Martin Czygan	2020-01-02	49	-0/+3924
	The `test_datacite_conversions` function will compare an input (datacite) document to an expected output (release entity as JSON). This way, it should not be too hard to add more cases: add an input file and an output file, and increase the counter in the range loop within the test.
	To view input and result side by side with vim, change into the test directory and run:
	tests/files/datacite $ ./caseview.sh 18
*	improve datacite field mapping and import	Martin Czygan	2019-12-28	2	-0/+1
	The current version successfully imported a random sample of 100000 records (0.5%) from datacite.
	The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporarily added to help debugging.
	Add a few unit tests.
	Some edge cases:
	a) Existing keys without a value require a slightly awkward:
	```
	titles = attributes.get('titles', []) or []
	```
	b) There can be 0, 1, or more (first one wins) titles.
	c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates. The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource knows even one more date (2019-06-05 10:14:43 by WIEWS update).
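Edge cases (a) and (b) above can be sketched as follows. The helper name `first_title` is hypothetical, and the sketch assumes each entry in `titles` is a dict with a "title" key; the importer itself may handle this differently.

```python
def first_title(attributes):
    """Return the first non-empty title, or None (hypothetical helper)."""
    # The key may be present with a null value, so `.get(key, [])` alone
    # is not enough; `or []` also covers that case.
    titles = attributes.get("titles", []) or []
    for entry in titles:
        title = (entry or {}).get("title")
        if title and title.strip():
            # First one wins.
            return title.strip()
    return None
```

A record with `"titles": null` and a record without the key both yield `None`, and a record with several titles yields the first one.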
*	datacite: add simple test and fixture for datacite api interaction	Martin Czygan	2019-12-27	1	-0/+1
\|
*	add regression test for medlinedate -> year parsing	Bryan Newbold	2019-12-23	1	-0/+95
\|
*	add basic test for crossref harvest API call	Bryan Newbold	2019-12-06	1	-0/+1
\|
*	ingest file result importer	Bryan Newbold	2019-11-15	1	-0/+1
\|
*	release elasticsearch results: stage not status	Bryan Newbold	2019-06-13	1	-1/+1
\|
*	JALC bulk file importer	Bryan Newbold	2019-05-21	1	-0/+100
\|
*	basic JALC XML DOI metadata parser	Bryan Newbold	2019-05-21	1	-0/+176
\|
*	basic JSTOR XML parser	Bryan Newbold	2019-05-21	1	-0/+58
\|
*	basic arxivraw XML parser	Bryan Newbold	2019-05-21	1	-0/+31
\|
*	basic pubmed parser	Bryan Newbold	2019-05-21	1	-0/+36822
\|
*	fix releases/release_ids in math_universe.json test file	Bryan Newbold	2019-05-20	1	-1/+1
\|
*	importer code updates	Bryan Newbold	2019-05-13	1	-1/+1
\|
*	update example release JSON to new schema (ext_id, release_stage)	Bryan Newbold	2019-05-13	2	-11/+11
\|
*	arabesque import tests	Bryan Newbold	2019-04-18	2	-0/+10
\|
*	many web test improvements	Bryan Newbold	2019-04-04	2	-0/+2
\|
*	more integration of transform refactor	Bryan Newbold	2019-03-11	1	-0/+10
\|
*	crossref import tweaks/fixes	Bryan Newbold	2019-01-29	1	-0/+1
	- refs: article-title not title; save unstructured; authors not author
	- save 'language' field (already an ISO code)
*	fix matched test vector	Bryan Newbold	2019-01-28	1	-1/+1
	This was resulting in a collision with default/example database objects.
*	update journal meta import/transform	Bryan Newbold	2019-01-25	2	-10/+20
\|
*	tweak crossref import, and update tests	Bryan Newbold	2019-01-24	1	-4/+20
\|
*	allow importing contrib/refs lists	Bryan Newbold	2019-01-24	1	-0/+0
	The motivation here isn't really to support these gigantic lists on principle, but to be able to ingest large corpuses without having to decide whether to filter out or crop such lists.
*	crossref importer updates	Bryan Newbold	2019-01-22	1	-1/+1
\|
*	fix file extraction (and transforms)	Bryan Newbold	2018-11-26	1	-0/+1
\|
*	improvements to grobid_metadata importer	Bryan Newbold	2018-09-27	1	-0/+10
	But still fails tests due to database collision/side-effect on sha1 lookup.
*	more python example files	Bryan Newbold	2018-09-22	2	-0/+424
\|