fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	improve release elasticsearch transform test coverage	Bryan Newbold	2020-12-16	2	-0/+2
\|
*	doaj: fix update code path (getattr not __dict__)	Bryan Newbold	2020-11-20	1	-1/+1
\| \| \| \|	Also add missing code coverage for update path (disabled by default).
*	initial implementation of DOAJ importer	Bryan Newbold	2020-11-19	1	-0/+5
\| \| \| \|	Several things to finish implementing and polish.
*	ingest: fix XML ingest test file	Bryan Newbold	2020-11-05	1	-1/+1
\|
*	ingest: progress on HTML ingest	Bryan Newbold	2020-11-05	1	-0/+1
\|
*	ingest: tests for basic XML ingest	Bryan Newbold	2020-11-05	1	-0/+1
\|
*	ingest: basic checks for ingest_type	Bryan Newbold	2020-11-05	1	-1/+1
\|
*	datacite: handle case of empty-string version	Bryan Newbold	2020-09-10	1	-1/+1
\| \| \| \| \|	Includes a tiny tweak to the datacite import sample file to test this code path.
*	fixes and test coverage for file_meta importer	Bryan Newbold	2020-08-21	1	-0/+7
\|
*	datacite importer: update test cases for 'Additional file' as component, not stub	Bryan Newbold	2020-08-11	5	-5/+5
\|
*	datacite import: figshare-specific hacks	Bryan Newbold	2020-08-11	1	-0/+1
\|
*	datacite: adjust tests	Martin Czygan	2020-07-10	4	-10/+6
\|
*	wip: contrib, GH59	Martin Czygan	2020-07-10	5	-3/+105
\|
*	datacite: address duplicated contributor issue	Martin Czygan	2020-07-07	4	-10/+93
\|	Use string comparison.
\|
\|	* https://fatcat.wiki/release/spjysmrnsrgyzgq6ise5o44rlu/contribs
\|	* https://api.datacite.org/dois/10.25940/roper-31098406
*	regression test for release_stage mismatch with ingest request	Bryan Newbold	2020-05-26	1	-1/+2
\|
*	datacite: fix type error	Martin Czygan	2020-04-22	2	-0/+76
\| \| \| \| \| \| \|	Up to now, we expected the description to be a string or list. Add handling for int as well. First appeared: Apr 22 19:58:39.
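The type handling described in this commit could look roughly like the sketch below. The function name `normalize_description` and its return shape (a list of strings) are assumptions for illustration, not the importer's actual code.

```python
def normalize_description(value):
    """Coerce a description given as str, int, or list into a list of strings.

    Hypothetical helper: name and return shape are assumptions.
    """
    if value is None:
        return []
    if isinstance(value, (str, int)):
        value = [value]
    if not isinstance(value, list):
        return []
    texts = []
    for item in value:
        # bool is a subclass of int in Python; skip it explicitly.
        if isinstance(item, bool):
            continue
        if isinstance(item, (str, int)):
            text = str(item).strip()
            if text:
                texts.append(text)
    return texts
```

Wrapping scalars in a list first keeps a single code path for all accepted input types.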
*	datacite: fix a raw name constraint violation	Martin Czygan	2020-04-20	2	-0/+77
\| \| \| \| \| \| \|	It was possible that contribs got added which had no raw name. One example would be a name consisting of whitespace only. This fix adds a final check for this case.
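A final check of this kind might be sketched as follows. It assumes contribs are plain dicts with a `raw_name` key; the helper name `drop_nameless_contribs` is hypothetical and not taken from the importer.

```python
def drop_nameless_contribs(contribs):
    """Remove contribs whose raw_name is missing, empty, or whitespace only.

    Hypothetical helper; assumes each contrib is a dict.
    """
    cleaned = []
    for contrib in contribs:
        raw_name = (contrib.get("raw_name") or "").strip()
        if not raw_name:
            continue
        # Keep the other fields, but store the stripped name.
        cleaned.append({**contrib, "raw_name": raw_name})
    return cleaned
```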
*	pubmed: handle multiple ReferenceList	Bryan Newbold	2020-03-20	1	-0/+206
\| \| \| \| \| \| \|	This resolves a situation noticed in prod where we were only importing/updating a single reference per article. Includes a regression test.
*	Merge branch 'martin-kafka-bs4-import' into 'master'	Martin Czygan	2020-03-10	2	-0/+0
\|\ \| \| \| \| \| \| \| \|	pubmed and arxiv harvest preparations. See merge request webgroup/fatcat!28
\| *	more pubmed adjustments	Martin Czygan	2020-02-22	2	-0/+0
\| \|	* regenerate map in continuous mode
\| \|	* add tests
* \|	Merge branch 'bnewbold-elastic-v03b'	Bryan Newbold	2020-02-26	3	-0/+3
\|\ \
\| * \|	fix some transform bugs, add some tests	Bryan Newbold	2020-01-29	3	-0/+3
\| \| \|
* \| \|	shadow import: more filtering of file_meta fields	Bryan Newbold	2020-02-13	1	-12/+10
\| \| \|
* \| \|	basic shadow importer	Bryan Newbold	2020-02-13	1	-0/+12
\| \|/ \|/\|
* \|	datacite: add exception for https://www.micropublication.org/	Martin Czygan	2020-01-31	1	-1/+2
\| \|
* \|	datacite: improve date handling and minor tweak	Martin Czygan	2020-01-30	2	-0/+110
\|/ \| \| \| \| \| \| \| \| \| \| \| \|	Records from https://www.micropublication.org/ did not have a date in FC, although raw data contained date strings - they were not using the finer-grained "attributes.date" but "attributes.published" and/or "attributes.publicationYear". Support for those fields has been added, including a test case. During this test (#30) a processing gap for names became clear (author may have "given_name" and "surname", but no "name"). This bug has been fixed, too.
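The date fallback described above can be sketched as below. The commit only names the fields loosely, so the exact layout is an assumption: a `dates` list of objects with a `date` key, then `published`, then `publicationYear`. The function name `pick_release_date` is hypothetical.

```python
def pick_release_date(attributes):
    """Return the first usable raw date string, or None.

    Assumed order: fine-grained dates list, then "published",
    then "publicationYear". The field layout is an assumption.
    """
    for entry in attributes.get("dates") or []:
        if isinstance(entry, dict) and entry.get("date"):
            return str(entry["date"])
    published = attributes.get("published")
    if published:
        return str(published)
    year = attributes.get("publicationYear")
    if year:
        return str(year)
    return None
```

Parsing the returned string into year, month, and day is left out of this sketch.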
*	do not normalize "en dash" in DOI	Martin Czygan	2020-01-17	1	-1/+1
\| \| \| \| \| \| \| \| \|	Technically, [...] DOI names may incorporate any printable characters from the Universal Character Set (UCS-2), of ISO/IEC 10646, which is the character set defined by Unicode (https://www.doi.org/doi_handbook/2_Numbering.html#2.5.1). For mostly QA reasons, we currently treat a DOI with an "en dash" as invalid.
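A minimal sketch of the policy stated above, rejecting rather than normalizing a DOI that contains an en dash. The name `clean_doi`, the prefix list, and the regular expression are assumptions for illustration, not the project's actual validation rules.

```python
import re
from typing import Optional

EN_DASH = "\u2013"
# Loose shape check; an assumption, not the project's real pattern.
DOI_PATTERN = re.compile(r"^10\.\d{3,6}/\S+$")


def clean_doi(raw: Optional[str]) -> Optional[str]:
    """Return a lower-cased DOI, or None if it looks invalid."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    # Do not rewrite the en dash to a hyphen; treat the DOI as invalid.
    if EN_DASH in doi:
        return None
    if not DOI_PATTERN.match(doi):
        return None
    return doi
```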
*	ingest: improve tests, support old ingest results	Bryan Newbold	2020-01-15	2	-1/+2
\|
*	datacite: ignore known unknown values in resourceType*	Martin Czygan	2020-01-09	2	-0/+94
\|
*	datacite: abstracts may be strings or list of strings	Martin Czygan	2020-01-09	4	-0/+186
\|
*	datacite: improve license_slug handling	Martin Czygan	2020-01-09	2	-1/+3
\|
*	datacite: add 'Unknown' to blacklist	Martin Czygan	2020-01-09	1	-7/+1
\|
*	datacite: get rid of schemaVersion	Martin Czygan	2020-01-09	17	-32/+14
\|
*	datacite: reformat test cases and use jq . --sort-keys	Martin Czygan	2020-01-08	54	-2299/+2301
\|
*	datacite: factor out contributor handling	Martin Czygan	2020-01-08	4	-0/+105
\|	Use values from:
\|
\|	* attributes.creators[]
\|	* attributes.contributors[]
*	datacite: adjust tests for release_month	Martin Czygan	2020-01-08	12	-12/+12
\|
*	datacite: mark additional files as stub	Martin Czygan	2020-01-08	2	-0/+72
\|
*	datacite: CCDC are entries, mostly	Martin Czygan	2020-01-08	1	-1/+1
\|
*	datacite: adding datacite-specific extra metadata	Martin Czygan	2020-01-07	30	-1468/+1570
\|	* attributes.metadataVersion
\|	* attributes.schemaVersion
\|	* attributes.version (source dependent values, follows suggestions in https://schema.datacite.org/meta/kernel-4.3/doc/DataCite-MetadataKernel_v4.3.pdf#page=26, but values vary)
\|
\|	Furthermore:
\|
\|	* attributes.types.resourceTypeGeneral
\|	* attributes.types.resourceType
*	datacite: month field should be top-level	Martin Czygan	2020-01-06	11	-14/+14
\|
*	datacite: include month in extra	Martin Czygan	2020-01-06	11	-11/+13
\| \| \| \| \|	> include release_month as a top-level extra field [...] to auto-populate the schema field from that
*	datacite: clean abstracts, use unknown value tokens	Martin Czygan	2020-01-06	3	-3/+3
\|	Datacite defines placeholders for unknown values:
\|
\|	* https://support.datacite.org/docs/schema-values-unknown-information-v43
\|
\|	Clean abstracts.
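Filtering placeholder tokens while cleaning abstracts might look like the sketch below. The token set is written from memory of the DataCite page linked above and should be checked against it; the name `clean_abstract` is hypothetical.

```python
# Placeholder tokens for unknown values; an assumption to verify
# against the DataCite documentation.
UNKNOWN_TOKENS = {
    "(:unac)", "(:unal)", "(:unap)", "(:unas)", "(:unav)",
    "(:unkn)", "(:none)", "(:null)", "(:tba)", "(:etal)",
}


def clean_abstract(text):
    """Collapse whitespace; return None for empty or placeholder text."""
    if not isinstance(text, str):
        return None
    collapsed = " ".join(text.split())
    if not collapsed or collapsed.lower() in UNKNOWN_TOKENS:
        return None
    return collapsed
```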
*	datacite: always include "datacite" key in extra	Martin Czygan	2020-01-04	14	-26/+26
\| \| \| \| \| \|	> always include extra values for the respective DOI registrars (datacite, crossref, jalc), even if they are empty ({}), to be used as a flag so we know which DOI registrar supplied the metadata.
*	datacite: remove --lang-detect flag	Martin Czygan	2020-01-03	5	-10/+15
\| \| \| \|	Estimated time for a single call is in the order of 50ms.
*	datacite: add another test case	Martin Czygan	2020-01-02	2	-0/+70
\|
*	datacite: open case for editing after creation	Martin Czygan	2020-01-02	1	-0/+2
\|
*	datacite: add helper script to create new test case	Martin Czygan	2020-01-02	1	-0/+14
\|
*	datacite: address raw_name index form comment	Martin Czygan	2020-01-02	19	-111/+111
\| \| \| \| \| \| \| \| \|	> The convention for display_name and raw_name is to be how the name would normally be printed, not in index form (surname comma given_name). So we might need to un-encode names like "Tricart, Pierre". Use an additional `index_form_to_display_name` function to convert index form to display form, heuristically.
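One possible heuristic for this conversion is sketched below. Only the function name comes from the commit message; the body is an assumption and the project's real rules may differ, for example for suffixes or organization names.

```python
def index_form_to_display_name(name: str) -> str:
    """Turn "Surname, Given" into "Given Surname", heuristically.

    Names without exactly one comma are returned unchanged apart
    from surrounding whitespace. The body is an illustrative guess.
    """
    if name.count(",") != 1:
        return name.strip()
    surname, given = (part.strip() for part in name.split(","))
    if not surname or not given:
        return name.strip()
    return f"{given} {surname}"
```

Requiring exactly one comma keeps the function from reordering strings such as comma-separated affiliations.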
*	datacite: add conversion fixtures	Martin Czygan	2020-01-02	49	-0/+3924
\|	The `test_datacite_conversions` function will compare an input (datacite) document to an expected output (release entity as JSON). This way, it should not be too hard to add more cases by adding: input, output - and by increasing the counter in the range loop within the test.
\|
\|	To view input and result side by side with vim, change into the test directory and run:
\|
\|	    tests/files/datacite $ ./caseview.sh 18
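The fixture pattern described in this commit, numbered input files compared against expected output inside a range loop, can be sketched as follows. The file naming scheme and the helper name `check_conversions` are assumptions; the real test is `test_datacite_conversions` and its layout may differ.

```python
import json
from pathlib import Path


def check_conversions(fixture_dir, convert, count):
    """Run convert() on cases 0..count-1 and return failing indices.

    Assumed (hypothetical) layout: case_000_input.json holds the
    source document, case_000_expected.json the expected result.
    """
    failures = []
    base = Path(fixture_dir)
    for i in range(count):
        raw = json.loads((base / f"case_{i:03d}_input.json").read_text())
        expected = json.loads((base / f"case_{i:03d}_expected.json").read_text())
        if convert(raw) != expected:
            failures.append(i)
    return failures
```

Adding a case then means dropping in two more files and raising `count` by one.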
*	improve datacite field mapping and import	Martin Czygan	2019-12-28	2	-0/+1
\|	The current version succeeded in importing a random sample of 100000 records (0.5%) from datacite.
\|
\|	The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporarily added to help debugging.
\|
\|	Add a few unit tests.
\|
\|	Some edge cases:
\|
\|	a) Existing keys without a value require a slightly awkward: ``` titles = attributes.get('titles', []) or [] ```
\|	b) There can be 0, 1, or more (first one wins) titles.
\|	c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates.
\|
\|	The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource knows even one more date (2019-06-05 10:14:43 by WIEWS update).
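Edge cases a) and b) can be illustrated with a short sketch. `dict.get` only falls back to its default when the key is absent, so a key that is present with a null value still yields `None`; the trailing `or []` covers that case. The helper name `first_title` and the assumption that each entry is a dict with a `title` key are illustrative.

```python
def first_title(attributes):
    """Return the first non-empty title, or None if there is none."""
    # A key present with value None bypasses the .get() default,
    # hence the extra "or []".
    titles = attributes.get("titles", []) or []
    for entry in titles:
        if isinstance(entry, dict) and entry.get("title"):
            return entry["title"]
    return None
```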