Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | Merge branch 'bnewbold-elastic-v03b' | Bryan Newbold | 2020-02-26 | 3 | -0/+3 |
|\ | |||||
| * | fix some transform bugs, add some tests | Bryan Newbold | 2020-01-29 | 3 | -0/+3 |
| | | |||||
* | | shadow import: more filtering of file_meta fields | Bryan Newbold | 2020-02-13 | 1 | -12/+10 |
| | | |||||
* | | basic shadow importer | Bryan Newbold | 2020-02-13 | 1 | -0/+12 |
| | | |||||
* | | datacite: add exception for https://www.micropublication.org/ | Martin Czygan | 2020-01-31 | 1 | -1/+2 |
| | | |||||
* | | datacite: improve date handling and minor tweak | Martin Czygan | 2020-01-30 | 2 | -0/+110 |
|/ | | | | | | | | | | | | | Records from https://www.micropublication.org/ did not have a date in FC, although the raw data contained date strings: they used "attributes.published" and/or "attributes.publicationYear" instead of the finer-grained "attributes.date". Support for those fields has been added, including a test case. During this test (#30) a processing gap for names became clear (an author may have "given_name" and "surname", but no "name"). This bug has been fixed, too. | ||||
* | do not normalize "en dash" in DOI | Martin Czygan | 2020-01-17 | 1 | -1/+1 |
| | | | | | | | | | Technically, [...] DOI names may incorporate any printable characters from the Universal Character Set (UCS-2), of ISO/IEC 10646, which is the character set defined by Unicode (https://www.doi.org/doi_handbook/2_Numbering.html#2.5.1). Mostly for QA reasons, we currently treat a DOI with an "en dash" as invalid. | ||||
* | ingest: improve tests, support old ingest results | Bryan Newbold | 2020-01-15 | 2 | -1/+2 |
| | |||||
* | datacite: ignore known unknown values in resourceType* | Martin Czygan | 2020-01-09 | 2 | -0/+94 |
| | |||||
* | datacite: abstracts may be strings or list of strings | Martin Czygan | 2020-01-09 | 4 | -0/+186 |
| | |||||
* | datacite: improve license_slug handling | Martin Czygan | 2020-01-09 | 2 | -1/+3 |
| | |||||
* | datacite: add 'Unknown' to blacklist | Martin Czygan | 2020-01-09 | 1 | -7/+1 |
| | |||||
* | datacite: get rid of schemaVersion | Martin Czygan | 2020-01-09 | 17 | -32/+14 |
| | |||||
* | datacite: reformat test cases and use jq . --sort-keys | Martin Czygan | 2020-01-08 | 54 | -2299/+2301 |
| | |||||
* | datacite: factor out contributor handling | Martin Czygan | 2020-01-08 | 4 | -0/+105 |
| | | | | | | | Use values from: * attributes.creators[] * attributes.contributors[] | ||||
* | datacite: adjust tests for release_month | Martin Czygan | 2020-01-08 | 12 | -12/+12 |
| | |||||
* | datacite: mark additional files as stub | Martin Czygan | 2020-01-08 | 2 | -0/+72 |
| | |||||
* | datacite: CCDC are entries, mostly | Martin Czygan | 2020-01-08 | 1 | -1/+1 |
| | |||||
* | datacite: adding datacite-specific extra metadata | Martin Czygan | 2020-01-07 | 30 | -1468/+1570 |
| | | | | | | | | | | | | | * attributes.metadataVersion * attributes.schemaVersion * attributes.version (source dependent values, follows suggestions in https://schema.datacite.org/meta/kernel-4.3/doc/DataCite-MetadataKernel_v4.3.pdf#page=26, but values vary) Furthermore: * attributes.types.resourceTypeGeneral * attributes.types.resourceType | ||||
* | datacite: month field should be top-level | Martin Czygan | 2020-01-06 | 11 | -14/+14 |
| | |||||
* | datacite: include month in extra | Martin Czygan | 2020-01-06 | 11 | -11/+13 |
| | | | | | > include release_month as a top-level extra field [...] to auto-populate the schema field from that | ||||
* | datacite: clean abstracts, use unknown value tokens | Martin Czygan | 2020-01-06 | 3 | -3/+3 |
| | | | | | | | | Datacite defines placeholders for unknown values: * https://support.datacite.org/docs/schema-values-unknown-information-v43 Clean abstracts. | ||||
* | datacite: always include "datacite" key in extra | Martin Czygan | 2020-01-04 | 14 | -26/+26 |
| | | | | | | > always include extra values for the respective DOI registrars (datacite, crossref, jalc), even if they are empty ({}), to be used as a flag so we know which DOI registrar supplied the metadata. | ||||
* | datacite: remove --lang-detect flag | Martin Czygan | 2020-01-03 | 5 | -10/+15 |
| | | | | Estimated time for a single call is on the order of 50 ms. | ||||
* | datacite: add another test case | Martin Czygan | 2020-01-02 | 2 | -0/+70 |
| | |||||
* | datacite: open case for editing after creation | Martin Czygan | 2020-01-02 | 1 | -0/+2 |
| | |||||
* | datacite: add helper script to create new test case | Martin Czygan | 2020-01-02 | 1 | -0/+14 |
| | |||||
* | datacite: address raw_name index form comment | Martin Czygan | 2020-01-02 | 19 | -111/+111 |
| | | | | | | | | | > The convention for display_name and raw_name is to be how the name would normally be printed, not in index form (surname comma given_name). So we might need to un-encode names like "Tricart, Pierre". Use an additional `index_form_to_display_name` function to convert index form to display form, heuristically. | ||||
* | datacite: add conversion fixtures | Martin Czygan | 2020-01-02 | 49 | -0/+3924 |
| | | | | | | | | | | | | | The `test_datacite_conversions` function will compare an input (datacite) document to an expected output (release entity as JSON). This way, it should not be too hard to add more cases by adding: input, output - and by increasing the counter in the range loop within the test. To view input and result side by side with vim, change into the test directory and run: tests/files/datacite $ ./caseview.sh 18 | ||||
* | improve datacite field mapping and import | Martin Czygan | 2019-12-28 | 2 | -0/+1 |
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current version successfully imported a random sample of 100000 records (0.5%) from datacite. The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporarily added to help debugging. Add a few unit tests. Some edge cases: a) Existing keys without a value require a slightly awkward idiom: ``` titles = attributes.get('titles', []) or [] ``` b) There can be 0, 1, or more (first one wins) titles. c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates. The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource lists yet another date (2019-06-05 10:14:43 by WIEWS update). | ||||
* | datacite: add simple test and fixture for datacite api interaction | Martin Czygan | 2019-12-27 | 1 | -0/+1 |
| | |||||
* | add regression test for medlinedate -> year parsing | Bryan Newbold | 2019-12-23 | 1 | -0/+95 |
| | |||||
* | add basic test for crossref harvest API call | Bryan Newbold | 2019-12-06 | 1 | -0/+1 |
| | |||||
* | ingest file result importer | Bryan Newbold | 2019-11-15 | 1 | -0/+1 |
| | |||||
* | release elasticsearch results: stage not status | Bryan Newbold | 2019-06-13 | 1 | -1/+1 |
| | |||||
* | JALC bulk file importer | Bryan Newbold | 2019-05-21 | 1 | -0/+100 |
| | |||||
* | basic JALC XML DOI metadata parser | Bryan Newbold | 2019-05-21 | 1 | -0/+176 |
| | |||||
* | basic JSTOR XML parser | Bryan Newbold | 2019-05-21 | 1 | -0/+58 |
| | |||||
* | basic arxivraw XML parser | Bryan Newbold | 2019-05-21 | 1 | -0/+31 |
| | |||||
* | basic pubmed parser | Bryan Newbold | 2019-05-21 | 1 | -0/+36822 |
| | |||||
* | fix releases/release_ids in math_universe.json test file | Bryan Newbold | 2019-05-20 | 1 | -1/+1 |
| | |||||
* | importer code updates | Bryan Newbold | 2019-05-13 | 1 | -1/+1 |
| | |||||
* | update example release JSON to new schema (ext_id, release_stage) | Bryan Newbold | 2019-05-13 | 2 | -11/+11 |
| | |||||
* | arabesque import tests | Bryan Newbold | 2019-04-18 | 2 | -0/+10 |
| | |||||
* | many web test improvements | Bryan Newbold | 2019-04-04 | 2 | -0/+2 |
| | |||||
* | more integration of transform refactor | Bryan Newbold | 2019-03-11 | 1 | -0/+10 |
| | |||||
* | crossref import tweaks/fixes | Bryan Newbold | 2019-01-29 | 1 | -0/+1 |
| | | | | | - refs: article-title not title; save unstructured; authors not author - save 'language' field (already an ISO code) | ||||
* | fix matched test vector | Bryan Newbold | 2019-01-28 | 1 | -1/+1 |
| | | | | this was resulting in a collision with default/example database objects. | ||||
* | update journal meta import/transform | Bryan Newbold | 2019-01-25 | 2 | -10/+20 |
| | |||||
* | tweak crossref import, and update tests | Bryan Newbold | 2019-01-24 | 1 | -4/+20 |
| |
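
Two of the edge cases described in the datacite commits above lend themselves to a short illustration: titles that are present but empty ("improve datacite field mapping and import"), and names stored in index form ("datacite: address raw_name index form comment"). The following is only a sketch under stated assumptions, not the importer's actual code. The helper `first_title` is hypothetical, the assumption that each entry in `titles` is a dict with a `title` key is mine, and the body of `index_form_to_display_name` is a guess at one possible heuristic; only the function name and the `or []` idiom come from the commit messages.

```python
from typing import Any, Dict, Optional


def first_title(attributes: Dict[str, Any]) -> Optional[str]:
    """Hypothetical helper: return the first usable title, or None.

    Assumes each entry in attributes["titles"] is a dict with a "title" key.
    """
    # The "or []" covers a key that exists but holds None (edge case a).
    titles = attributes.get("titles", []) or []
    for entry in titles:
        title = (entry or {}).get("title")
        if title:
            # First non-empty title wins (edge case b).
            return title.strip()
    return None


def index_form_to_display_name(name: str) -> str:
    """Illustrative heuristic only: turn "Surname, Given" into "Given Surname".

    Names without exactly one comma are returned unchanged, because
    something like "Smith, John, Jr." is ambiguous.
    """
    if name.count(",") != 1:
        return name
    surname, given = (part.strip() for part in name.split(","))
    if not surname or not given:
        return name
    return f"{given} {surname}"


if __name__ == "__main__":
    print(first_title({"titles": None}))
    print(first_title({"titles": [{"title": "A"}, {"title": "B"}]}))
    print(index_form_to_display_name("Tricart, Pierre"))
```

Being conservative about multi-comma names keeps the conversion from mangling suffixes or institutional names; a real importer would likely need more rules than this.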