| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | datacite: ignore known unknown values in resourceType* | Martin Czygan | 2020-01-09 | 2 | -0/+94 |
| | |||||
* | datacite: abstracts may be strings or list of strings | Martin Czygan | 2020-01-09 | 4 | -0/+186 |
| | |||||
* | datacite: improve license_slug handling | Martin Czygan | 2020-01-09 | 2 | -1/+3 |
| | |||||
* | datacite: add 'Unknown' to blacklist | Martin Czygan | 2020-01-09 | 1 | -7/+1 |
| | |||||
* | datacite: get rid of schemaVersion | Martin Czygan | 2020-01-09 | 17 | -32/+14 |
| | |||||
* | datacite: reformat test cases and use jq . --sort-keys | Martin Czygan | 2020-01-08 | 54 | -2299/+2301 |
| | |||||
* | datacite: factor out contributor handling | Martin Czygan | 2020-01-08 | 4 | -0/+105 |
| | Use values from: `attributes.creators[]` and `attributes.contributors[]`. | | | | |
* | datacite: adjust tests for release_month | Martin Czygan | 2020-01-08 | 12 | -12/+12 |
| | |||||
* | datacite: mark additional files as stub | Martin Czygan | 2020-01-08 | 2 | -0/+72 |
| | |||||
* | datacite: CCDC are entries, mostly | Martin Czygan | 2020-01-08 | 1 | -1/+1 |
| | |||||
* | datacite: adding datacite-specific extra metadata | Martin Czygan | 2020-01-07 | 30 | -1468/+1570 |
| | `attributes.metadataVersion`; `attributes.schemaVersion`; `attributes.version` (source-dependent values, follows suggestions in https://schema.datacite.org/meta/kernel-4.3/doc/DataCite-MetadataKernel_v4.3.pdf#page=26, but values vary). Furthermore: `attributes.types.resourceTypeGeneral`; `attributes.types.resourceType`. | | | | |
* | datacite: month field should be top-level | Martin Czygan | 2020-01-06 | 11 | -14/+14 |
| | |||||
* | datacite: include month in extra | Martin Czygan | 2020-01-06 | 11 | -11/+13 |
| | > include release_month as a top-level extra field [...] to auto-populate the schema field from that | | | | |
* | datacite: clean abstracts, use unknown value tokens | Martin Czygan | 2020-01-06 | 3 | -3/+3 |
| | Datacite defines placeholders for unknown values (https://support.datacite.org/docs/schema-values-unknown-information-v43). Clean abstracts. | | | | |
* | datacite: always include "datacite" key in extra | Martin Czygan | 2020-01-04 | 14 | -26/+26 |
| | > always include extra values for the respective DOI registrars (datacite, crossref, jalc), even if they are empty ({}), to be used as a flag so we know which DOI registrar supplied the metadata. | | | | |
* | datacite: remove --lang-detect flag | Martin Czygan | 2020-01-03 | 5 | -10/+15 |
| | Estimated time for a single call is on the order of 50 ms. | | | | |
* | datacite: add another test case | Martin Czygan | 2020-01-02 | 2 | -0/+70 |
| | |||||
* | datacite: open case for editing after creation | Martin Czygan | 2020-01-02 | 1 | -0/+2 |
| | |||||
* | datacite: add helper script to create new test case | Martin Czygan | 2020-01-02 | 1 | -0/+14 |
| | |||||
* | datacite: address raw_name index form comment | Martin Czygan | 2020-01-02 | 19 | -111/+111 |
| | > The convention for display_name and raw_name is to be how the name would normally be printed, not in index form (surname comma given_name). So we might need to un-encode names like "Tricart, Pierre". Use an additional `index_form_to_display_name` function to convert index form to display form, heuristically. | | | | |
* | datacite: add conversion fixtures | Martin Czygan | 2020-01-02 | 49 | -0/+3924 |
| | The `test_datacite_conversions` function compares an input (datacite) document to an expected output (release entity as JSON). This way, it should not be too hard to add more cases: add an input and an output file, and increase the counter in the range loop within the test. To view input and result side by side with vim, change into the test directory and run: `tests/files/datacite $ ./caseview.sh 18` | | | | |
* | improve datacite field mapping and import | Martin Czygan | 2019-12-28 | 2 | -0/+1 |
| | The current version successfully imported a random sample of 100000 records (0.5%) from datacite. The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporarily added to help debugging. Add a few unit tests. Some edge cases: a) Existing keys without a value require a slightly awkward idiom: `titles = attributes.get('titles', []) or []`. b) There can be 0, 1, or more titles (the first one wins). c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates. The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource lists one more date (2019-06-05 10:14:43 by WIEWS update). | | | | |
* | datacite: add simple test and fixture for datacite api interaction | Martin Czygan | 2019-12-27 | 1 | -0/+1 |
| | |||||
* | add regression test for medlinedate -> year parsing | Bryan Newbold | 2019-12-23 | 1 | -0/+95 |
| | |||||
* | add basic test for crossref harvest API call | Bryan Newbold | 2019-12-06 | 1 | -0/+1 |
| | |||||
* | ingest file result importer | Bryan Newbold | 2019-11-15 | 1 | -0/+1 |
| | |||||
* | release elasticsearch results: stage not status | Bryan Newbold | 2019-06-13 | 1 | -1/+1 |
| | |||||
* | JALC bulk file importer | Bryan Newbold | 2019-05-21 | 1 | -0/+100 |
| | |||||
* | basic JALC XML DOI metadata parser | Bryan Newbold | 2019-05-21 | 1 | -0/+176 |
| | |||||
* | basic JSTOR XML parser | Bryan Newbold | 2019-05-21 | 1 | -0/+58 |
| | |||||
* | basic arxivraw XML parser | Bryan Newbold | 2019-05-21 | 1 | -0/+31 |
| | |||||
* | basic pubmed parser | Bryan Newbold | 2019-05-21 | 1 | -0/+36822 |
| | |||||
* | fix releases/release_ids in math_universe.json test file | Bryan Newbold | 2019-05-20 | 1 | -1/+1 |
| | |||||
* | importer code updates | Bryan Newbold | 2019-05-13 | 1 | -1/+1 |
| | |||||
* | update example release JSON to new schema (ext_id, release_stage) | Bryan Newbold | 2019-05-13 | 2 | -11/+11 |
| | |||||
* | arabesque import tests | Bryan Newbold | 2019-04-18 | 2 | -0/+10 |
| | |||||
* | many web test improvements | Bryan Newbold | 2019-04-04 | 2 | -0/+2 |
| | |||||
* | more integration of transform refactor | Bryan Newbold | 2019-03-11 | 1 | -0/+10 |
| | |||||
* | crossref import tweaks/fixes | Bryan Newbold | 2019-01-29 | 1 | -0/+1 |
| | refs: article-title not title; save unstructured; authors not author. Save 'language' field (already an ISO code). | | | | |
* | fix matched test vector | Bryan Newbold | 2019-01-28 | 1 | -1/+1 |
| | this was resulting in a collision with default/example database objects. | | | | |
* | update journal meta import/transform | Bryan Newbold | 2019-01-25 | 2 | -10/+20 |
| | |||||
* | tweak crossref import, and update tests | Bryan Newbold | 2019-01-24 | 1 | -4/+20 |
| | |||||
* | allow importing contrib/refs lists | Bryan Newbold | 2019-01-24 | 1 | -0/+0 |
| | The motivation here isn't really to support these gigantic lists on principle, but to be able to ingest large corpuses without having to decide whether to filter out or crop such lists. | | | | |
* | crossref importer updates | Bryan Newbold | 2019-01-22 | 1 | -1/+1 |
| | |||||
* | fix file extraction (and transforms) | Bryan Newbold | 2018-11-26 | 1 | -0/+1 |
| | |||||
* | improvements to grobid_metadata importer | Bryan Newbold | 2018-09-27 | 1 | -0/+10 |
| | But still fails tests due to database collision/side-effect on sha1 lookup. | | | | |
* | more python example files | Bryan Newbold | 2018-09-22 | 2 | -0/+424 |
| | |||||
* | more matched tests | Bryan Newbold | 2018-09-14 | 1 | -0/+10 |
| | |||||
* | switch manifest importer to be json-based | Bryan Newbold | 2018-09-14 | 1 | -3/+3 |
| | |||||
* | fixes to matched importer (and a test) | Bryan Newbold | 2018-09-14 | 1 | -0/+3 |
| |
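
Several of the datacite commits above describe small normalization rules: keys that exist but hold no value, "first one wins" for titles, abstracts that may be a string or a list of strings, and converting names from index form ("Tricart, Pierre") to display form. The following is a minimal sketch of those rules, not the importer's actual code. Only the name `index_form_to_display_name` and the `attributes.get('titles', []) or []` idiom come from the commit messages; the function bodies, the helpers `first_title` and `as_string_list`, and the assumption that each title entry is a dict with a `title` key are illustrative guesses.

```python
from typing import Any, Dict, List, Optional


def index_form_to_display_name(name: str) -> str:
    """Heuristically turn "Surname, Given" into "Given Surname".

    Hypothetical body: only the function name appears in the commit log.
    Names without exactly one comma are returned unchanged (stripped).
    """
    if name.count(",") != 1:
        return name.strip()
    surname, given = (part.strip() for part in name.split(","))
    # Drop an empty half, e.g. "Tricart," becomes "Tricart".
    return " ".join(part for part in (given, surname) if part)


def first_title(attributes: Dict[str, Any]) -> Optional[str]:
    """Return the first title, or None if there are no titles.

    The trailing `or []` handles keys that exist but hold None.
    Assumes (not confirmed by the log) that entries look like {"title": ...}.
    """
    titles = attributes.get("titles", []) or []
    if not titles:
        return None
    return titles[0].get("title")


def as_string_list(value: Any) -> List[str]:
    """Normalize a value that may be None, a string, or a list of strings."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    # Keep only string items; anything else is silently skipped.
    return [item for item in value if isinstance(item, str)]


if __name__ == "__main__":
    print(index_form_to_display_name("Tricart, Pierre"))
    print(first_title({"titles": None}))
    print(as_string_list("a single abstract"))
```

Leaving names with more than one comma untouched is a deliberately conservative choice in this sketch: suffixes and institutional names make a simple swap unreliable there.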