Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | datacite: add two more skippable tokens | Martin Czygan | 2020-01-02 | 1 | -1/+1 |
| | |||||
* | datacite: add conversion fixtures | Martin Czygan | 2020-01-02 | 50 | -1/+3949 |
| | The `test_datacite_conversions` function compares an input (datacite) document to an expected output (release entity as JSON). To add a case, add an input file and an output file and increase the counter in the range loop within the test. To view input and result side by side with vim, change into the test directory and run: `tests/files/datacite $ ./caseview.sh 18` | ||||
* | datacite: names can be 'Unav', too | Martin Czygan | 2020-01-02 | 1 | -1/+4 |
| | |||||
* | datacite: avoid more None values | Martin Czygan | 2020-01-01 | 1 | -4/+4 |
| | |||||
* | datacite: address 'Unpublished' publisher | Martin Czygan | 2019-12-31 | 1 | -9/+10 |
| | |||||
* | datacite: ensure name schema is defined | Martin Czygan | 2019-12-31 | 1 | -1/+2 |
| | |||||
* | datacite: fix typo | Martin Czygan | 2019-12-31 | 1 | -1/+1 |
| | |||||
* | datacite: isascii was added in 3.7, only | Martin Czygan | 2019-12-31 | 1 | -1/+7 |
| | |||||
* | datacite: skip non-ascii doi for now | Martin Czygan | 2019-12-31 | 1 | -0/+4 |
| | Example of a non-ASCII DOI: https://doi.org/10.13125/américacrítica/3017 | ||||
* | datacite: clean doi | Martin Czygan | 2019-12-31 | 1 | -1/+13 |
| | Address issue with an EN DASH in the DOI: "external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.25513/1812-3996.2017.1.34–42" | ||||
* | datacite: update docs | Martin Czygan | 2019-12-31 | 1 | -9/+9 |
| | |||||
* | datacite: perform additional checks on contrib | Martin Czygan | 2019-12-30 | 1 | -3/+9 |
| | |||||
* | datacite: check for empty title after clean | Martin Czygan | 2019-12-29 | 1 | -2/+5 |
| | |||||
* | datacite: update docs with observed values | Martin Czygan | 2019-12-29 | 1 | -1/+3 |
| | |||||
* | datacite: page number misses are too common | Martin Czygan | 2019-12-28 | 1 | -1/+2 |
| | Should be logged at debug level, not info. Examples: E675, n/a, 15D.2.1, 15D.2.1, A.1E.1, A.1E.1, ... | ||||
* | datacite: suppress debug-like language lookup miss message | Martin Czygan | 2019-12-28 | 1 | -1/+3 |
| | |||||
* | datacite: adjust tests | Martin Czygan | 2019-12-28 | 1 | -2/+1 |
| | |||||
* | datacite: treat untyped names as people | Martin Czygan | 2019-12-28 | 1 | -1/+1 |
| | |||||
* | datacite: include container_name top level key in extra | Martin Czygan | 2019-12-28 | 1 | -7/+21 |
| | |||||
* | datacite: use clean on field values | Martin Czygan | 2019-12-28 | 1 | -2/+28 |
| | |||||
* | datacite: include doi in error messages | Martin Czygan | 2019-12-28 | 1 | -8/+8 |
| | |||||
* | remove langcodes dependency | Martin Czygan | 2019-12-28 | 2 | -15/+0 |
| | |||||
* | datacite: limit abstract length | Martin Czygan | 2019-12-28 | 1 | -0/+6 |
| | |||||
* | datacite: use iso 639-1 codes | Martin Czygan | 2019-12-28 | 1 | -7/+4 |
| | |||||
* | datacite: use specific auth var | Martin Czygan | 2019-12-28 | 1 | -1/+1 |
| | |||||
* | datacite: add missing --extid-map-file flag | Martin Czygan | 2019-12-28 | 1 | -0/+4 |
| | |||||
* | address first round of MR14 comments | Martin Czygan | 2019-12-28 | 4 | -150/+503 |
| | Add missing langdetect; use entity_to_dict for JSON debug output; factor out code for fields into a function and add table-driven tests; update citeproc types; add author as default role; add raw_affiliation; include relations from datacite; remove url (covered by doi already). Using yapf for Python formatting. | ||||
* | datacite: move common date patterns out of the loop | Martin Czygan | 2019-12-28 | 1 | -3/+4 |
| | Additionally, try the unspecific (`%Y`) pattern last. | ||||
* | improve datacite field mapping and import | Martin Czygan | 2019-12-28 | 5 | -59/+245 |
| | The current version successfully imported a random sample of 100000 records (0.5%) from datacite. The `--debug` (write JSON to stdout) and `--insert-log-file` (log batch before committing to db) flags are temporarily added to help debugging. Add a few unit tests. Some edge cases: a) Keys that exist but have no value require a slightly awkward idiom: `titles = attributes.get('titles', []) or []`. b) There can be 0, 1, or more titles (first one wins). c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates. The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource lists one more date (2019-06-05 10:14:43 by WIEWS update). | ||||
* | datacite: add missing mappings and notes | Martin Czygan | 2019-12-28 | 1 | -266/+175 |
| | |||||
* | datacite: basic field mappings | Martin Czygan | 2019-12-28 | 1 | -41/+181 |
| | Currently using two external libraries: dateparser and langcodes. Note: this commit includes lots of WIP docs and field stats in comments, which should be removed. | ||||
* | datacite: importer skeleton | Martin Czygan | 2019-12-28 | 4 | -0/+514 |
| | Fields: contributors, title, date, publisher, container, license. Field and value analysis via https://github.com/miku/indigo. | ||||
* | orcid: skip non-person ORCID records | Bryan Newbold | 2019-12-26 | 1 | -0/+4 |
| | |||||
* | datacite: fix harvest test | Martin Czygan | 2019-12-27 | 1 | -1/+1 |
| | The number of produced messages should match: `jq '.data\|length' tests/files/datacite_api.json` | ||||
* | datacite: add simple test and fixture for datacite api interaction | Martin Czygan | 2019-12-27 | 2 | -0/+46 |
| | |||||
* | datacite: extend range search query | Martin Czygan | 2019-12-27 | 1 | -1/+1 |
| | The bracket syntax is inclusive. See also: https://www.elastic.co/guide/en/elasticsearch/reference/7.5/query-dsl-query-string-query.html#_ranges | ||||
* | avoid usage of short links | Martin Czygan | 2019-12-27 | 1 | -2/+2 |
| | |||||
* | Datacite API v2 throws 400, which we currently cannot recover from | Martin Czygan | 2019-12-27 | 1 | -0/+4 |
| | As a first iteration, just mark the daily batch complete and continue. The occasional HTTP 400 issue has been reported as https://github.com/datacite/datacite/issues/897. A possible improvement would be to shrink the window, so losses will be smaller. | ||||
* | datacite: update documentation, add links to issues | Martin Czygan | 2019-12-27 | 1 | -10/+5 |
| | |||||
* | datacite: use v2 of the API (flaky) | Martin Czygan | 2019-12-27 | 1 | -5/+28 |
| | Update parameters for datacite API v2. Works fine, but there are occasional HTTP 400 responses when using the cursor API (daily updates can exceed the 10000 record limit for search queries). The HTTP 400 issue is not solved yet, but reported to datacite as https://github.com/datacite/datacite/issues/897. | ||||
* | transform ingests via pmc/pmcid, not pubmed/pmid | Bryan Newbold | 2019-12-24 | 1 | -4/+4 |
| | |||||
* | allow arabesque backfill ingests for some source types | Bryan Newbold | 2019-12-24 | 1 | -0/+5 |
| | |||||
* | make chocula URL updates more conservative | Bryan Newbold | 2019-12-24 | 1 | -5/+5 |
| | |||||
* | pubmed: if doing update, also do subtitle schema update | Bryan Newbold | 2019-12-23 | 1 | -1/+9 |
| | |||||
* | doi parsing fixes | Bryan Newbold | 2019-12-23 | 1 | -0/+7 |
| | Replace emdash with regular dash. Replace double slash after partner ID with single slash. This conversion seems to be done by crossref automatically on lookup. I tried several examples, using doi.org resolver and Crossref API lookup. Note that there are a number of fatcat entities with '//' in the DOI. | ||||
* | pubmed: improve warning and stderr formatting | Bryan Newbold | 2019-12-23 | 1 | -5/+6 |
| | |||||
* | pubmed: use standard identifier cleaners | Bryan Newbold | 2019-12-23 | 1 | -17/+14 |
| | |||||
* | pubmed: remove unused extid mapping code | Bryan Newbold | 2019-12-23 | 1 | -29/+0 |
| | |||||
* | pubmed: do reference lookups by default | Bryan Newbold | 2019-12-23 | 1 | -1/+1 |
| | |||||
* | normalizers: clean_pmid(), and handle nulls in all other cleaners | Bryan Newbold | 2019-12-23 | 1 | -0/+31 |
| |
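
Several commits above deal with DOI cleanup: replacing en and em dashes, collapsing a double slash after the prefix, skipping non-ASCII DOIs, and working around `str.isascii` being available only from Python 3.7. The following is a minimal sketch of how those steps could fit together, not the project's actual code. The names `clean_doi_sketch`, `is_ascii` and `DOI_PATTERN` are hypothetical, and the regular expression is a simplified assumption about the DOI shape.

```python
import re
from typing import Optional

# Assumed, simplified DOI shape; the pattern the importer really checks
# against is not shown in the log.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def is_ascii(s: str) -> bool:
    """Return True if s contains only ASCII characters.

    str.isascii() exists only on Python 3.7+, so fall back to an
    encode attempt on older interpreters.
    """
    try:
        return s.isascii()
    except AttributeError:
        try:
            s.encode("ascii")
        except UnicodeEncodeError:
            return False
        return True


def clean_doi_sketch(raw: Optional[str]) -> Optional[str]:
    """Hypothetical cleaner: return a normalized DOI, or None to skip it."""
    if not raw:
        return None
    doi = raw.strip()
    # Replace EN DASH and EM DASH with a regular hyphen.
    doi = doi.replace("\u2013", "-").replace("\u2014", "-")
    # Collapse a double slash directly after the prefix into a single slash.
    doi = re.sub(r"^(10\.\d+)//", r"\1/", doi)
    # Skip DOIs that still contain non-ASCII characters.
    if not is_ascii(doi):
        return None
    # Skip anything that does not match the assumed DOI shape.
    if not DOI_PATTERN.match(doi):
        return None
    return doi
```

With this sketch, the en dash example from the log is normalized to a hyphen, while the non-ASCII example is skipped by returning `None`. Returning `None` rather than raising keeps the decision to skip a record with the caller.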