path: root/python/tests/files
Commit message [Author, Age, Files, Lines]
...
* datacite: clean abstracts, use unknown value tokens [Martin Czygan, 2020-01-06, 3 files, -3/+3]
    Datacite defines placeholders for unknown values:
    * https://support.datacite.org/docs/schema-values-unknown-information-v43
    Clean abstracts.
* datacite: always include "datacite" key in extra [Martin Czygan, 2020-01-04, 14 files, -26/+26]
    > always include extra values for the respective DOI registrars (datacite, crossref, jalc), even if they are empty ({}), to be used as a flag so we know which DOI registrar supplied the metadata.
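The convention quoted above can be sketched as follows. This is only an illustration, not the importer's code: the helper name `build_extra` and its parameters are hypothetical.

```python
def build_extra(registrar, registrar_fields=None):
    """Return an `extra` dict that always carries the registrar key.

    `registrar` is one of "datacite", "crossref", "jalc" (per the commit
    message); `registrar_fields` may be None or empty. Hypothetical helper.
    """
    # Keep the key even when there is nothing to store under it, so that
    # its presence alone flags which registrar supplied the metadata.
    return {registrar: dict(registrar_fields or {})}
```

With this shape, a consumer can test `"datacite" in extra` without caring whether any registrar-specific fields were recorded.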
* datacite: remove --lang-detect flag [Martin Czygan, 2020-01-03, 5 files, -10/+15]
    Estimated time for a single call is on the order of 50 ms.
* datacite: add another test case [Martin Czygan, 2020-01-02, 2 files, -0/+70]
* datacite: open case for editing after creation [Martin Czygan, 2020-01-02, 1 file, -0/+2]
* datacite: add helper script to create new test case [Martin Czygan, 2020-01-02, 1 file, -0/+14]
* datacite: address raw_name index form comment [Martin Czygan, 2020-01-02, 19 files, -111/+111]
    > The convention for display_name and raw_name is to be how the name would normally be printed, not in index form (surname comma given_name). So we might need to un-encode names like "Tricart, Pierre".

    Use an additional `index_form_to_display_name` function to convert index form to display form, heuristically.
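A minimal sketch of such a heuristic is shown below. The function name comes from the commit message, but the body is an assumption made for illustration and is not the repository's implementation: it treats a string with exactly one comma as index form and leaves everything else untouched.

```python
def index_form_to_display_name(name):
    """Convert "Surname, Given" to "Given Surname"; return other input as-is.

    Assumed heuristic (not taken from the repository): only strings with
    exactly one comma and two non-empty parts are rewritten.
    """
    # Zero commas: already display form. Several commas: too ambiguous
    # (could be an organisation or a list), so do not guess.
    if name.count(",") != 1:
        return name
    surname, given = (part.strip() for part in name.split(","))
    # A dangling comma such as "Tricart," carries no given name to move.
    if not surname or not given:
        return name
    return "{} {}".format(given, surname)
```

For example, `index_form_to_display_name("Tricart, Pierre")` yields `"Pierre Tricart"`, while a name without a comma passes through unchanged.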
* datacite: add conversion fixtures [Martin Czygan, 2020-01-02, 49 files, -0/+3924]
    The `test_datacite_conversions` function will compare an input (datacite) document to an expected output (release entity as JSON). This way, it should not be too hard to add more cases by adding: input, output - and by increasing the counter in the range loop within the test.

    To view input and result side by side with vim, change into the test directory and run:

    tests/files/datacite $ ./caseview.sh 18
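The fixture-comparison pattern described above might look roughly like this. It is a sketch under stated assumptions: the file names `input_NN.json` and `expected_NN.json`, the helper `compare_fixtures`, and the `convert` callable are all hypothetical and do not reflect the actual layout of `tests/files/datacite`.

```python
import json
import os


def compare_fixtures(directory, convert, count):
    """Return the indices of fixture pairs whose conversion does not match.

    Assumes (hypothetically) that case i is stored as input_{i:02d}.json
    and expected_{i:02d}.json inside `directory`.
    """
    mismatches = []
    # Adding a case means adding one input file, one expected file,
    # and raising `count` by one.
    for i in range(count):
        with open(os.path.join(directory, "input_{:02d}.json".format(i))) as f:
            document = json.load(f)
        with open(os.path.join(directory, "expected_{:02d}.json".format(i))) as f:
            expected = json.load(f)
        if convert(document) != expected:
            mismatches.append(i)
    return mismatches
```

In a test, asserting that the returned list is empty reports every failing case number at once rather than stopping at the first mismatch.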
* improve datacite field mapping and import [Martin Czygan, 2019-12-28, 2 files, -0/+1]
    The current version successfully imported a random sample of 100000 records (0.5%) from datacite.

    The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporarily added to help debugging.

    Add a few unit tests.

    Some edge cases:

    a) Existing keys without a value require a slightly awkward idiom:

    ```
    titles = attributes.get('titles', []) or []
    ```

    b) There can be 0, 1, or more (first one wins) titles.

    c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates.

    The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource has yet another date (2019-06-05 10:14:43 by WIEWS update).
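Edge cases (a) and (b) can be illustrated with a short sketch. The helper name `first_title` is hypothetical, and the assumption that each entry in `titles` is a dict with a `title` key is made for the example only.

```python
def first_title(attributes):
    """Return the first title string, or None if there is none.

    Hypothetical helper; assumes each entry in `titles` is a dict
    with a "title" key.
    """
    # The .get() default only applies when the key is missing. A key that
    # is present with value None would slip through, hence the `or []`.
    titles = attributes.get('titles', []) or []
    if not titles:
        return None
    # With several titles present, the first one wins.
    return titles[0].get('title')
```

Here `first_title({'titles': None})` returns `None` instead of raising, which is the case the `or []` guards against.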
* datacite: add simple test and fixture for datacite api interaction [Martin Czygan, 2019-12-27, 1 file, -0/+1]
* add regression test for medlinedate -> year parsing [Bryan Newbold, 2019-12-23, 1 file, -0/+95]
* add basic test for crossref harvest API call [Bryan Newbold, 2019-12-06, 1 file, -0/+1]
* ingest file result importer [Bryan Newbold, 2019-11-15, 1 file, -0/+1]
* release elasticsearch results: stage not status [Bryan Newbold, 2019-06-13, 1 file, -1/+1]
* JALC bulk file importer [Bryan Newbold, 2019-05-21, 1 file, -0/+100]
* basic JALC XML DOI metadata parser [Bryan Newbold, 2019-05-21, 1 file, -0/+176]
* basic JSTOR XML parser [Bryan Newbold, 2019-05-21, 1 file, -0/+58]
* basic arxivraw XML parser [Bryan Newbold, 2019-05-21, 1 file, -0/+31]
* basic pubmed parser [Bryan Newbold, 2019-05-21, 1 file, -0/+36822]
* fix releases/release_ids in math_universe.json test file [Bryan Newbold, 2019-05-20, 1 file, -1/+1]
* importer code updates [Bryan Newbold, 2019-05-13, 1 file, -1/+1]
* update example release JSON to new schema (ext_id, release_stage) [Bryan Newbold, 2019-05-13, 2 files, -11/+11]
* arabesque import tests [Bryan Newbold, 2019-04-18, 2 files, -0/+10]
* many web test improvements [Bryan Newbold, 2019-04-04, 2 files, -0/+2]
* more integration of transform refactor [Bryan Newbold, 2019-03-11, 1 file, -0/+10]
* crossref import tweaks/fixes [Bryan Newbold, 2019-01-29, 1 file, -0/+1]
    - refs: article-title not title; save unstructured; authors not author
    - save 'language' field (already an ISO code)
* fix matched test vector [Bryan Newbold, 2019-01-28, 1 file, -1/+1]
    this was resulting in a collision with default/example database objects.
* update journal meta import/transform [Bryan Newbold, 2019-01-25, 2 files, -10/+20]
* tweak crossref import, and update tests [Bryan Newbold, 2019-01-24, 1 file, -4/+20]
* allow importing contrib/refs lists [Bryan Newbold, 2019-01-24, 1 file, -0/+0]
    The motivation here isn't really to support these gigantic lists on principle, but to be able to ingest large corpuses without having to decide whether to filter out or crop such lists.
* crossref importer updates [Bryan Newbold, 2019-01-22, 1 file, -1/+1]
* fix file extraction (and transforms) [Bryan Newbold, 2018-11-26, 1 file, -0/+1]
* improvements to grobid_metadata importer [Bryan Newbold, 2018-09-27, 1 file, -0/+10]
    But still fails tests due to database collision/side-effect on sha1 lookup.
* more python example files [Bryan Newbold, 2018-09-22, 2 files, -0/+424]
* more matched tests [Bryan Newbold, 2018-09-14, 1 file, -0/+10]
* switch manifest importer to be json-based [Bryan Newbold, 2018-09-14, 1 file, -3/+3]
* fixes to matched importer (and a test) [Bryan Newbold, 2018-09-14, 1 file, -0/+3]
* extid support for crossref importer [Bryan Newbold, 2018-09-12, 1 file, -0/+0]
* fix python import of ORCIDs ending 'X' [Bryan Newbold, 2018-09-10, 1 file, -0/+1]
* improve handling of invalid identifiers [Bryan Newbold, 2018-08-15, 2 files, -0/+2]
* ISSN importer [Bryan Newbold, 2018-06-21, 1 file, -0/+10]
* importer tests and fixes [Bryan Newbold, 2018-06-20, 2 files, -10/+10]
* more progress on crossref+orcid importers [Bryan Newbold, 2018-06-20, 2 files, -0/+13]
* basic ORCID importer [Bryan Newbold, 2018-06-09, 1 file, -0/+1]
* move python code to subdirectory [Bryan Newbold, 2018-05-16, 1 file, -0/+10]