path: root/python/tests/import_datacite.py
Commit message | Author | Age | Files | Lines
* update datacite tests for license slug changes | Bryan Newbold | 2021-11-10 | 1 | -6/+6
| Use datacite-specific wrapper function, and remove a couple non-OA/TDM-limited licenses.
* remove deprecated extid sqlite3 lookup table feature from importers | Bryan Newbold | 2021-11-09 | 1 | -2/+0
| This was used during initial bulk imports, but is no longer used and could create serious metadata problems if used accidentally. In retrospect, it also made metadata provenance less transparent, and may have done more harm than good overall.
* fmt (black): tests/ | Bryan Newbold | 2021-11-02 | 1 | -27/+47
|
* python: isort everything | Bryan Newbold | 2021-11-02 | 1 | -9/+8
|
* lint: simple, safe inline lint fixes | Bryan Newbold | 2021-11-02 | 1 | -9/+9
| '==' vs 'is'; 'not a in b' vs 'a not in b'; etc
* datacite: skip empty abstracts | Martin Czygan | 2021-10-01 | 1 | -1/+1
| Do not add abstracts where `clean` results in the empty string; adding them violates a constraint: `either abstract_sha1 or content is required`
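The check described above can be sketched as follows. This is a minimal illustration, not the importer's code: the `clean` function here is a hypothetical stand-in that only trims whitespace, and `collect_abstracts` and the shape of the abstract dicts are assumptions.

```python
def clean(text):
    """Hypothetical stand-in for the importer's `clean` helper: trims whitespace."""
    return (text or "").strip()


def collect_abstracts(descriptions):
    """Keep only abstracts whose cleaned content is non-empty."""
    abstracts = []
    for raw in descriptions:
        content = clean(raw)
        if not content:
            # Empty content (and no sha1) would violate the
            # "either abstract_sha1 or content is required" constraint.
            continue
        abstracts.append({"content": content})
    return abstracts
```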
* datacite: more careful title string access; fixes sentry #88350 | Martin Czygan | 2021-06-11 | 1 | -1/+1
| Caused by a partial "title entry without title" coming *first* (e.g. an entry holding only a language, like: {'lang': 'da'})
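A defensive way to read titles in this situation might look like the sketch below. The helper name `first_title` is hypothetical, and skipping ahead to the next usable entry is an assumption; the commit message does not say how the importer treats such records.

```python
def first_title(titles):
    """Return the first non-empty title string, or None if there is none."""
    for entry in titles or []:
        # An entry may lack the "title" key entirely, e.g. {'lang': 'da'},
        # so use .get() instead of indexing with entry["title"].
        title = (entry.get("title") or "").strip()
        if title:
            return title
    return None
```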
* address spammy datacite titles | Martin Czygan | 2020-09-23 | 1 | -0/+6
| seemingly from zenodo:
|
| * https://fatcat.wiki/release/rzcpjwukobd4pj36ipla22cnoi
| * https://doi.org/10.5281/zenodo.4041777
|
| About 3400 records with "FULL MOVIE" in title, currently.
* datacite: handle case of empty-string version | Bryan Newbold | 2020-09-10 | 1 | -0/+1
| Includes a tiny tweak to the datacite import sample file to test this code path.
* datacite: resolve formatting issues in tests | Martin Czygan | 2020-07-10 | 1 | -2/+5
|\
| * lint (flake8) python test files | Bryan Newbold | 2020-07-01 | 1 | -20/+22
| |
* | wip: contrib, GH59 | Martin Czygan | 2020-07-10 | 1 | -229/+361
| |
* | datacite: address duplicated contributor issue | Martin Czygan | 2020-07-07 | 1 | -1/+1
|/
| Use string comparison.
|
| * https://fatcat.wiki/release/spjysmrnsrgyzgq6ise5o44rlu/contribs
| * https://api.datacite.org/dois/10.25940/roper-31098406
* datacite: improve license mapping | Martin Czygan | 2020-06-30 | 1 | -0/+14
| via "missed potential license", refs #58
* datacite: hard cast possible date value to string | Martin Czygan | 2020-06-29 | 1 | -0/+1
|
* datacite: fix type error | Martin Czygan | 2020-04-22 | 1 | -1/+1
| Up to now, we expected the description to be a string or list. Add handling for int as well. First appeared: Apr 22 19:58:39.
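One way to tolerate all three shapes is to normalize the value to a list of strings before further processing. The sketch below assumes that casting an int to its string form is acceptable; the helper name `as_text_list` is hypothetical, and the commit message does not say how the importer actually handles ints.

```python
def as_text_list(value):
    """Normalize a description value (str, int, list, or None) to a list of strings."""
    if value is None:
        return []
    if isinstance(value, list):
        # Cast each element, since list items may themselves be non-strings.
        return [str(item) for item in value]
    # Covers both str and int (str() on a str returns it unchanged).
    return [str(value)]
```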
* datacite: fix a raw name constraint violation | Martin Czygan | 2020-04-20 | 1 | -1/+1
| It was possible that contribs got added which had no raw name. One example would be a name consisting of whitespace only. This fix adds a final check for this case.
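A final check of this kind could be sketched as below. The function name `drop_nameless_contribs` and the dict-based contrib shape are assumptions for illustration, and stripping whitespace from the names that are kept is an extra choice not stated in the commit message.

```python
def drop_nameless_contribs(contribs):
    """Drop contribs whose raw_name is missing, empty, or whitespace only."""
    kept = []
    for contrib in contribs:
        raw_name = (contrib.get("raw_name") or "").strip()
        if not raw_name:
            # A whitespace-only name counts as "no raw name".
            continue
        kept.append({**contrib, "raw_name": raw_name})
    return kept
```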
* datacite: improve date handling and minor tweak | Martin Czygan | 2020-01-30 | 1 | -2/+1
| Records from https://www.micropublication.org/ did not have a date in FC, although raw data contained date strings - they were not using the finer-grained "attributes.date" but "attributes.published" and/or "attributes.publicationYear".
|
| Support for those fields has been added, including a test case.
|
| During this test (#30) a processing gap for names became clear (author may have "given_name" and "surname", but no "name"). This bug has been fixed, too.
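The fallback to the coarser fields might be sketched like this. The function name `fallback_date`, the order in which the two fields are tried, and the accepted string formats are assumptions; only the field names "published" and "publicationYear" come from the commit message.

```python
import datetime


def fallback_date(attributes):
    """Return (release_date, release_year) from coarse fields, or (None, None)."""
    published = attributes.get("published")
    if published:
        text = str(published)
        try:
            # Full dates such as "2019-03-01" yield both a date and a year.
            parsed = datetime.datetime.strptime(text, "%Y-%m-%d").date()
            return parsed, parsed.year
        except ValueError:
            pass
        # Otherwise fall back to a leading four-digit year, e.g. "2019".
        if len(text) >= 4 and text[:4].isdigit():
            return None, int(text[:4])
    year = attributes.get("publicationYear")
    if year is not None and str(year).isdigit():
        return None, int(year)
    return None, None
```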
* datacite: add entry to license slug map | Martin Czygan | 2020-01-09 | 1 | -0/+1
|
* datacite: ignore known unknown values in resourceType* | Martin Czygan | 2020-01-09 | 1 | -1/+1
|
* datacite: abstracts may be strings or list of strings | Martin Czygan | 2020-01-09 | 1 | -1/+1
|
* datacite: improve license_slug handling | Martin Czygan | 2020-01-09 | 1 | -1/+30
|
* datacite: factor out contributor handling | Martin Czygan | 2020-01-08 | 1 | -2/+2
| Use values from:
|
| * attributes.creators[]
| * attributes.contributors[]
* datacite: mark additional files as stub | Martin Czygan | 2020-01-08 | 1 | -1/+1
|
* datacite: indicate mismatched file in test | Martin Czygan | 2020-01-06 | 1 | -1/+1
|
* datacite: use normal.clean_doi | Martin Czygan | 2020-01-03 | 1 | -4/+0
|
* datacite: parse_datacite_dates returns month | Martin Czygan | 2020-01-03 | 1 | -7/+16
| As [...] we will soon add support for release_month field in the release schema.
* datacite: prepare release_month (stub) | Martin Czygan | 2020-01-03 | 1 | -14/+14
|
* datacite: add another test case | Martin Czygan | 2020-01-02 | 1 | -1/+1
|
* datacite: address raw_name index form comment | Martin Czygan | 2020-01-02 | 1 | -1/+17
| > The convention for display_name and raw_name is to be how the name would normally be printed, not in index form (surname comma given_name). So we might need to un-encode names like "Tricart, Pierre".
|
| Use an additional `index_form_to_display_name` function to convert index form to display form, heuristically.
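To illustrate the idea, a heuristic of this kind could look like the sketch below. It reuses the function name from the commit message, but the body is a guess at one plausible heuristic and not the project's actual implementation; in particular, leaving names with zero or several commas untouched is an assumption.

```python
def index_form_to_display_name(name):
    """Heuristically turn "Surname, Given" into "Given Surname"."""
    # Only handle the simple case of exactly one comma; names without a
    # comma, or with several, are returned unchanged.
    if name.count(",") != 1:
        return name
    surname, given = (part.strip() for part in name.split(","))
    if not surname or not given:
        return name
    return f"{given} {surname}"
```

Under these assumptions, "Tricart, Pierre" becomes "Pierre Tricart", while "Pierre Tricart" passes through unchanged.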
* datacite: add conversion fixtures | Martin Czygan | 2020-01-02 | 1 | -1/+25
| The `test_datacite_conversions` function will compare an input (datacite) document to an expected output (release entity as JSON). This way, it should not be too hard to add more cases by adding: input, output - and by increasing the counter in the range loop within the test.
|
| To view input and result side by side with vim, change into the test directory and run:
|
|     tests/files/datacite $ ./caseview.sh 18
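The fixture-driven pattern described above can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the file naming scheme (`case_000_in.json`, `case_000_out.json`), the `convert` stand-in, and the helper names are hypothetical and do not reflect the real test or importer.

```python
import json
import tempfile
from pathlib import Path


def convert(record):
    """Hypothetical stand-in for the importer's conversion step."""
    return {"title": record["attributes"]["titles"][0]["title"]}


def failing_cases(fixture_dir, count):
    """Return the indices whose converted input differs from the expected output."""
    failures = []
    for i in range(count):  # raise `count` when adding a new input/output pair
        record = json.loads((fixture_dir / f"case_{i:03d}_in.json").read_text())
        expected = json.loads((fixture_dir / f"case_{i:03d}_out.json").read_text())
        if convert(record) != expected:
            failures.append(i)
    return failures


def demo():
    """Write one fixture pair to a temporary directory and check it."""
    with tempfile.TemporaryDirectory() as tmp:
        fixture_dir = Path(tmp)
        record = {"attributes": {"titles": [{"title": "A title"}]}}
        (fixture_dir / "case_000_in.json").write_text(json.dumps(record))
        (fixture_dir / "case_000_out.json").write_text(json.dumps({"title": "A title"}))
        return failing_cases(fixture_dir, 1)
```

Adding a case then means dropping in one more input/output pair and bumping the count, which matches the workflow the commit message describes.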
* datacite: adjust tests | Martin Czygan | 2019-12-28 | 1 | -2/+1
|
* address first round of MR14 comments | Martin Czygan | 2019-12-28 | 1 | -2/+176
| * add missing langdetect
| * use entity_to_dict for json debug output
| * factor out code for fields in function and add table driven tests
| * update citeproc types
| * add author as default role
| * add raw_affiliation
| * include relations from datacite
| * remove url (covered by doi already)
|
| Using yapf for python formatting.
* improve datacite field mapping and import | Martin Czygan | 2019-12-28 | 1 | -17/+91
| The current version successfully imported a random sample of 100000 records (0.5%) from datacite.
|
| The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporarily added to help debugging.
|
| Add a few unit tests.
|
| Some edge cases:
|
| a) Existing keys without a value require a slightly awkward idiom: `titles = attributes.get('titles', []) or []`
|
| b) There can be 0, 1, or more (first one wins) titles.
|
| c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates.
|
| The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource knows even one more date (2019-06-05 10:14:43 by WIEWS update).
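Edge case a) comes from how `dict.get` treats a key that exists but holds `None`: the default is only used when the key is missing. A short sketch, where the helper name `get_titles` is hypothetical:

```python
def get_titles(attributes):
    """Return the titles list, treating a missing key and a None value alike."""
    # dict.get only falls back to [] when the key is absent; the trailing
    # `or []` additionally replaces a present-but-None (or empty) value.
    return attributes.get("titles", []) or []


# The key exists, so the default passed to get() is ignored.
present_but_empty = {"titles": None}
assert present_but_empty.get("titles", []) is None
assert get_titles(present_but_empty) == []
```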
* datacite: importer skeleton | Martin Czygan | 2019-12-28 | 1 | -0/+25
  * contributors, title, date, publisher, container, license

  Field and value analysis via https://github.com/miku/indigo.