path: root/python/tests
Commit message  [Author, Age, Files, Lines]
...
* | | improve citeproc/CSL web interface  [Bryan Newbold, 2020-03-25, 2 files, -13/+53]
|/ /
| |     This tries to show the citeproc (bibtex, MLA, CSL-JSON) options for more releases, and not show the links when they would break.
| |
| |     The primary motivation here is to work around two exceptions being thrown in prod every day (according to sentry):
| |
| |         KeyError: 'role'
| |         ValueError: CSL requires some surname (family name)
| |
| |     I'm guessing these are mostly coming from crawlers following the citeproc links on release landing pages.
| |
* | pubmed: handle multiple ReferenceList  [Bryan Newbold, 2020-03-20, 2 files, -0/+218]
| |
| |     This resolves a situation noticed in prod where we were only importing/updating a single reference per article.
| |
| |     Includes a regression test.
| |
* | Merge branch 'martin-kafka-bs4-import' into 'master'  [Martin Czygan, 2020-03-10, 3 files, -0/+80]
|\ \
| |/
|/|
| |     pubmed and arxiv harvest preparations
| |
| |     See merge request webgroup/fatcat!28
| |
| * pubmed: move mapping generation out of fetch_date  [Martin Czygan, 2020-03-10, 1 file, -0/+2]
| |
| |     * fetch_date will fail on missing mapping
| |     * adjust tests (test will require access to pubmed ftp)
| |
| * more pubmed adjustments  [Martin Czygan, 2020-02-22, 3 files, -0/+78]
| |
| |     * regenerate map in continuous mode
| |     * add tests
| |
* | Merge branch 'bnewbold-elastic-v03b'  [Bryan Newbold, 2020-02-26, 4 files, -4/+61]
|\ \
| * | ES updates: fix tests to accept archive.org in host/domain  [Bryan Newbold, 2020-02-14, 1 file, -2/+3]
| | |
| * | ES releases: host/domain fixes  [Bryan Newbold, 2020-01-31, 1 file, -0/+3]
| | |
| * | implement host+domain parsing for file ES transform  [Bryan Newbold, 2020-01-30, 1 file, -4/+3]
| | |
| * | fix ES file schema plural field names  [Bryan Newbold, 2020-01-29, 1 file, -1/+1]
| | |
| * | actually implement changelog transform  [Bryan Newbold, 2020-01-29, 1 file, -1/+23]
| | |
| * | fix some transform bugs, add some tests  [Bryan Newbold, 2020-01-29, 4 files, -5/+16]
| | |
| * | first implementation of ES file schema  [Bryan Newbold, 2020-01-29, 1 file, -2/+23]
| | |
| | |     Includes a trivial test and transform, but not any workers or doc updates.
| | |
* | | shadow import: more filtering of file_meta fields  [Bryan Newbold, 2020-02-13, 2 files, -18/+18]
| | |
* | | basic shadow importer  [Bryan Newbold, 2020-02-13, 2 files, -0/+71]
| |/
|/|
* | datacite: add exception for https://www.micropublication.org/  [Martin Czygan, 2020-01-31, 1 file, -1/+2]
| |
* | datacite: improve date handling and minor tweak  [Martin Czygan, 2020-01-30, 3 files, -2/+111]
|/
|     Records from https://www.micropublication.org/ did not have a date in FC, although the raw data contained date strings: the records did not use the finer-grained "attributes.date" but "attributes.published" and/or "attributes.publicationYear". Support for those fields has been added, including a test case.
|
|     During this test (#30) a processing gap for names became clear (an author may have "given_name" and "surname", but no "name"). This bug has been fixed, too.
|
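The fallbacks described in this message can be sketched roughly as follows. This is a minimal illustration, not the importer's actual code: the helper names `pick_date_string` and `contributor_name` are hypothetical, and the `dates` list of `{"date": ...}` entries is an assumption about the record layout.

```python
from typing import Optional


def pick_date_string(attributes: dict) -> Optional[str]:
    """Return the first usable date string, preferring finer-grained fields."""
    # Assumption: the finer-grained dates live in a list of {"date": ...} dicts.
    for entry in attributes.get("dates", []) or []:
        value = entry.get("date")
        if value:
            return str(value)
    # Fall back to the coarser fields named in the commit message.
    for key in ("published", "publicationYear"):
        value = attributes.get(key)
        if value:
            return str(value)
    return None


def contributor_name(contrib: dict) -> Optional[str]:
    """Build a display name even when only given_name and surname are set."""
    if contrib.get("name"):
        return contrib["name"]
    parts = [contrib.get("given_name"), contrib.get("surname")]
    joined = " ".join(p for p in parts if p)
    return joined or None
```

Returning `None` rather than an empty string keeps "no date" and "no name" distinguishable from real values further down the pipeline.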
* do not normalize "en dash" in DOI  [Martin Czygan, 2020-01-17, 1 file, -1/+1]
|
|     Technically, [...] DOI names may incorporate any printable characters from the Universal Character Set (UCS-2), of ISO/IEC 10646, which is the character set defined by Unicode (https://www.doi.org/doi_handbook/2_Numbering.html#2.5.1).
|
|     For mostly QA reasons, we currently treat a DOI with an "en dash" as invalid.
|
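As a rough illustration of "reject instead of normalize", the sketch below returns `None` for a DOI containing an en dash rather than rewriting it to a hyphen. The function name `clean_doi_strict` and the other checks are hypothetical and are not the project's own DOI cleaning code.

```python
from typing import Optional

EN_DASH = "\u2013"


def clean_doi_strict(raw: str) -> Optional[str]:
    """Return a trimmed DOI, or None if it looks invalid (hypothetical helper)."""
    doi = raw.strip()
    # Very coarse shape check: a DOI has a "10." prefix and a "/" separator.
    if not doi.startswith("10.") or "/" not in doi:
        return None
    # Do not rewrite the en dash to "-": treat the whole DOI as invalid.
    if EN_DASH in doi:
        return None
    return doi
```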
* ingest: improve tests, support old ingest results  [Bryan Newbold, 2020-01-15, 3 files, -1/+18]
|
* datacite: add entry to license slug map  [Martin Czygan, 2020-01-09, 1 file, -0/+1]
|
* datacite: ignore known unknown values in resourceType*  [Martin Czygan, 2020-01-09, 3 files, -1/+95]
|
* datacite: abstracts may be strings or list of strings  [Martin Czygan, 2020-01-09, 5 files, -1/+187]
|
* datacite: improve license_slug handling  [Martin Czygan, 2020-01-09, 3 files, -2/+33]
|
* datacite: add 'Unknown' to blacklist  [Martin Czygan, 2020-01-09, 1 file, -7/+1]
|
* datacite: get rid of schemaVersion  [Martin Czygan, 2020-01-09, 17 files, -32/+14]
|
* datacite: reformat test cases and use jq . --sort-keys  [Martin Czygan, 2020-01-08, 54 files, -2299/+2301]
|
* datacite: factor out contributor handling  [Martin Czygan, 2020-01-08, 5 files, -2/+107]
|
|     Use values from:
|
|     * attributes.creators[]
|     * attributes.contributors[]
|
* datacite: adjust tests for release_month  [Martin Czygan, 2020-01-08, 12 files, -12/+12]
|
* datacite: mark additional files as stub  [Martin Czygan, 2020-01-08, 3 files, -1/+73]
|
* datacite: CCDC are entries, mostly  [Martin Czygan, 2020-01-08, 1 file, -1/+1]
|
* datacite: adding datacite-specific extra metadata  [Martin Czygan, 2020-01-07, 30 files, -1468/+1570]
|
|     * attributes.metadataVersion
|     * attributes.schemaVersion
|     * attributes.version (source dependent values, follows suggestions in https://schema.datacite.org/meta/kernel-4.3/doc/DataCite-MetadataKernel_v4.3.pdf#page=26, but values vary)
|
|     Furthermore:
|
|     * attributes.types.resourceTypeGeneral
|     * attributes.types.resourceType
|
* datacite: month field should be top-level  [Martin Czygan, 2020-01-06, 11 files, -14/+14]
|
* datacite: include month in extra  [Martin Czygan, 2020-01-06, 11 files, -11/+13]
|
|     > include release_month as a top-level extra field [...] to auto-populate the schema field from that
|
* datacite: indicate mismatched file in test  [Martin Czygan, 2020-01-06, 1 file, -1/+1]
|
* datacite: clean abstracts, use unknown value tokens  [Martin Czygan, 2020-01-06, 3 files, -3/+3]
|
|     Datacite defines placeholders for unknown values:
|
|     * https://support.datacite.org/docs/schema-values-unknown-information-v43
|
|     Clean abstracts.
|
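One way to read "clean abstracts, use unknown value tokens" is sketched below. The token set is an illustrative list that should be checked against the DataCite page linked in this message, the whitespace collapsing is an assumption about what "clean" means here, and `clean_abstract` is a hypothetical name.

```python
import re
from typing import Optional

# Illustrative placeholder tokens; verify against the DataCite documentation.
UNKNOWN_TOKENS = {
    ":unac", ":unal", ":unap", ":unas", ":unav",
    ":unkn", ":none", ":null", ":tba", ":etal",
}


def clean_abstract(text: Optional[str]) -> Optional[str]:
    """Collapse whitespace and drop placeholder-only abstracts."""
    if text is None:
        return None
    cleaned = re.sub(r"\s+", " ", text).strip()
    # Tokens may appear with or without surrounding parentheses.
    if not cleaned or cleaned.strip("()").lower() in UNKNOWN_TOKENS:
        return None
    return cleaned
```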
* datacite: always include "datacite" key in extra  [Martin Czygan, 2020-01-04, 14 files, -26/+26]
|
|     > always include extra values for the respective DOI registrars (datacite, crossref, jalc), even if they are empty ({}), to be used as a flag so we know which DOI registrar supplied the metadata.
|
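The "empty dict as a flag" convention quoted above could look roughly like this. Both helper names are hypothetical; only the registrar keys (datacite, crossref, jalc) come from the message.

```python
from typing import Optional

REGISTRARS = ("datacite", "crossref", "jalc")


def build_extra(registrar: str, registrar_fields: Optional[dict] = None,
                other: Optional[dict] = None) -> dict:
    """Assemble the extra dict, always including the registrar key."""
    extra = dict(other or {})
    # Set the key even when there is nothing to record, so it acts as a flag.
    extra[registrar] = dict(registrar_fields or {})
    return extra


def registrar_of(extra: dict) -> Optional[str]:
    """Tell which DOI registrar supplied the metadata, based on key presence."""
    for name in REGISTRARS:
        if name in extra:
            return name
    return None
```

The lookup tests key presence rather than truthiness, since an empty `{}` is a valid flag value.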
* datacite: use normal.clean_doi  [Martin Czygan, 2020-01-03, 1 file, -4/+0]
|
* datacite: parse_datacite_dates returns month  [Martin Czygan, 2020-01-03, 1 file, -7/+16]
|
|     As [...] we will soon add support for release_month field in the release schema.
|
* datacite: prepare release_month (stub)  [Martin Czygan, 2020-01-03, 1 file, -14/+14]
|
* datacite: remove --lang-detect flag  [Martin Czygan, 2020-01-03, 5 files, -10/+15]
|
|     The estimated time for a single call is on the order of 50 ms.
|
* datacite: add another test case  [Martin Czygan, 2020-01-02, 3 files, -1/+71]
|
* datacite: open case for editing after creation  [Martin Czygan, 2020-01-02, 1 file, -0/+2]
|
* datacite: add helper script to create new test case  [Martin Czygan, 2020-01-02, 1 file, -0/+14]
|
* datacite: address raw_name index form comment  [Martin Czygan, 2020-01-02, 20 files, -112/+128]
|
|     > The convention for display_name and raw_name is to be how the name would normally be printed, not in index form (surname comma given_name). So we might need to un-encode names like "Tricart, Pierre".
|
|     Use an additional `index_form_to_display_name` function to convert index form to display form, heuristically.
|
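A minimal sketch of such a heuristic is shown below. It reuses the function name from this message for readability, but the rule it applies (swap the parts only when there is exactly one comma) is an assumption; the real implementation may handle more cases.

```python
def index_form_to_display_name(name: str) -> str:
    """Turn "Surname, Given" into "Given Surname"; leave other names alone."""
    # Names with no comma or several commas are ambiguous, so keep them as-is.
    if name.count(",") != 1:
        return name
    surname, given = (part.strip() for part in name.split(","))
    if not surname or not given:
        return name
    return f"{given} {surname}"
```

For the example in the quoted comment, `index_form_to_display_name("Tricart, Pierre")` gives `"Pierre Tricart"`, while a name already in display form passes through unchanged.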
* datacite: add conversion fixtures  [Martin Czygan, 2020-01-02, 50 files, -1/+3949]
|
|     The `test_datacite_conversions` function will compare an input (datacite) document to an expected output (release entity as JSON). This way, it should not be too hard to add more cases: add an input file and an expected output file, and increase the counter in the range loop within the test.
|
|     To view input and result side by side with vim, change into the test directory and run:
|
|         tests/files/datacite $ ./caseview.sh 18
|
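The fixture-driven pattern described in this message can be sketched as follows. The file naming scheme (`input_NN.json`, `expected_NN.json`), the `convert` callback, and the function name are hypothetical; the actual test and its fixture names may differ.

```python
import json
import os
import tempfile


def check_conversions(case_dir: str, convert, num_cases: int) -> list:
    """Return the indices of cases whose converted output differs from the fixture."""
    failed = []
    for i in range(num_cases):
        # Hypothetical naming scheme: one input and one expected file per case.
        with open(os.path.join(case_dir, f"input_{i:02d}.json")) as f:
            doc = json.load(f)
        with open(os.path.join(case_dir, f"expected_{i:02d}.json")) as f:
            expected = json.load(f)
        if convert(doc) != expected:
            failed.append(i)
    return failed
```

In a pytest test one would typically assert that the returned list is empty. Adding a case then means adding two files and raising `num_cases` by one.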
* datacite: adjust tests  [Martin Czygan, 2019-12-28, 1 file, -2/+1]
|
* address first round of MR14 comments  [Martin Czygan, 2019-12-28, 1 file, -2/+176]
|
|     * add missing langdetect
|     * use entity_to_dict for json debug output
|     * factor out code for fields in function and add table driven tests
|     * update citeproc types
|     * add author as default role
|     * add raw_affiliation
|     * include relations from datacite
|     * remove url (covered by doi already)
|
|     Using yapf for python formatting.
|
* improve datacite field mapping and import  [Martin Czygan, 2019-12-28, 3 files, -17/+92]
|
|     The current version successfully imported a random sample of 100000 records (0.5%) from datacite.
|
|     The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporarily added to help debugging.
|
|     Add a few unit tests.
|
|     Some edge cases:
|
|     a) Existing keys without a value require a slightly awkward idiom:
|
|     ```
|     titles = attributes.get('titles', []) or []
|     ```
|
|     b) There can be 0, 1, or more titles (the first one wins).
|
|     c) Date handling is probably not ideal. Datacite has a potentially fine-grained list of dates.
|
|     The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource knows even one more date (2019-06-05 10:14:43 by WIEWS update).
|
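Edge cases (a) and (b) come down to how `dict.get` treats a key that exists with a `None` value: the default only applies when the key is missing. The sketch below shows this; `first_title` is a hypothetical helper, and the assumption that each title entry is a dict with a `title` key is not taken from the message.

```python
from typing import Optional


def first_title(attributes: dict) -> Optional[str]:
    """Return the first title, tolerating a missing key or an explicit None."""
    # get() alone would return None for {"titles": None}; "or []" covers that.
    titles = attributes.get("titles", []) or []
    if not titles:
        return None
    # Assumption: each entry is a dict with a "title" key; the first one wins.
    return titles[0].get("title")


# The default passed to get() is ignored when the key exists with value None.
example = {"titles": None}
value_without_guard = example.get("titles", [])
```

Here `value_without_guard` is `None`, not `[]`, which is why iterating over it without the trailing `or []` would raise a `TypeError`.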
* datacite: importer skeleton  [Martin Czygan, 2019-12-28, 1 file, -0/+25]
|
|     * contributors, title, date, publisher, container, license
|
|     Field and value analysis via https://github.com/miku/indigo.
|
* datacite: fix harvest test  [Martin Czygan, 2019-12-27, 1 file, -1/+1]
|
|     Produced messages should match:
|
|         jq '.data|length' tests/files/datacite_api.json
|
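For readers without jq at hand, the same count can be computed in Python. This is a small sketch that only assumes the fixture is a JSON object with a top-level `data` list, as the jq expression implies; `expected_message_count` is a hypothetical name.

```python
import json


def expected_message_count(path: str) -> int:
    """Count the entries under the top-level "data" key, like jq '.data|length'."""
    with open(path) as f:
        payload = json.load(f)
    return len(payload["data"])
```

A harvest test could then compare the number of produced messages against `expected_message_count("tests/files/datacite_api.json")`.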