aboutsummaryrefslogtreecommitdiffstats
path: root/python/tests
Commit message (Collapse)AuthorAgeFilesLines
* catch ApiValueError in some generic API callsBryan Newbold2020-03-251-0/+2
| | | | | | | | | | | | | The motivation for this change is to handle bogus revision IDs in URLs, which were causing 500 errors not 400 errors. Eg: https://qa.fatcat.wiki/file/rev/5d5d5162-b676-4f0a-968f-e19dadeaf96e%2B2019-11-27%2B13:49:51%2B0%2B6 I have no idea where these URLs are actually coming from, but they should be 4xx not 5xx. Investigating made me realize there is a whole category of ApiValueError exceptions we were not catching and should have been.
* pubmed: handle multiple ReferenceListBryan Newbold2020-03-202-0/+218
| | | | | | | This resolves a situation noticed in prod where we were only importing/updating a single reference per article. Includes a regression test.
* Merge branch 'martin-kafka-bs4-import' into 'master'Martin Czygan2020-03-103-0/+80
|\ | | | | | | | | pubmed and arxiv harvest preparations See merge request webgroup/fatcat!28
| * pubmed: move mapping generation out of fetch_dateMartin Czygan2020-03-101-0/+2
| | | | | | | | | | * fetch_date will fail on missing mapping * adjust tests (test will require access to pubmed ftp)
| * more pubmed adjustmentsMartin Czygan2020-02-223-0/+78
| | | | | | | | | | * regenerate map in continuous mode * add tests
* | Merge branch 'bnewbold-elastic-v03b'Bryan Newbold2020-02-264-4/+61
|\ \
| * | ES updates: fix tests to accept archive.org in host/domainBryan Newbold2020-02-141-2/+3
| | |
| * | ES releases: host/domain fixesBryan Newbold2020-01-311-0/+3
| | |
| * | implement host+domain parsing for file ES transformBryan Newbold2020-01-301-4/+3
| | |
| * | fix ES file schema plural field namesBryan Newbold2020-01-291-1/+1
| | |
| * | actually implement changelog transformBryan Newbold2020-01-291-1/+23
| | |
| * | fix some transform bugs, add some testsBryan Newbold2020-01-294-5/+16
| | |
| * | first implementation of ES file schemaBryan Newbold2020-01-291-2/+23
| | | | | | | | | | | | | | | Includes a trivial test and transform, but not any workers or doc updates.
* | | shadow import: more filtering of file_meta fieldsBryan Newbold2020-02-132-18/+18
| | |
* | | basic shadow importerBryan Newbold2020-02-132-0/+71
| |/ |/|
* | datacite: add exception for https://www.micropublication.org/Martin Czygan2020-01-311-1/+2
| |
* | datacite: improve date handling and minor tweakMartin Czygan2020-01-303-2/+111
|/ | | | | | | | | | | | | Records from https://www.micropublication.org/ did not have a date in FC, although raw data contained date strings - they were not using the finer-grained "attributes.date" but "attributes.published" and/or "attributes.publicationYear". Support for those fields has been added, including a test case. During this test (#30) a processing gap for names became clear (author may have "given_name" and "surname", but no "name"). This bug has been fixed, too.
* do not normalize "en dash" in DOIMartin Czygan2020-01-171-1/+1
| | | | | | | | | Technically, [...] DOI names may incorporate any printable characters from the Universal Character Set (UCS-2), of ISO/IEC 10646, which is the character set defined by Unicode (https://www.doi.org/doi_handbook/2_Numbering.html#2.5.1). For mostly QA reasons, we currently treat a DOI with an "en dash" as invalid.
* ingest: improve tests, support old ingest resultsBryan Newbold2020-01-153-1/+18
|
* datacite: add entry to license slug mapMartin Czygan2020-01-091-0/+1
|
* datacite: ignore known unknown values in resourceType*Martin Czygan2020-01-093-1/+95
|
* datacite: abstracts may be strings or list of stringsMartin Czygan2020-01-095-1/+187
|
* datacite: improve license_slug handlingMartin Czygan2020-01-093-2/+33
|
* datacite: add 'Unknown' to blacklistMartin Czygan2020-01-091-7/+1
|
* datacite: get rid of schemaVersionMartin Czygan2020-01-0917-32/+14
|
* datacite: reformat test cases and use jq . --sort-keysMartin Czygan2020-01-0854-2299/+2301
|
* datacite: factor out contributor handlingMartin Czygan2020-01-085-2/+107
| | | | | | | Use values from: * attributes.creators[] * attributes.contributors[]
* datacite: adjust tests for release_monthMartin Czygan2020-01-0812-12/+12
|
* datacite: mark additional files as stubMartin Czygan2020-01-083-1/+73
|
* datacite: CCDC are entries, mostlyMartin Czygan2020-01-081-1/+1
|
* datacite: adding datacite-specific extra metadataMartin Czygan2020-01-0730-1468/+1570
| | | | | | | | | | | | | * attributes.metadataVersion * attributes.schemaVersion * attributes.version (source dependent values, follows suggestions in https://schema.datacite.org/meta/kernel-4.3/doc/DataCite-MetadataKernel_v4.3.pdf#page=26, but values vary) Furthermore: * attributes.types.resourceTypeGeneral * attributes.types.resourceType
* datacite: month field should be top-levelMartin Czygan2020-01-0611-14/+14
|
* datacite: include month in extraMartin Czygan2020-01-0611-11/+13
| | | | | > include release_month as a top-level extra field [...] to auto-populate the schema field from that
* datacite: indicate mismatched file in testMartin Czygan2020-01-061-1/+1
|
* datacite: clean abstracts, use unknown value tokensMartin Czygan2020-01-063-3/+3
| | | | | | | | Datacite defines placeholders for unknown values: * https://support.datacite.org/docs/schema-values-unknown-information-v43 Clean abstracts.
* datacite: always include "datacite" key in extraMartin Czygan2020-01-0414-26/+26
| | | | | | > always include extra values for the respective DOI registrars (datacite, crossref, jalc), even if they are empty ({}), to be used as a flag so we know which DOI registrar supplied the metadata.
* datacite: use normal.clean_doiMartin Czygan2020-01-031-4/+0
|
* datacite: parse_datacite_dates returns monthMartin Czygan2020-01-031-7/+16
| | | | As [...] we will soon add support for release_month field in the release schema.
* datacite: prepare release_month (stub)Martin Czygan2020-01-031-14/+14
|
* datacite: remove --lang-detect flagMartin Czygan2020-01-035-10/+15
| | | | Estimated time for a single call is in the order of 50ms.
* datacite: add another test caseMartin Czygan2020-01-023-1/+71
|
* datacite: open case for editing after creationMartin Czygan2020-01-021-0/+2
|
* datacite: add helper script to create new test caseMartin Czygan2020-01-021-0/+14
|
* datacite: address raw_name index form commentMartin Czygan2020-01-0220-112/+128
| | | | | | | | | > The convention for display_name and raw_name is to be how the name would normally be printed, not in index form (surname comma given_name). So we might need to un-encode names like "Tricart, Pierre". Use an additional `index_form_to_display_name` function to convert index from to display form, heuristically.
* datacite: add conversion fixturesMartin Czygan2020-01-0250-1/+3949
| | | | | | | | | | | | | The `test_datacite_conversions` function will compare an input (datacite) document to an expected output (release entity as JSON). This way, it should not be too hard to add more cases by adding: input, output - and by increasing the counter in the range loop within the test. To view input and result side by side with vim, change into the test directory and run: tests/files/datacite $ ./caseview.sh 18
* datacite: adjust testsMartin Czygan2019-12-281-2/+1
|
* address first round of MR14 commentsMartin Czygan2019-12-281-2/+176
| | | | | | | | | | | | | * add missing langdetect * use entity_to_dict for json debug output * factor out code for fields in function and add table driven tests * update citeproc types * add author as default role * add raw_affiliation * include relations from datacite * remove url (covered by doi already) Using yapf for python formatting.
* improve datacite field mapping and importMartin Czygan2019-12-283-17/+92
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current version succeeded to import a random sample of 100000 records (0.5%) from datacite. The --debug (write JSON to stdout) and --insert-log-file (log batch before committing to db) flags are temporary added to help debugging. Add few unit tests. Some edge cases: a) Existing keys without value requires a slightly awkward: ``` titles = attributes.get('titles', []) or [] ``` b) There can be 0, 1, or more (first one wins) titles. c) Date handling is probably not ideal. Datacite has a potentiall fine grained list of dates. The test case (tests/files/datacite_sample.jsonl) refers to https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main descriptor) 1986. The datacite record contains: 2017 (publicationYear, probably the year of record creation with reference system), 1978-06-03 (collected, e.g. experimental sample), 1986 ("Accepted"). The online version of the resource knows even one more date (2019-06-05 10:14:43 by WIEWS update).
* datacite: importer skeletonMartin Czygan2019-12-281-0/+25
| | | | | | * contributors, title, date, publisher, container, license Field and value analysis via https://github.com/miku/indigo.
* datacite: fix harvest testMartin Czygan2019-12-271-1/+1
| | | | | | Produced messages should match: jq '.data|length' tests/files/datacite_api.json