Commit message  Author  Age  Files  Lines
...
| * datacite: limit abstract length  Martin Czygan  2019-12-28  1  -0/+6
| |
| * datacite: use iso 639-1 codes  Martin Czygan  2019-12-28  1  -7/+4
| |
| * datacite: use specific auth var  Martin Czygan  2019-12-28  1  -1/+1
| |
| * datacite: add missing --extid-map-file flag  Martin Czygan  2019-12-28  1  -0/+4
| |
| * address first round of MR14 comments  Martin Czygan  2019-12-28  4  -150/+503
| |
| |     * add missing langdetect
| |     * use entity_to_dict for json debug output
| |     * factor out code for fields in function and add table driven tests
| |     * update citeproc types
| |     * add author as default role
| |     * add raw_affiliation
| |     * include relations from datacite
| |     * remove url (covered by doi already)
| |
| |     Using yapf for python formatting.
| |
| * datacite: move common date patterns out of the loop  Martin Czygan  2019-12-28  1  -3/+4
| |
| |     Additionally, try the unspecific (%Y) pattern last.
| |
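The pattern handling described in this commit could look roughly like the sketch below. The pattern list and the function name `parse_date` are illustrative assumptions, not the importer's actual code; the point is only that the list is defined once, outside any per-record loop, with the year-only pattern last.

```python
import datetime

# Hypothetical pattern list, defined once at module level rather than
# inside a per-record loop. The unspecific year-only pattern comes last.
DATE_PATTERNS = (
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%d",
    "%Y-%m",
    "%Y",
)


def parse_date(value):
    """Return (date, pattern) for the first pattern that parses `value`.

    Returns (None, None) if no pattern matches.
    """
    for pattern in DATE_PATTERNS:
        try:
            parsed = datetime.datetime.strptime(value, pattern)
        except ValueError:
            # This pattern does not fit; try the next, less specific one.
            continue
        return parsed.date(), pattern
    return None, None
```

Returning the matched pattern alongside the date lets a caller tell a full date from a bare year, which matters when only the year should be stored.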
| * improve datacite field mapping and import  Martin Czygan  2019-12-28  5  -59/+245
| |
| |     The current version successfully imported a random sample of 100000 records
| |     (0.5%) from datacite.
| |
| |     The --debug (write JSON to stdout) and --insert-log-file (log batch before
| |     committing to db) flags are temporarily added to help debugging.
| |
| |     Add a few unit tests.
| |
| |     Some edge cases:
| |
| |     a) Existing keys without a value require a slightly awkward idiom:
| |
| |     ```
| |     titles = attributes.get('titles', []) or []
| |     ```
| |
| |     b) There can be 0, 1, or more titles (the first one wins).
| |
| |     c) Date handling is probably not ideal. Datacite has a potentially
| |     fine-grained list of dates.
| |
| |     The test case (tests/files/datacite_sample.jsonl) refers to
| |     https://ssl.fao.org/glis/doi/10.18730/8DYM9, which has date (main
| |     descriptor) 1986. The datacite record contains: 2017 (publicationYear,
| |     probably the year of record creation with reference system), 1978-06-03
| |     (collected, e.g. experimental sample), 1986 ("Accepted"). The online version
| |     of the resource knows even one more date (2019-06-05 10:14:43 by WIEWS
| |     update).
| |
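Edge cases (a) and (b) from this commit message can be sketched as follows. The helper name `first_title` and the assumption that each entry in `titles` is a dict with a `title` key are illustrative; the sketch is not the importer's actual code.

```python
def first_title(attributes):
    """Return the first non-empty title from a datacite-style record, or None.

    Assumes (for illustration) that `attributes['titles']` is a list of
    dicts, each carrying a 'title' key.
    """
    # `.get('titles', [])` alone returns None when the key exists with a
    # null value, so the trailing `or []` is still required.
    titles = attributes.get('titles', []) or []
    for entry in titles:
        title = (entry or {}).get('title')
        if title:
            # First one wins; later titles are ignored.
            return title
    return None
```

A record with `{"titles": null}` and a record with no `titles` key at all both fall through to `None` without raising.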
| * datacite: add missing mappings and notes  Martin Czygan  2019-12-28  1  -266/+175
| |
| * datacite: basic field mappings  Martin Czygan  2019-12-28  1  -41/+181
| |
| |     Currently using two external libraries:
| |
| |     * dateparser
| |     * langcodes
| |
| |     Note: This commit includes lots of WIP docs and field stats in comments,
| |     which should be removed.
| |
| * datacite: importer skeleton  Martin Czygan  2019-12-28  4  -0/+514
| |
| |     * contributors, title, date, publisher, container, license
| |
| |     Field and value analysis via https://github.com/miku/indigo.
| |
* | 2019-01-07 status update  Bryan Newbold  2020-01-07  2  -0/+36
| |
* | chocula bulk edit note  Bryan Newbold  2020-01-07  2  -0/+15
| |
* | importers: control update behavior with more-standard flag  Bryan Newbold  2020-01-06  6  -3/+15
| |
* | proposals: standardize a bit  Bryan Newbold  2020-01-03  9  -3/+34
| |
* | notes on search query parsing (WIP)  Bryan Newbold  2020-01-03  1  -0/+22
| |
* | fatcat identifiers proposal (WIP)  Bryan Newbold  2020-01-03  1  -0/+25
| |
* | proposal: python3.7 upgrade  Bryan Newbold  2020-01-03  1  -0/+101
| |
* | pipenv: update pytest to 5.x; remove langcodes  Bryan Newbold  2020-01-03  2  -108/+85
| |
| |     pytest has been pinned to the 4.x series to work around a test import package
| |     mangling problem with citeproc_styles. Now that pytest.ini explicitly lists
| |     test files, this seems to no longer be a problem and pytest can be updated to
| |     the most recent version.
| |
| |     Also re-locked Pipfile.lock with updated dependencies (only minor changes).
| |
* | pytest: explicitly indicate all in-scope test files  Bryan Newbold  2020-01-03  1  -3/+1
| |
| |     The purpose of this change is to avoid test errors when pytest tries to
| |     recursively update assertion statements in all dependent packages. The reason
| |     pytest does this is to add pretty printing, which is nice, but probably
| |     shouldn't be done in all dependency libraries.
| |
| |     This fixes test problems with both CSL (citeproc_styles) and dateparser (when
| |     actually imported in code, which currently on master does not happen).
| |
* | scholix schema links/proposal  Bryan Newbold  2020-01-03  1  -0/+3
| |
* | update bulk edit CHANGELOG and orcid notes  Bryan Newbold  2019-12-31  2  -13/+49
| |
* | Merge branch 'martin-guide-entity-release-fix' into 'master'  bnewbold  2019-12-31  1  -5/+5
|\ \
| |/
|/|
| |     remove duplicate fields in entity release
| |
| |     See merge request webgroup/fatcat!11
| |
| * document year and date of withdrawn release  Martin Czygan  2019-12-17  1  -1/+5
| |
| * remove duplicate fields in entity release  Martin Czygan  2019-12-17  1  -4/+0
| |
* | bulk edit updates  Bryan Newbold  2019-12-26  1  -3/+4
| |
* | orcid: skip non-person ORCID records  Bryan Newbold  2019-12-26  1  -0/+4
| |
* | Merge branch 'martin-datacite-daily-harvest' into 'master'  Martin Czygan  2019-12-26  3  -5/+73
|\ \
| | |
| | |     Datacite daily harvest
| | |
| | |     See merge request webgroup/fatcat!6
| | |
| * | datacite: fix harvest test  Martin Czygan  2019-12-27  1  -1/+1
| | |
| | |     Produced messages should match:
| | |
| | |         jq '.data|length' tests/files/datacite_api.json
| | |
| * | datacite: add simple test and fixture for datacite api interaction  Martin Czygan  2019-12-27  2  -0/+46
| | |
| * | datacite: extend range search query  Martin Czygan  2019-12-27  1  -1/+1
| | |
| | |     The bracket syntax is inclusive.
| | |
| | |     See also: https://www.elastic.co/guide/en/elasticsearch/reference/7.5/query-dsl-query-string-query.html#_ranges
| | |
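As an illustration of the inclusive bracket syntax, a one-day range clause in query-string form could be built as below. The function name `day_range_query` and the field name `updated` are assumptions made for this sketch and are not taken from the harvester code.

```python
import datetime


def day_range_query(day, field="updated"):
    """Build an inclusive query-string range clause covering one UTC day.

    Square brackets include both endpoints (curly braces would exclude
    them), so the range runs from the first to the last millisecond.
    The default field name is a placeholder.
    """
    start = day.strftime("%Y-%m-%dT00:00:00.000Z")
    end = day.strftime("%Y-%m-%dT23:59:59.999Z")
    return "{}:[{} TO {}]".format(field, start, end)
```

For example, `day_range_query(datetime.date(2019, 12, 26))` yields `updated:[2019-12-26T00:00:00.000Z TO 2019-12-26T23:59:59.999Z]`.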
| * | avoid usage of short links  Martin Czygan  2019-12-27  1  -2/+2
| | |
| * | Datacite API v2 throws 400, we cannot recover from, currently.  Martin Czygan  2019-12-27  1  -0/+4
| | |
| | |     As a first iteration, just mark the daily batch complete and continue.
| | |
| | |     The occasional HTTP 400 issue has been reported as
| | |     https://github.com/datacite/datacite/issues/897.
| | |
| | |     A possible improvement would be to shrink the window, so losses will be
| | |     smaller.
| | |
| * | datacite: update documentation, add links to issues  Martin Czygan  2019-12-27  1  -10/+5
| | |
| * | datacite: use v2 of the API (flaky)  Martin Czygan  2019-12-27  1  -5/+28
|/ /
| |
| |     Update parameters for datacite API v2.
| |
| |     Works fine, but there are occasional HTTP 400 responses when using the cursor
| |     API (daily updates can exceed the 10000 record limit for search queries).
| |
| |     The HTTP 400 issue is not solved yet, but reported to datacite as
| |     https://github.com/datacite/datacite/issues/897.
| |
* | transform ingests via pmc/pmcid, not pubmed/pmid  Bryan Newbold  2019-12-24  1  -4/+4
| |
* | allow arabesque backfill ingests for some source types  Bryan Newbold  2019-12-24  1  -0/+5
| |
* | make chocula URL updates more conservative  Bryan Newbold  2019-12-24  1  -5/+5
| |
* | pubmed: if doing update, also do subtitle schema update  Bryan Newbold  2019-12-23  1  -1/+9
| |
* | doi parsing fixes  Bryan Newbold  2019-12-23  1  -0/+7
| |
| |     Replace emdash with regular dash.
| |
| |     Replace double slash after partner ID with single slash. This conversion seems
| |     to be done by crossref automatically on lookup. I tried several examples,
| |     using doi.org resolver and Crossref API lookup. Note that there are a number
| |     of fatcat entities with '//' in the DOI.
| |
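The two DOI fixes described in this commit can be sketched as below. The function name `clean_doi_sketch` is hypothetical and the code is an illustration of the described behavior, not the normalizer that ships in fatcat.

```python
def clean_doi_sketch(raw):
    """Apply the two fixes from the commit message to a raw DOI string.

    Illustrative only: returns None for empty input, otherwise the DOI
    with em dashes replaced and a double slash after the prefix collapsed.
    """
    if not raw:
        return None
    doi = raw.strip()
    # Replace the em dash (U+2014) with a regular ASCII dash.
    doi = doi.replace("\u2014", "-")
    # Collapse a double slash directly after the prefix ("partner ID"),
    # e.g. "10.1234//abc" becomes "10.1234/abc".
    prefix, sep, suffix = doi.partition("/")
    if sep and suffix.startswith("/"):
        doi = prefix + "/" + suffix[1:]
    return doi
```

Only the slash immediately after the prefix is touched; slashes later in the suffix are left alone, since they can be a legitimate part of a DOI.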
* | pubmed bulk import notes (from QA)  Bryan Newbold  2019-12-23  1  -0/+45
| |
* | pubmed: improve warning and stderr formatting  Bryan Newbold  2019-12-23  1  -5/+6
| |
* | pubmed: use standard identifier cleaners  Bryan Newbold  2019-12-23  1  -17/+14
| |
* | pubmed: remove unused extid mapping code  Bryan Newbold  2019-12-23  1  -29/+0
| |
* | pubmed: do reference lookups by default  Bryan Newbold  2019-12-23  1  -1/+1
| |
* | normalizers: clean_pmid(), and handle nulls in all other cleaners  Bryan Newbold  2019-12-23  1  -0/+31
| |
* | pubmed: null doi parsing check  Bryan Newbold  2019-12-23  1  -1/+1
| |
* | add basic MedlineDate year parsing  Bryan Newbold  2019-12-23  1  -0/+11
| |
* | add regression test for medlinedate -> year parsing  Bryan Newbold  2019-12-23  2  -0/+102
| |
* | arxiv bulk update notes  Bryan Newbold  2019-12-22  2  -2/+49
| |
* | fix spn/ingest importer duplication check  Bryan Newbold  2019-12-22  1  -6/+8
| |
| |     Check was happening after the `return True` by mistake, allowing duplicates in
| |     SPN editgroups, and potentially in ingest request editgroups as well.
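The class of bug described here, a check placed after an unconditional `return True`, can be illustrated with the simplified sketch below. Both function names and the `seen_urls` argument are hypothetical stand-ins and do not reflect the importer's real signature.

```python
def want_buggy(seen_urls, url):
    """Buggy shape: the duplicate check sits after the return."""
    return True
    if url in seen_urls:  # unreachable, so duplicates are never rejected
        return False


def want_fixed(seen_urls, url):
    """Fixed shape: reject duplicates before accepting the record."""
    if url in seen_urls:
        return False
    return True
```

Python accepts unreachable statements after `return` without complaint, which is why this kind of mistake tends to surface only through duplicated data or a linter warning.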