path: root/python/fatcat_tools/importers/datacite.py
Commit message  [Author, Age, Files, Lines]
* datacite importer: skip container_id for some repository sources  [Bryan Newbold, 2022-02-09, 1 file, -0/+34]
|
* refactor importer metadata tables into separate file; move some helpers around  [Bryan Newbold, 2021-11-10, 1 file, -145/+10]
|
|     - MAX_ABSTRACT_LENGTH set in a single place (importer common)
|     - merge datacite license slug table into common table, removing some
|       TDM-specific licenses (which do not apply in the context of preserving
|       the full work)
|
* importers: refactor imports of clean() and other normalization helpers  [Bryan Newbold, 2021-11-10, 1 file, -12/+12]
|
* datacite import: store less subject metadata  [Bryan Newbold, 2021-11-10, 1 file, -1/+7]
|
|     Many of these 'subject' objects have the equivalent of several lines of
|     text, with complex URLs that don't compress well. I think it is fine we
|     have included these thus far instead of parsing more deeply, but going
|     forward I don't think this nested 'extra' metadata is worth the database
|     space.
|
* remove deprecated extid sqlite3 lookup table feature from importers  [Bryan Newbold, 2021-11-09, 1 file, -54/+0]
|
|     This was used during initial bulk imports, but is no longer used and could
|     create serious metadata problems if used accidentally. In retrospect, it
|     also made metadata provenance less transparent, and may have done more
|     harm than good overall.
|
* datacite importer: remove unused 'year_only' variable  [Bryan Newbold, 2021-11-03, 1 file, -2/+3]
|
* datacite: add comment about potential date parsing bug  [Bryan Newbold, 2021-11-03, 1 file, -0/+1]
|
* datacite importer: dateparser.date.DateDataParser()  [Bryan Newbold, 2021-11-03, 1 file, -1/+1]
|
|     Perhaps this was a change when upgrading 'dateparser'?
|
* more involved type wrangling and fixes for importers  [Bryan Newbold, 2021-11-03, 1 file, -2/+3]
|
* typing: relatively simple type check fixes  [Bryan Newbold, 2021-11-03, 1 file, -8/+10]
|
|     These mostly add new variable names so that existing variables aren't
|     overwritten with a new type; delay coercing '{}' or '[]' to 'None' until
|     the last minute; adding is-not-None checks to conditional clauses; and
|     similar small changes.
|
* typing: initial annotations on importers  [Bryan Newbold, 2021-11-03, 1 file, -30/+59]
|
|     This commit just adds the type annotations, doesn't do fixes to code to
|     make type checking pass.
|
* fmt (black): fatcat_tools/  [Bryan Newbold, 2021-11-02, 1 file, -380/+444]
|
* python: isort everything  [Bryan Newbold, 2021-11-02, 1 file, -1/+1]
|
* lint: simple, safe inline lint fixes  [Bryan Newbold, 2021-11-02, 1 file, -2/+2]
|
|     '==' vs 'is'; 'not a in b' vs 'a not in b'; etc
|
* datacite: skip empty abstracts  [Martin Czygan, 2021-10-01, 1 file, -1/+4]
|
|     Do not add abstracts where `clean` results in the empty string - this
|     violates a constraint: `either abstract_sha1 or content is required`
|
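A minimal sketch of the check this commit message describes. `clean_text()` is a hypothetical stand-in for the importer's `clean` helper, and `build_abstracts()` is an illustrative name; neither is taken from datacite.py.

```python
from typing import Dict, List, Optional


def clean_text(raw: Optional[str]) -> str:
    """Hypothetical stand-in for the importer's clean(): collapse whitespace."""
    if not raw:
        return ""
    return " ".join(raw.split())


def build_abstracts(descriptions: List[Optional[str]]) -> List[Dict[str, str]]:
    """Keep only descriptions that are still non-empty after cleaning."""
    abstracts = []
    for raw in descriptions:
        text = clean_text(raw)
        if not text:
            # An abstract with empty content would violate the constraint
            # quoted in the commit message, so it is skipped.
            continue
        abstracts.append({"content": text})
    return abstracts
```

Whitespace-only and missing descriptions are dropped before any abstract object is built, so the constraint is never reached.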
* datacite: more careful title string access; fixes sentry #88350  [Martin Czygan, 2021-06-11, 1 file, -1/+1]
|
|     Caused by a partial "title entry without title" coming *first* (e.g. one
|     holding just a language, like: {'lang': 'da'})
|
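One way to guard against such a partial entry, shown as a sketch only. The actual one-line fix is not reproduced on this page, and `first_title()` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Optional


def first_title(titles: List[Dict[str, Any]]) -> Optional[str]:
    """Return the first non-empty 'title' value, skipping partial entries."""
    for entry in titles:
        # .get() avoids a KeyError on entries such as {'lang': 'da'}
        title = entry.get("title")
        if isinstance(title, str) and title.strip():
            return title.strip()
    return None
```

With `[{'lang': 'da'}, {'title': '...'}]` the language-only entry is passed over and the second entry supplies the title.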
* datacite: a missing surname should be None, not the empty string  [Martin Czygan, 2021-04-02, 1 file, -2/+1]
|
|     refs sentry #77700
|
* crossref+datacite: remove confusing early update bail  [Bryan Newbold, 2020-11-20, 1 file, -2/+0]
|
|     Easy to miss that we skip updates *twice*, and with this early bailout
|     were not updating counts correctly.
|
* refactor: white/black -> allow/block  [Bryan Newbold, 2020-11-05, 1 file, -4/+4]
|
* address spammy datacite titles  [Martin Czygan, 2020-09-23, 1 file, -0/+19]
|
|     seemingly from zenodo:
|
|     * https://fatcat.wiki/release/rzcpjwukobd4pj36ipla22cnoi
|     * https://doi.org/10.5281/zenodo.4041777
|
|     About 3400 records with "FULL MOVIE" in title, currently.
|
* datacite: handle case of empty-string version  [Bryan Newbold, 2020-09-10, 1 file, -1/+1]
|
|     Includes a tiny tweak to the datacite import sample file to test this code
|     path.
|
* datacite import: figshare-specific hacks  [Bryan Newbold, 2020-08-11, 1 file, -3/+3]
|
* datacite import: refactor release_type detection into static method  [Bryan Newbold, 2020-08-11, 1 file, -14/+51]
|
* datacite import: refactor publisher-specific hacks into static method  [Bryan Newbold, 2020-08-11, 1 file, -15/+29]
|
|     Also tweak title/publisher detection to use DOI prefixes
|
* remove isascii() work around definition in importers/datacite.py  [Bryan Newbold, 2020-07-23, 1 file, -7/+1]
|
|     We are python3.7 now, so this isn't needed.
|
* simple lint (flake8) fixes over python codebase  [Bryan Newbold, 2020-07-23, 1 file, -7/+7]
|
|     These should not have any behavior changes, though a number of exception
|     catches are now more general, and there may be long-tail exceptions getting
|     thrown in these statements.
|
* Merge branch 'martin-datacite-duplicated-author-gh-59' into 'master'  [bnewbold, 2020-07-11, 1 file, -6/+60]
|\
| |     datacite: address duplicated contributor issue
| |
| |     See merge request webgroup/fatcat!65
| |
| * datacite: resolve formatting issues in tests  [Martin Czygan, 2020-07-10, 1 file, -2/+1]
| |\
| * | datacite: there should be no index gaps  [Martin Czygan, 2020-07-10, 1 file, -2/+8]
| | |
| * | datacite: document contributor types  [Martin Czygan, 2020-07-10, 1 file, -0/+25]
| | |
| * | wip: contrib, GH59  [Martin Czygan, 2020-07-10, 1 file, -16/+22]
| | |
| * | datacite: address duplicated contributor issue  [Martin Czygan, 2020-07-07, 1 file, -0/+16]
| | |
| | |     Use string comparison.
| | |
| | |     * https://fatcat.wiki/release/spjysmrnsrgyzgq6ise5o44rlu/contribs
| | |     * https://api.datacite.org/dois/10.25940/roper-31098406
| | |
* | | datacite: mitigate sentry #44035  [Martin Czygan, 2020-07-10, 1 file, -0/+4]
| |/
|/|
| |     According to sentry, running `c.get('nameIdentifiers', []) or []` on a c
| |     with value:
| |
| |     ```
| |     {'affiliation': [],
| |      'familyName': 'Guidon',
| |      'givenName': 'Manuel',
| |      'nameIdentifiers': {'nameIdentifier': 'https://orcid.org/0000-0003-3543-6683',
| |                          'nameIdentifierScheme': 'ORCID',
| |                          'schemeUri': 'https://orcid.org'},
| |      'nameType': 'Personal'}
| |     ```
| |
| |     results in a string, which I cannot reproduce. The document in question
| |     at: https://api.datacite.org/dois/10.26275/kuw1-fdls seems fine, too.
| |
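In the value quoted above, `nameIdentifiers` is a single object rather than a list. In Python, iterating over a dict yields its keys, which are strings here; whether that is what sentry actually observed is not established by the commit message. The sketch below shows one defensive way to accept either shape. `name_identifiers()` is a hypothetical helper, not the mitigation that was committed.

```python
from typing import Any, Dict, List


def name_identifiers(contributor: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return nameIdentifiers as a list of dicts, whatever shape arrived."""
    value = contributor.get("nameIdentifiers") or []
    if isinstance(value, dict):
        # A single identifier given as a bare object: wrap it in a list.
        # Iterating over the dict directly would yield its string keys.
        return [value]
    if isinstance(value, list):
        # Drop anything in the list that is not an identifier object.
        return [item for item in value if isinstance(item, dict)]
    return []
```

Callers can then loop over the result and read `nameIdentifier` and `nameIdentifierScheme` from each item without first checking the container type.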
* | datacite: fix attribute error  [Martin Czygan, 2020-07-07, 1 file, -1/+1]
| |
| |     refs: #44035
| |
* | lint (flake8) tool python files  [Bryan Newbold, 2020-07-01, 1 file, -2/+0]
|/
* add new license mappings  [Bryan Newbold, 2020-06-30, 1 file, -0/+14]
|
* datacite: improve license mapping  [Martin Czygan, 2020-06-30, 1 file, -9/+15]
|
|     via "missed potential license", refs #58
|
* datacite: hard cast possible date value to string  [Martin Czygan, 2020-06-29, 1 file, -1/+1]
|
* datacite: fix type error  [Martin Czygan, 2020-04-22, 1 file, -1/+3]
|
|     Up to now, we expected the description to be a string or list. Add
|     handling for int as well.
|
|     First appeared: Apr 22 19:58:39.
|
* datacite: fix a raw name constraint violation  [Martin Czygan, 2020-04-20, 1 file, -0/+8]
|
|     It was possible that contribs got added which had no raw name. One example
|     would be a name consisting of whitespace only. This fix adds a final check
|     for this case.
|
* Merge pull request #53 from EdwardBetts/spelling  [bnewbold, 2020-03-27, 1 file, -4/+4]
|\
| |     Correct spelling mistakes
| |
| * Correct spelling mistakes  [Edward Betts, 2020-03-27, 1 file, -4/+4]
| |
* | datacite: nameIdentifier corner case  [Bryan Newbold, 2020-03-26, 1 file, -1/+2]
| |
| |     Works around a bug in production:
| |
| |     AttributeError: 'NoneType' object has no attribute 'replace'
| |     (datacite.py:724)
| |
| |     NOTE: there are no tests for this code path
| |
* | datacite: add year sanity restrictions  [bnewbold, 2020-03-23, 1 file, -0/+7]
|/
|     Example of entities with bogus years:
|
|     https://fatcat.wiki/release/search?q=doi_registrar%3Adatacite+year%3A%3E2100
|
|     We can do a clean-up task, but first need to prevent creation of new bad
|     metadata.
|
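A sketch of what a year sanity restriction could look like. The bounds `MIN_YEAR` and `MAX_YEARS_AHEAD` are assumptions chosen for illustration; the limits the importer actually applies are not stated in the commit message.

```python
import datetime
from typing import Optional

# Assumed bounds for illustration only; not taken from datacite.py.
MIN_YEAR = 1000
MAX_YEARS_AHEAD = 5


def sane_year(year: Optional[int], today: Optional[datetime.date] = None) -> Optional[int]:
    """Return the year if it is plausible, otherwise None."""
    if year is None:
        return None
    today = today or datetime.date.today()
    if year < MIN_YEAR or year > today.year + MAX_YEARS_AHEAD:
        # Bogus value (e.g. a year beyond 2100): drop it rather than store it
        return None
    return year
```

Returning `None` instead of raising lets the rest of the record be imported without a release year.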
* datacite: prevent none  [Martin Czygan, 2020-01-31, 1 file, -1/+1]
|
* datacite: name shall not be None  [Martin Czygan, 2020-01-31, 1 file, -1/+1]
|
* datacite: add exception for https://www.micropublication.org/  [Martin Czygan, 2020-01-31, 1 file, -0/+5]
|
* datacite: do not skip records w/o date  [Martin Czygan, 2020-01-31, 1 file, -2/+1]
|
* datacite: improve docstring  [Martin Czygan, 2020-01-31, 1 file, -4/+4]
|
* datacite: improve date handling and minor tweak  [Martin Czygan, 2020-01-30, 1 file, -19/+42]
|
|     Records from https://www.micropublication.org/ did not have a date in FC,
|     although raw data contained date strings - they were not using the
|     finer-grained "attributes.date" but "attributes.published" and/or
|     "attributes.publicationYear". Support for those fields has been added,
|     including a test case.
|
|     During this test (#30) a processing gap for names became clear (author may
|     have "given_name" and "surname", but no "name"). This bug has been fixed,
|     too.
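The fallback order described above can be sketched as follows. This is a simplified illustration that extracts only a year; the `dates` list layout (objects with a `date` key) is an assumption about the input, and `release_year()` is a hypothetical name rather than the importer's own method.

```python
import re
from typing import Any, Dict, Optional

YEAR_RE = re.compile(r"\d{4}")


def release_year(attributes: Dict[str, Any]) -> Optional[int]:
    """Prefer fine-grained dates, then 'published', then 'publicationYear'."""
    candidates = [
        d.get("date")
        for d in (attributes.get("dates") or [])  # assumed layout
        if isinstance(d, dict)
    ]
    candidates.append(attributes.get("published"))
    candidates.append(attributes.get("publicationYear"))
    for value in candidates:
        if value is None:
            continue
        # Values may arrive as int or str, so cast before matching
        match = YEAR_RE.search(str(value))
        if match:
            return int(match.group(0))
    return None
```

A record with an empty fine-grained date list but a `published` or `publicationYear` value still yields a year, which is the case the commit message reports for micropublication.org records.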