path: root/python/fatcat_tools/importers/pubmed.py
Commit message (author, date, files changed, lines removed/added)
* Merge branch 'bnewbold-import-refactors' into 'master' (bnewbold, 2021-11-11, 1 file, -313/+11)
|\
| |   import refactors and deprecations
| |
| |   Some of these are from old stale branches (the datacite subject metadata patch), but most are from yesterday and today. Sort of a hodge-podge, but the general theme is getting around to deferred cleanups and refactors specific to importer code before making some behavioral changes.
| |
| |   The Datacite-specific stuff could use review here.
| |
| |   Remove unused/deprecated/dead code:
| |   - cdl_dash_dat and wayback_static importers, which were for specific early example entities and have been superseded by other importers
| |   - "extid map" sqlite3 feature from several importers, was only used for initial bulk imports (and maybe should not have been used)
| |
| |   Refactors:
| |   - moved a number of large datastructures out of importer code and into a dedicated static file (`biblio_lookup_tables.py`). Didn't move all, just the ones that were either generic or very large (making it hard to read code)
| |   - shuffled around relative imports and some function names ("clean_str" vs. "clean")
| |
| |   Some actual behavioral changes:
| |   - remove some Datacite-specific license slugs
| |   - stop trying to fix double-slashes in DOIs, that was causing more harm than help (some DOIs do actually have double-slashes!)
| |   - remove some excess metadata from datacite 'extra' fields
| * refactor importer metadata tables into separate file; move some helpers around (Bryan Newbold, 2021-11-10, 1 file, -314/+5)
| |   - MAX_ABSTRACT_LENGTH set in a single place (importer common)
| |   - merge datacite license slug table into common table, removing some TDM-specific licenses (which do not apply in the context of preserving the full work)
| * importers: refactor imports of clean() and other normalization helpers (Bryan Newbold, 2021-11-10, 1 file, -4/+11)
| |
* | pubmed: allow updates if PMCID does not exist yet (Bryan Newbold, 2021-11-10, 1 file, -1/+6)
|/
|     The intent of this change is to start updating Pubmed metadata records when a PMCID has been assigned, but that ext_id hasn't been recorded in fatcat yet.
|
|     It is likely that this change will result in some additional duplicate PMCIDs in the catalog. But the principle is that the PMID is the primary pubmed identifier, and all records with a PMID should have the PMCID that pubmed indicates, even if there exists another incorrect record.
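
The update rule described in this commit message could be sketched roughly as follows. This is only an illustration under assumptions: `should_update` and its parameters are hypothetical names, and the log does not show the importer's actual code.

```python
from typing import Optional


def should_update(existing_pmcid: Optional[str], incoming_pmcid: Optional[str]) -> bool:
    """Hypothetical helper: decide whether an existing release needs an update.

    Assumption: the existing record was already matched by PMID, so the only
    question is whether pubmed now reports a PMCID the catalog lacks.
    """
    # Update only when pubmed supplies a PMCID and none is recorded yet.
    return bool(incoming_pmcid) and not existing_pmcid
```

Under this sketch a record that already carries a PMCID is left alone, which matches the "does not exist yet" wording of the commit subject.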
* typing: relatively simple type check fixes (Bryan Newbold, 2021-11-03, 1 file, -17/+7)
|     These mostly add new variable names so that existing variables aren't overwritten with a new type; delay coercing '{}' or '[]' to 'None' until the last minute; adding is-not-None checks to conditional clauses; and similar small changes.
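
A minimal sketch of the patterns this commit message lists, not taken from the importer itself. The function `build_extra` and the `languages` key are made up for illustration.

```python
from typing import Any, Dict, List, Optional


def build_extra(raw_langs: Optional[str]) -> Optional[Dict[str, Any]]:
    """Hypothetical example of type-checker-friendly restructuring."""
    # Bind the parsed value to a new name instead of rebinding `raw_langs`
    # from str to list, and guard with an explicit is-not-None check.
    langs: List[str] = []
    if raw_langs is not None:
        langs = [part for part in raw_langs.split(",") if part]

    # `extra` stays a dict for the whole function body...
    extra: Dict[str, Any] = {}
    if langs:
        extra["languages"] = langs

    # ...and the empty dict is coerced to None only at the last minute.
    return extra or None
```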
* typing: initial annotations on importers (Bryan Newbold, 2021-11-03, 1 file, -10/+17)
|     This commit just adds the type annotations, doesn't do fixes to code to make type checking pass.
* importers: remove unused __main__ routine (Bryan Newbold, 2021-11-03, 1 file, -5/+0)
|     These perhaps were used in initial development or testing? fatcat_import.py is the correct way to do these imports, even for testing/development.
* fmt (black): fatcat_tools/ (Bryan Newbold, 2021-11-02, 1 file, -158/+197)
|
* lint: simple, safe inline lint fixes (Bryan Newbold, 2021-11-02, 1 file, -1/+1)
|     '==' vs 'is'; 'not a in b' vs 'a not in b'; etc
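
For readers unfamiliar with these lint rules, a small illustrative example follows. The function and variable names are hypothetical, not from the importer.

```python
from typing import Optional, Set


def needs_lookup(pmid: Optional[str], seen: Set[str]) -> bool:
    """Hypothetical helper showing the two idioms the linter prefers."""
    # Compare against the None singleton by identity ('is not None'),
    # and write membership tests as 'a not in b' rather than 'not a in b'.
    return pmid is not None and pmid not in seen
```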
* lint/fmt: remove all 'import *' (Bryan Newbold, 2021-11-02, 1 file, -5/+7)
|
* python: partial importer utilization of new schema changes (Bryan Newbold, 2021-10-13, 1 file, -3/+9)
|
* fix issnl typo in pubmed (Bryan Newbold, 2020-07-23, 1 file, -1/+1)
|     Oh no! This bug may actually have had significant negative impact on metadata in fatcat, in terms of missing container_id associations with pubmed entities.
|
|     There are about 500k release entities with a PMID but no container_id. Of those, 89k have at least a container_name. Unclear how many would have matched to ISSN-L and thus to a container.
* lint (flake8) tool python files (Bryan Newbold, 2020-07-01, 1 file, -4/+2)
|
* importers: clarify handling of ApiException (Bryan Newbold, 2020-05-22, 1 file, -0/+1)
|     One of these (in ingest importer pipeline) is an actual bug, the others are just changing the syntax to be more explicit/conservative.
|
|     The ingest importer bug seems to have resulted in some bad file match imports; scale of impact is unknown.
* pubmed: use untranslated title if translated not available (Bryan Newbold, 2020-04-01, 1 file, -0/+6)
|     The primary motivation for this change is that fatcat *requires* a non-empty title for each release entity. Pubmed/Medline occasionally indexes just a VernacularTitle with no ArticleTitle for foreign publications, and currently those records don't end up in fatcat at all.
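
The fallback described above might look something like this sketch. `pick_title` is a hypothetical helper operating on already-extracted strings; how the importer actually reads `ArticleTitle` and `VernacularTitle` is not shown in this log.

```python
from typing import Optional


def pick_title(article_title: Optional[str], vernacular_title: Optional[str]) -> Optional[str]:
    """Hypothetical helper: prefer the translated title, else the original one."""
    for candidate in (article_title, vernacular_title):
        # Treat None, empty, and whitespace-only values as missing.
        if candidate and candidate.strip():
            return candidate.strip()
    # No usable title at all; the caller would have to skip the record.
    return None
```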
* importers: replace newlines in get_text() strings (Bryan Newbold, 2020-04-01, 1 file, -5/+7)
|
* pubmed: bunch of .get_text() instead of .string (Bryan Newbold, 2020-03-28, 1 file, -12/+12)
|     Yikes! Apparently when a tag has child tags, .string will return None instead of all the strings. .get_text() returns all of it:
|
|     https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
|     https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string
|
|     I've left things like identifiers as .string, when we expect only a single string inside.
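
The commit above concerns BeautifulSoup. To keep the example dependency-free, the sketch below demonstrates an analogous pitfall with the standard library's `xml.etree.ElementTree` instead (a deliberate substitution, not the importer's code): `.text` yields only the text before the first child element, while `itertext()` walks all nested text. The sample XML is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample: a title containing inline markup, as pubmed titles often do.
title = ET.fromstring("<ArticleTitle>Effects of <i>E. coli</i> on growth</ArticleTitle>")

# .text stops at the first child tag, so the title is silently truncated.
partial = title.text

# itertext() yields every text fragment in document order, nested tags included.
full = "".join(title.itertext())

print(repr(partial))  # 'Effects of '
print(repr(full))     # 'Effects of E. coli on growth'
```

With BeautifulSoup the symptom differs slightly (per the commit message, `.string` returns None rather than a partial string), but the remedy is the same idea: use the accessor that collects all descendant text.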
* pubmed: handle multiple ReferenceList (Bryan Newbold, 2020-03-20, 1 file, -1/+4)
|     This resolves a situation noticed in prod where we were only importing/updating a single reference per article. Includes a regression test.
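
As a rough illustration of handling more than one `ReferenceList` element, here is a sketch using the standard library's `xml.etree.ElementTree` rather than the BeautifulSoup calls the importer uses. The sample XML layout and the `collect_citations` helper are assumptions for demonstration, and the log does not say exactly what the original bug was.

```python
import xml.etree.ElementTree as ET
from typing import List

# Invented sample with two sibling ReferenceList elements.
SAMPLE = """<PubmedData>
  <ReferenceList><Reference><Citation>First ref</Citation></Reference></ReferenceList>
  <ReferenceList><Reference><Citation>Second ref</Citation></Reference></ReferenceList>
</PubmedData>"""


def collect_citations(pubmed_data: ET.Element) -> List[str]:
    """Hypothetical helper: gather citations from every ReferenceList."""
    citations: List[str] = []
    # findall() returns all matching children; find() would return only the first.
    for ref_list in pubmed_data.findall("ReferenceList"):
        for ref in ref_list.findall("Reference"):
            citation = ref.find("Citation")
            if citation is not None:
                citations.append("".join(citation.itertext()).strip())
    return citations
```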
* pubmed: update many more metadata fields (Bryan Newbold, 2020-03-19, 1 file, -0/+22)
|     In particular, with daily updates in most cases the DOI will be registered first, then the entity updated with PMID when that is available. Often the pubmed metadata will be more complete, with abstracts etc, and we'll want those improvements.
* importers: control update behavior with more-standard flag (Bryan Newbold, 2020-01-06, 1 file, -0/+4)
|
* pubmed: if doing update, also do subtitle schema update (Bryan Newbold, 2019-12-23, 1 file, -1/+9)
|
* pubmed: improve warning and stderr formatting (Bryan Newbold, 2019-12-23, 1 file, -5/+6)
|
* pubmed: use standard identifier cleaners (Bryan Newbold, 2019-12-23, 1 file, -17/+14)
|
* pubmed: remove unused extid mapping code (Bryan Newbold, 2019-12-23, 1 file, -29/+0)
|
* pubmed: do reference lookups by default (Bryan Newbold, 2019-12-23, 1 file, -1/+1)
|
* pubmed: null doi parsing check (Bryan Newbold, 2019-12-23, 1 file, -1/+1)
|
* add basic MedlineDate year parsing (Bryan Newbold, 2019-12-23, 1 file, -0/+11)
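
A minimal sketch of what basic year extraction from a free-form `MedlineDate` string could look like. The regex, the accepted year range, and the `medline_date_year` name are assumptions; the commit's actual 11 added lines are not shown in this log.

```python
import re
from typing import Optional

# Assumption: accept the first standalone four-digit year from 1500 to 2099.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def medline_date_year(medline_date: Optional[str]) -> Optional[int]:
    """Hypothetical helper: pull a year out of strings like '2000 Nov-Dec'."""
    if not medline_date:
        return None
    match = YEAR_RE.search(medline_date)
    return int(match.group(1)) if match else None
```

For a range such as "1998 Dec-1999 Jan" this sketch returns the first year it finds; whether the real importer prefers the first or last year is not stated here.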
|
* refactor all python source for client lib name (Bryan Newbold, 2019-09-05, 1 file, -16/+16)
|
* more pubmed importer fixes (Bryan Newbold, 2019-06-03, 1 file, -6/+13)
|
* yet another pubmed weird DOI corner case (Bryan Newbold, 2019-05-29, 1 file, -1/+1)
|
* handle pubmed CollectiveName null-ness (Bryan Newbold, 2019-05-29, 1 file, -1/+1)
|
* handle empty retraction_of.PMID in pubmed importer (Bryan Newbold, 2019-05-29, 1 file, -2/+4)
|
* more MARC languages, and less verbose reporting (Bryan Newbold, 2019-05-24, 1 file, -1/+1)
|
* pubmed DOIs need strip() (Bryan Newbold, 2019-05-22, 1 file, -1/+1)
|
* pubmed: try to work around multi-edits (Bryan Newbold, 2019-05-22, 1 file, -3/+13)
|
* more strict pubmed DOI handling (Bryan Newbold, 2019-05-22, 1 file, -1/+3)
|
* more pubmed checks; handle PMID/DOI mismatch differently (Bryan Newbold, 2019-05-22, 1 file, -2/+7)
|
* all new importers need to set contrib index (order) (Bryan Newbold, 2019-05-22, 1 file, -0/+4)
|
* pubmed importer command and tweaks (Bryan Newbold, 2019-05-22, 1 file, -9/+227)
|
* importers: create containers by default (Bryan Newbold, 2019-05-21, 1 file, -1/+2)
|
* updates to pubmed importer (Bryan Newbold, 2019-05-21, 1 file, -32/+60)
|
* fix lint issue in pubmed importer (Bryan Newbold, 2019-05-21, 1 file, -1/+1)
|
* tweaks to new imports/tests (Bryan Newbold, 2019-05-21, 1 file, -6/+4)
|
* initial pubmed importer (Bryan Newbold, 2019-05-21, 1 file, -0/+512)