path: root/python/fatcat_tools/importers/pubmed.py
Commit message | Author | Age | Files | Lines
* pubmed: use untranslated title if translated not available | Bryan Newbold | 2020-04-01 | 1 | -0/+6
    The primary motivation for this change is that fatcat *requires* a non-empty title for each release entity. Pubmed/Medline occasionally indexes just a VernacularTitle with no ArticleTitle for foreign publications, and currently those records don't end up in fatcat at all.
* importers: replace newlines in get_text() strings | Bryan Newbold | 2020-04-01 | 1 | -5/+7
* pubmed: bunch of .get_text() instead of .string | Bryan Newbold | 2020-03-28 | 1 | -12/+12
    Yikes! Apparently when a tag has child tags, .string will return None instead of all the strings. .get_text() returns all of it:
    https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
    https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string
    I've left things like identifiers as .string, when we expect only a single string inside.
* pubmed: handle multiple ReferenceList | Bryan Newbold | 2020-03-20 | 1 | -1/+4
    This resolves a situation noticed in prod where we were only importing/updating a single reference per article. Includes a regression test.
* pubmed: update many more metadata fields | Bryan Newbold | 2020-03-19 | 1 | -0/+22
    In particular, with daily updates in most cases the DOI will be registered first, then the entity updated with PMID when that is available. Often the pubmed metadata will be more complete, with abstracts etc, and we'll want those improvements.
* importers: control update behavior with more-standard flag | Bryan Newbold | 2020-01-06 | 1 | -0/+4
* pubmed: if doing update, also do subtitle schema update | Bryan Newbold | 2019-12-23 | 1 | -1/+9
* pubmed: improve warning and stderr formatting | Bryan Newbold | 2019-12-23 | 1 | -5/+6
* pubmed: use standard identifier cleaners | Bryan Newbold | 2019-12-23 | 1 | -17/+14
* pubmed: remove unused extid mapping code | Bryan Newbold | 2019-12-23 | 1 | -29/+0
* pubmed: do reference lookups by default | Bryan Newbold | 2019-12-23 | 1 | -1/+1
* pubmed: null doi parsing check | Bryan Newbold | 2019-12-23 | 1 | -1/+1
* add basic MedlineDate year parsing | Bryan Newbold | 2019-12-23 | 1 | -0/+11
* refactor all python source for client lib name | Bryan Newbold | 2019-09-05 | 1 | -16/+16
* more pubmed importer fixes | Bryan Newbold | 2019-06-03 | 1 | -6/+13
* yet another pubmed weird DOI corner case | Bryan Newbold | 2019-05-29 | 1 | -1/+1
* handle pubmed CollectiveName null-ness | Bryan Newbold | 2019-05-29 | 1 | -1/+1
* handle empty retraction_of.PMID in pubmed importer | Bryan Newbold | 2019-05-29 | 1 | -2/+4
* more MARC languages, and less verbose reporting | Bryan Newbold | 2019-05-24 | 1 | -1/+1
* pubmed DOIs need strip() | Bryan Newbold | 2019-05-22 | 1 | -1/+1
* pubmed: try to work around multi-edits | Bryan Newbold | 2019-05-22 | 1 | -3/+13
* more strict pubmed DOI handling | Bryan Newbold | 2019-05-22 | 1 | -1/+3
* more pubmed checks; handle PMID/DOI mismatch differently | Bryan Newbold | 2019-05-22 | 1 | -2/+7
* all new importers need to set contrib index (order) | Bryan Newbold | 2019-05-22 | 1 | -0/+4
* pubmed importer command and tweaks | Bryan Newbold | 2019-05-22 | 1 | -9/+227
* importers: create containers by default | Bryan Newbold | 2019-05-21 | 1 | -1/+2
* updates to pubmed importer | Bryan Newbold | 2019-05-21 | 1 | -32/+60
* fix lint issue in pubmed importer | Bryan Newbold | 2019-05-21 | 1 | -1/+1
* tweaks to new imports/tests | Bryan Newbold | 2019-05-21 | 1 | -6/+4
* initial pubmed importer | Bryan Newbold | 2019-05-21 | 1 | -0/+512
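
The two 2020-04-01 commits at the top of this log describe a title fallback (use the untranslated title when no translated one is present) and newline replacement in extracted text. The sketch below is only an illustration of that idea and not the code in pubmed.py: it assumes the standard-library `xml.etree.ElementTree` in place of the BeautifulSoup calls the log mentions, the helper names `flat_text` and `pick_title` are hypothetical, and it collapses all runs of whitespace, which goes slightly beyond replacing newlines.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def flat_text(elem: Optional[ET.Element]) -> Optional[str]:
    """Return all text under elem (child tags included) on one line, or None if empty.

    Hypothetical helper for illustration; not taken from pubmed.py.
    """
    if elem is None:
        return None
    # itertext() also yields text inside nested tags such as <i>...</i>
    text = "".join(elem.itertext())
    # Collapse newlines and other whitespace runs into single spaces
    text = " ".join(text.split())
    return text or None


def pick_title(article: ET.Element) -> Optional[str]:
    """Prefer ArticleTitle; fall back to VernacularTitle if it is missing or empty."""
    return flat_text(article.find("ArticleTitle")) or flat_text(
        article.find("VernacularTitle")
    )


# Made-up record: empty ArticleTitle, VernacularTitle with a child tag and a line break
SAMPLE = """<Article>
  <ArticleTitle></ArticleTitle>
  <VernacularTitle>Estudio de la <i>gripe</i>
    estacional</VernacularTitle>
</Article>"""

article = ET.fromstring(SAMPLE)
title = pick_title(article)
print(title)
```

Returning `None` for an empty element lets the `or` chain express the fallback directly, so a record with only a VernacularTitle still yields a non-empty title.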