fatcat/python/fatcat_tools/importers, branch v0.3.2

Feed: https://git.bnewbold.net/fatcat/atom?h=v0.3.2
Updated: 2020-04-01T19:02:45Z

pubmed: use untranslated title if translated not available
Author: Bryan Newbold <bnewbold@robocracy.org>
Updated: 2020-04-01T19:02:45Z
Published: 2020-04-01T19:02:43Z
ID: urn:sha1:938d2c5366d80618b839c83baadc9b5c62d10dce

The primary motivation for this change is that fatcat *requires* a non-empty title for each release entity. Pubmed/Medline occasionally indexes just a VernacularTitle with no ArticleTitle for foreign publications, and currently those records don't end up in fatcat at all.

importers: replace newlines in get_text() strings
Author: Bryan Newbold <bnewbold@robocracy.org>
Updated: 2020-04-01T19:02:20Z
Published: 2020-04-01T19:02:20Z
ID: urn:sha1:f77a553350238c8ccc9c3bc0edcf47fb9dd067b3

importers: more string/get_text swaps
Author: Bryan Newbold <bnewbold@robocracy.org>
Updated: 2020-03-29T03:12:58Z
Published: 2020-03-29T03:12:54Z
ID: urn:sha1:6681500eeffe39b7d029a0e0d6b2ed83729f555f

See previous pubmed commit for details.

pubmed: bunch of .get_text() instead of .string
Author: Bryan Newbold <bnewbold@robocracy.org>
Updated: 2020-03-29T03:01:48Z
Published: 2020-03-29T03:01:46Z
ID: urn:sha1:d6af7b7544ddb3b5e7b1f4a0fd76bd9cd5ed9125

Yikes! Apparently when a tag has child tags, .string will return None instead of all the strings. .get_text() returns all of it:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string

I've left things like identifiers as .string, where we expect only a single string inside.
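
The .string versus .get_text() behavior described in the entry above can be reproduced with a small sketch. This is not the importer code: the sample markup, the lowercase tag names (the stdlib "html.parser" backend lowercases them), and the helper clean_text() are assumptions made for illustration. The whitespace collapsing stands in for the newline replacement mentioned in the later commit and may differ from what the importers actually do.

```python
from bs4 import BeautifulSoup


def clean_text(tag):
    """Hypothetical helper: all text under `tag`, with newlines and
    repeated whitespace collapsed to single spaces. Returns None when
    the tag is missing or contains no text."""
    if tag is None:
        return None
    return " ".join(tag.get_text().split()) or None


# Made-up record: the title contains a child tag and a newline.
doc = (
    "<article>"
    "<articletitle>Growth of <i>E. coli</i>\nin culture</articletitle>"
    "<pmid>12345</pmid>"
    "</article>"
)
soup = BeautifulSoup(doc, "html.parser")
title = soup.find("articletitle")
pmid = soup.find("pmid")

print(title.string)       # None: the tag has more than one child
print(clean_text(title))  # Growth of E. coli in culture
print(pmid.string)        # 12345: a single string child, so .string works
```

Keeping .string for identifier-like fields, as the commit message does, works because those tags hold exactly one string child.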

Merge pull request #53 from EdwardBetts/spelling
Author: bnewbold <bnewbold@archive.org>
Updated: 2020-03-27T23:50:08Z
Published: 2020-03-27T23:50:08Z
ID: urn:sha1:98abe2e751187aa7c2e751b355ffb56d9b1f8c6a

Correct spelling mistakes

Correct spelling mistakes
Author: Edward Betts <edward@4angle.com>
Updated: 2020-03-27T21:25:54Z
Published: 2020-03-27T21:25:54Z
ID: urn:sha1:94710b2803780ab16fb30b79010f8e27cf115512

datacite: nameIdentifier corner case
Author: Bryan Newbold <bnewbold@robocracy.org>
Updated: 2020-03-26T21:09:15Z
Published: 2020-03-26T20:58:32Z
ID: urn:sha1:ec82404f0d0ad6b92491a1cb90a823d421857348

Works around a bug in production:

AttributeError: 'NoneType' object has no attribute 'replace' (datacite.py:724)

NOTE: there are no tests for this code path

jalc: avoid meaningless pages values
Author: Bryan Newbold <bnewbold@robocracy.org>
Updated: 2020-03-23T21:22:30Z
Published: 2020-03-23T21:22:30Z
ID: urn:sha1:786c19220a88df89535bba79123b80cde1da2931

datacite: add year sanity restrictions
Author: bnewbold <bnewbold@archive.org>
Updated: 2020-03-23T16:37:08Z
Published: 2020-03-23T16:37:08Z
ID: urn:sha1:8af9df9fff925c90f2bfb52c4a2b2ea918b4eda2

Example of entities with bogus years:

https://fatcat.wiki/release/search?q=doi_registrar%3Adatacite+year%3A%3E2100

We can do a clean-up task, but first need to prevent creation of new bad metadata.

pubmed: handle multiple ReferenceList
Author: Bryan Newbold <bnewbold@robocracy.org>
Updated: 2020-03-20T20:00:52Z
Published: 2020-03-20T20:00:50Z
ID: urn:sha1:a6f74183dd1cf1eaa44f7edeb98dbc5dc737dabb

This resolves a situation noticed in prod where we were only importing/updating a single reference per article. Includes a regression test.
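
The multiple-ReferenceList situation in the last entry above can be illustrated with a short sketch. It assumes the cause was a lookup that only saw the first ReferenceList element, which the commit message does not spell out. The sample markup and the function extract_citations() are hypothetical, and the tag names are lowercase because the stdlib "html.parser" backend lowercases them.

```python
from bs4 import BeautifulSoup


def extract_citations(soup):
    """Hypothetical helper: collect citation text from every
    <referencelist> in the record, not only the first one.
    Assumes the lists are siblings rather than nested."""
    citations = []
    for ref_list in soup.find_all("referencelist"):
        for ref in ref_list.find_all("reference"):
            cite = ref.find("citation")
            if cite is not None:
                citations.append(" ".join(cite.get_text().split()))
    return citations


# Made-up record with two separate reference lists.
doc = (
    "<pubmeddata>"
    "<referencelist><reference><citation>First ref.</citation></reference></referencelist>"
    "<referencelist>"
    "<reference><citation>Second ref.</citation></reference>"
    "<reference><citation>Third ref.</citation></reference>"
    "</referencelist>"
    "</pubmeddata>"
)
soup = BeautifulSoup(doc, "html.parser")

# find() stops at the first list, so later references are missed.
only_first = soup.find("referencelist").find_all("reference")
print(len(only_first))          # 1
print(extract_citations(soup))  # ['First ref.', 'Second ref.', 'Third ref.']
```

A regression test along the lines the commit mentions would feed in a record with more than one list and assert that all references come back.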