<feed xmlns='http://www.w3.org/2005/Atom'>
<title>fatcat/python/fatcat_tools/importers, branch v0.3.2</title>
<subtitle>[no description]</subtitle>
<id>https://git.bnewbold.net/fatcat/atom?h=v0.3.2</id>
<link rel='self' href='https://git.bnewbold.net/fatcat/atom?h=v0.3.2'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/'/>
<updated>2020-04-01T19:02:45Z</updated>
<entry>
<title>pubmed: use untranslated title if translated not available</title>
<updated>2020-04-01T19:02:45Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-04-01T19:02:43Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=938d2c5366d80618b839c83baadc9b5c62d10dce'/>
<id>urn:sha1:938d2c5366d80618b839c83baadc9b5c62d10dce</id>
<content type='text'>
The primary motivation for this change is that fatcat *requires* a
non-empty title for each release entity. Pubmed/Medline occasionally
indexes just a VernacularTitle with no ArticleTitle for foreign
publications, and currently those records don't end up in fatcat at all.
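
A minimal sketch of the fallback described above, working on plain strings
rather than parsed XML. The helper name pick_title and its arguments are
hypothetical and not taken from the importer itself:

```python
from typing import Optional


def pick_title(article_title: Optional[str],
               vernacular_title: Optional[str]) -> Optional[str]:
    """Prefer the translated ArticleTitle; fall back to the VernacularTitle.

    Hypothetical helper: returns None when neither value is usable, so the
    caller can still skip records that have no title at all.
    """
    for candidate in (article_title, vernacular_title):
        # Treat None, empty, and whitespace-only values as missing.
        if candidate and candidate.strip():
            return candidate.strip()
    return None
```

With this shape, a record that carries only an untranslated title still yields
a non-empty title instead of being dropped.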
</content>
</entry>
<entry>
<title>importers: replace newlines in get_text() strings</title>
<updated>2020-04-01T19:02:20Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-04-01T19:02:20Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=f77a553350238c8ccc9c3bc0edcf47fb9dd067b3'/>
<id>urn:sha1:f77a553350238c8ccc9c3bc0edcf47fb9dd067b3</id>
<content type='text'>
</content>
</entry>
<entry>
<title>importers: more string/get_text swaps</title>
<updated>2020-03-29T03:12:58Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-29T03:12:54Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=6681500eeffe39b7d029a0e0d6b2ed83729f555f'/>
<id>urn:sha1:6681500eeffe39b7d029a0e0d6b2ed83729f555f</id>
<content type='text'>
See previous pubmed commit for details.
</content>
</entry>
<entry>
<title>pubmed: bunch of .get_text() instead of .string</title>
<updated>2020-03-29T03:01:48Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-29T03:01:46Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=d6af7b7544ddb3b5e7b1f4a0fd76bd9cd5ed9125'/>
<id>urn:sha1:d6af7b7544ddb3b5e7b1f4a0fd76bd9cd5ed9125</id>
<content type='text'>
Yikes! Apparently when a tag has child tags, .string will return None
instead of all the strings. .get_text() returns all of it:

  https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
  https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string

I've left things like identifiers as .string, where we expect only a single
string inside.
</content>
</entry>
<entry>
<title>Merge pull request #53 from EdwardBetts/spelling</title>
<updated>2020-03-27T23:50:08Z</updated>
<author>
<name>bnewbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-03-27T23:50:08Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=98abe2e751187aa7c2e751b355ffb56d9b1f8c6a'/>
<id>urn:sha1:98abe2e751187aa7c2e751b355ffb56d9b1f8c6a</id>
<content type='text'>
Correct spelling mistakes</content>
</entry>
<entry>
<title>Correct spelling mistakes</title>
<updated>2020-03-27T21:25:54Z</updated>
<author>
<name>Edward Betts</name>
<email>edward@4angle.com</email>
</author>
<published>2020-03-27T21:25:54Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=94710b2803780ab16fb30b79010f8e27cf115512'/>
<id>urn:sha1:94710b2803780ab16fb30b79010f8e27cf115512</id>
<content type='text'>
</content>
</entry>
<entry>
<title>datacite: nameIdentifier corner case</title>
<updated>2020-03-26T21:09:15Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-26T20:58:32Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=ec82404f0d0ad6b92491a1cb90a823d421857348'/>
<id>urn:sha1:ec82404f0d0ad6b92491a1cb90a823d421857348</id>
<content type='text'>
Works around a bug in production:

  AttributeError: 'NoneType' object has no attribute 'replace'
  (datacite.py:724)

NOTE: there are no tests for this code path
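
For illustration only, a guard of roughly this shape avoids calling .replace()
on a missing value. The function name, the prefix being stripped, and the
return convention are all assumptions; the actual code at datacite.py:724 is
not shown here:

```python
from typing import Optional

# Assumed prefix, used purely for illustration.
ORCID_PREFIX = "https://orcid.org/"


def clean_name_identifier(value: Optional[str]) -> Optional[str]:
    """Normalize a nameIdentifier value, tolerating None (hypothetical helper)."""
    # A missing or non-string value would otherwise raise AttributeError
    # on .replace(), so bail out early.
    if not isinstance(value, str):
        return None
    cleaned = value.replace(ORCID_PREFIX, "").strip()
    # Collapse empty results to None so callers have one "missing" case.
    return cleaned or None
```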
</content>
</entry>
<entry>
<title>jalc: avoid meaningless pages values</title>
<updated>2020-03-23T21:22:30Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-23T21:22:30Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=786c19220a88df89535bba79123b80cde1da2931'/>
<id>urn:sha1:786c19220a88df89535bba79123b80cde1da2931</id>
<content type='text'>
</content>
</entry>
<entry>
<title>datacite: add year sanity restrictions</title>
<updated>2020-03-23T16:37:08Z</updated>
<author>
<name>bnewbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-03-23T16:37:08Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=8af9df9fff925c90f2bfb52c4a2b2ea918b4eda2'/>
<id>urn:sha1:8af9df9fff925c90f2bfb52c4a2b2ea918b4eda2</id>
<content type='text'>
Example of entities with bogus years:

https://fatcat.wiki/release/search?q=doi_registrar%3Adatacite+year%3A%3E2100

We can do a clean-up task, but first need to prevent creation of new bad
metadata.
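
A sketch of what such a restriction could look like. The bounds (MIN_YEAR and
MAX_YEAR_OFFSET) and the helper name clean_year are assumptions chosen for
illustration, not the limits the importer actually applies:

```python
import datetime
from typing import Optional

# Hypothetical bounds; the real limits are not stated in this message.
MIN_YEAR = 1000
MAX_YEAR_OFFSET = 5


def clean_year(year: Optional[int],
               today: Optional[datetime.date] = None) -> Optional[int]:
    """Return the year if it looks plausible, otherwise None."""
    if year is None:
        return None
    today = today or datetime.date.today()
    max_year = today.year + MAX_YEAR_OFFSET
    # Reject years far in the past or implausibly far in the future.
    if MIN_YEAR > year or year > max_year:
        return None
    return year
```

Dropping the value rather than rejecting the whole record keeps the rest of the
metadata importable; whether that trade-off fits is a judgment call.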
</content>
</entry>
<entry>
<title>pubmed: handle multiple ReferenceList</title>
<updated>2020-03-20T20:00:52Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-20T20:00:50Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=a6f74183dd1cf1eaa44f7edeb98dbc5dc737dabb'/>
<id>urn:sha1:a6f74183dd1cf1eaa44f7edeb98dbc5dc737dabb</id>
<content type='text'>
This resolves a situation noticed in prod where we were only
importing/updating a single reference per article.

Includes a regression test.
</content>
</entry>
</feed>
