<feed xmlns='http://www.w3.org/2005/Atom'>
<title>fatcat/python/fatcat_tools, branch v0.3.2</title>
<subtitle>[no description]</subtitle>
<id>https://git.bnewbold.net/fatcat/atom?h=v0.3.2</id>
<link rel='self' href='https://git.bnewbold.net/fatcat/atom?h=v0.3.2'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/'/>
<updated>2020-04-01T22:03:19+00:00</updated>
<entry>
<title>Merge branch 'bnewbold-pubmed-get_text' into 'master'</title>
<updated>2020-04-01T22:03:19+00:00</updated>
<author>
<name>bnewbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-04-01T22:03:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=32f195cec41459045f3d3453dad7a97b38d4e288'/>
<id>urn:sha1:32f195cec41459045f3d3453dad7a97b38d4e288</id>
<content type='text'>
beautifulsoup XML parsing: .string vs. .get_text()

See merge request webgroup/fatcat!40</content>
</entry>
<entry>
<title>pubmed: use untranslated title if translated not available</title>
<updated>2020-04-01T19:02:45+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-04-01T19:02:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=938d2c5366d80618b839c83baadc9b5c62d10dce'/>
<id>urn:sha1:938d2c5366d80618b839c83baadc9b5c62d10dce</id>
<content type='text'>
The primary motivation for this change is that fatcat *requires* a
non-empty title for each release entity. Pubmed/Medline occasionally
indexes just a VernacularTitle with no ArticleTitle for foreign
publications, and currently those records don't end up in fatcat at all.
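
A minimal sketch of the fallback, not the importer's actual code. The
function name pick_title and its two arguments are hypothetical stand-ins
for the text already extracted from ArticleTitle and VernacularTitle;
either may be None or blank.

```python
def pick_title(article_title, vernacular_title):
    """Return the first non-empty title, preferring the translated one."""
    for candidate in (article_title, vernacular_title):
        if candidate and candidate.strip():
            return candidate.strip()
    # Neither title is usable; the caller would have to skip the record,
    # because a release entity needs a non-empty title.
    return None
```

With this ordering the translated title still wins whenever it exists, so
records that already imported cleanly should come out the same.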
</content>
</entry>
<entry>
<title>importers: replace newlines in get_text() strings</title>
<updated>2020-04-01T19:02:20+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-04-01T19:02:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=f77a553350238c8ccc9c3bc0edcf47fb9dd067b3'/>
<id>urn:sha1:f77a553350238c8ccc9c3bc0edcf47fb9dd067b3</id>
<content type='text'>
</content>
</entry>
<entry>
<title>crossref: switch from index-date to update-date</title>
<updated>2020-03-31T04:23:11+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-31T03:56:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=851c40143d44a73a92ff2c9556b3a63f29668c2d'/>
<id>urn:sha1:851c40143d44a73a92ff2c9556b3a63f29668c2d</id>
<content type='text'>
This goes against what the API docs recommend, but we are currently far
behind on updates and need to catch up. Other than what the docs say,
this seems to be consistent with the behavior we want.
</content>
</entry>
<entry>
<title>crossref: longer comment about crossref API date fields</title>
<updated>2020-03-31T03:55:44+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-31T03:55:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=98933a068ec3d918deb0e7dff30aed517ca515d9'/>
<id>urn:sha1:98933a068ec3d918deb0e7dff30aed517ca515d9</id>
<content type='text'>
</content>
</entry>
<entry>
<title>importers: more string/get_text swaps</title>
<updated>2020-03-29T03:12:58+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-29T03:12:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=6681500eeffe39b7d029a0e0d6b2ed83729f555f'/>
<id>urn:sha1:6681500eeffe39b7d029a0e0d6b2ed83729f555f</id>
<content type='text'>
See previous pubmed commit for details.
</content>
</entry>
<entry>
<title>pubmed: bunch of .get_text() instead of .string</title>
<updated>2020-03-29T03:01:48+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-29T03:01:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=d6af7b7544ddb3b5e7b1f4a0fd76bd9cd5ed9125'/>
<id>urn:sha1:d6af7b7544ddb3b5e7b1f4a0fd76bd9cd5ed9125</id>
<content type='text'>
Yikes! Apparently when a tag has child tags, .string will return None
instead of all the strings. .get_text() returns all of it:

  https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
  https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string

I've left things like identifiers as .string, where we expect only a
single string inside.
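
A small sketch of the difference, assuming bs4 is installed. The tree is
built by hand and the tag names and values are only illustrative; this is
not code from the importers.

```python
from bs4 import BeautifulSoup

# Build a title that mixes plain text with a child tag.
soup = BeautifulSoup("", "html.parser")
title = soup.new_tag("ArticleTitle")
title.append("Growth of ")
italic = soup.new_tag("i")
italic.string = "E. coli"
title.append(italic)
title.append(" in culture")

# Build an identifier that holds exactly one string.
identifier = soup.new_tag("ArticleId")
identifier.string = "12345"

mixed_string = title.string        # None, because the tag has several children
mixed_text = title.get_text()      # all nested text, concatenated
single_string = identifier.string  # the lone string child
```

Here mixed_string is None while mixed_text carries the whole title, which
is why free-text fields want .get_text() and single-valued fields can stay
on .string.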
</content>
</entry>
<entry>
<title>ingest: more DOI patterns to treat as OA</title>
<updated>2020-03-29T02:57:41+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-29T02:57:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=4b75a81cbd0faeefa6a0f04b97ecc6832924ee69'/>
<id>urn:sha1:4b75a81cbd0faeefa6a0f04b97ecc6832924ee69</id>
<content type='text'>
These are journal/publisher patterns which we suspect to actually be OA
based on the large quantity of papers that crawl successfully. The
better long-term solution will be to flag containers in some way as OA
(or "should crawl"), but this is a good short-term solution.
</content>
</entry>
<entry>
<title>Merge pull request #53 from EdwardBetts/spelling</title>
<updated>2020-03-27T23:50:08+00:00</updated>
<author>
<name>bnewbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-03-27T23:50:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=98abe2e751187aa7c2e751b355ffb56d9b1f8c6a'/>
<id>urn:sha1:98abe2e751187aa7c2e751b355ffb56d9b1f8c6a</id>
<content type='text'>
Correct spelling mistakes</content>
</entry>
<entry>
<title>Correct spelling mistakes</title>
<updated>2020-03-27T21:25:54+00:00</updated>
<author>
<name>Edward Betts</name>
<email>edward@4angle.com</email>
</author>
<published>2020-03-27T21:25:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=94710b2803780ab16fb30b79010f8e27cf115512'/>
<id>urn:sha1:94710b2803780ab16fb30b79010f8e27cf115512</id>
<content type='text'>
</content>
</entry>
</feed>
