<feed xmlns='http://www.w3.org/2005/Atom'>
<title>fatcat/python/fatcat_tools/workers, branch v0.3.2</title>
<subtitle>[no description]</subtitle>
<id>https://git.bnewbold.net/fatcat/atom?h=v0.3.2</id>
<link rel='self' href='https://git.bnewbold.net/fatcat/atom?h=v0.3.2'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/'/>
<updated>2020-03-29T02:57:41+00:00</updated>
<entry>
<title>ingest: more DOI patterns to treat as OA</title>
<updated>2020-03-29T02:57:41+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-29T02:57:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=4b75a81cbd0faeefa6a0f04b97ecc6832924ee69'/>
<id>urn:sha1:4b75a81cbd0faeefa6a0f04b97ecc6832924ee69</id>
<content type='text'>
These are journal/publisher patterns which we suspect to actually be OA
based on the large quantity of papers that crawl successfully. The
better long-term solution will be to flag containers in some way as OA
(or "should crawl"), but this is a good short-term solution.
</content>
</entry>
<entry>
<title>ingest: always try some lancet journals</title>
<updated>2020-03-20T04:18:18+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-20T04:18:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=b694e74bf72b498301e31459dedfcc1f56400c21'/>
<id>urn:sha1:b694e74bf72b498301e31459dedfcc1f56400c21</id>
<content type='text'>
</content>
</entry>
<entry>
<title>entity worker: ingest more releases</title>
<updated>2020-02-22T23:07:33+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-02-22T22:57:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=dc4116f8ebd225eb4af1cecfc75f5c1291589694'/>
<id>urn:sha1:dc4116f8ebd225eb4af1cecfc75f5c1291589694</id>
<content type='text'>
If the release is a dataset or image, don't make a PDF ingest request.

If the release has a datacite DOI and release_type is a "document", crawl
regardless of is_oa detection. This is mostly to crawl repositories
(institutional or subject).
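
A minimal sketch of the decision described above, not the actual worker
code. The function name, the contents of both type sets, the
`doi_registrar` parameter, and the fallback to `is_oa` are assumptions
made for illustration.

```python
# Hypothetical release types that never get a PDF ingest request.
NON_PDF_RELEASE_TYPES = {"dataset", "image"}

# Hypothetical set of "document"-like release types (an assumption;
# the real list may differ).
DOCUMENT_RELEASE_TYPES = {"article-journal", "report", "thesis", "book"}


def want_pdf_ingest(release_type, doi_registrar, is_oa):
    """Return True if a PDF ingest request should be made for a release."""
    # Datasets and images are never ingested as PDFs.
    if release_type in NON_PDF_RELEASE_TYPES:
        return False
    # Datacite DOIs for document-like releases are crawled regardless
    # of OA detection (mostly repository content).
    if doi_registrar == "datacite" and release_type in DOCUMENT_RELEASE_TYPES:
        return True
    # Assumed fallback: only crawl when the release is detected as OA.
    return bool(is_oa)
```

For example, `want_pdf_ingest("thesis", "datacite", False)` returns True
under these assumptions, while any dataset returns False.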
</content>
</entry>
<entry>
<title>always crawl researchgate DOIs</title>
<updated>2020-02-19T07:08:58+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-02-19T07:08:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=9cc369e19d7ba82d07be3d6b24c2526339135a0a'/>
<id>urn:sha1:9cc369e19d7ba82d07be3d6b24c2526339135a0a</id>
<content type='text'>
Now that ingest is fixed
</content>
</entry>
<entry>
<title>add acceptlist override for biorxiv/medrxiv</title>
<updated>2020-02-11T07:18:50+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-02-11T07:18:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=07fabec32aada55a75c064e5c1e01a46da30d854'/>
<id>urn:sha1:07fabec32aada55a75c064e5c1e01a46da30d854</id>
<content type='text'>
</content>
</entry>
<entry>
<title>fix KafkaError worker reporting for partition errors</title>
<updated>2020-01-29T23:37:38+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-01-29T23:37:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=55a4f211532c93d8164b0d4719dc0413005941ea'/>
<id>urn:sha1:55a4f211532c93d8164b0d4719dc0413005941ea</id>
<content type='text'>
</content>
</entry>
<entry>
<title>additional DOI prefix filters</title>
<updated>2020-01-29T03:34:47+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-01-29T03:34:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=a889f3212586ffb85961ad08af32a53e46e0382d'/>
<id>urn:sha1:a889f3212586ffb85961ad08af32a53e46e0382d</id>
<content type='text'>
From martin, thanks.
</content>
</entry>
<entry>
<title>apply ingest request filtering in entity worker</title>
<updated>2020-01-28T21:34:56+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-01-28T21:34:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=943409c2283faa9a6d04ccc6e43886224170e4f2'/>
<id>urn:sha1:943409c2283faa9a6d04ccc6e43886224170e4f2</id>
<content type='text'>
The `ingest_oa_only` behavior and other filters are now handled in the
entity update worker instead of in the transform function.

Also add a DOI prefix blocklist feature.
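
A DOI prefix blocklist check could look roughly like the sketch below.
The prefixes, the constant name, and the function name are hypothetical
and are not taken from the worker code.

```python
# Hypothetical prefixes, for illustration only.
DOI_PREFIX_BLOCKLIST = ("10.5555/", "10.9999/blocked.")


def doi_is_blocked(doi, blocklist=DOI_PREFIX_BLOCKLIST):
    """Return True if the DOI starts with any blocklisted prefix."""
    # Releases without a DOI can't match a prefix.
    if not doi:
        return False
    # DOIs are case-insensitive, so compare in lowercase.
    # str.startswith accepts a tuple of candidate prefixes.
    return doi.lower().startswith(tuple(blocklist))
```

A worker would skip the ingest request when `doi_is_blocked(doi)` is
true and fall through to the other filters otherwise.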
</content>
</entry>
<entry>
<title>update ingest request schema</title>
<updated>2019-12-14T02:07:53+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2019-12-14T01:43:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=9c1fd7cb8e60c397fa6defef2f0dc1eacc8d8aa7'/>
<id>urn:sha1:9c1fd7cb8e60c397fa6defef2f0dc1eacc8d8aa7</id>
<content type='text'>
This is mostly changing ingest_type from 'file' to 'pdf', and adding
'link_source'/'link_source_id', plus some small cleanups.
</content>
</entry>
<entry>
<title>project -&gt; ingest_request_source</title>
<updated>2019-11-16T00:51:55+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2019-11-16T00:49:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=4693394d69667570a81126ea727e9ad0ed8e1582'/>
<id>urn:sha1:4693394d69667570a81126ea727e9ad0ed8e1582</id>
<content type='text'>
</content>
</entry>
</feed>
