<feed xmlns='http://www.w3.org/2005/Atom'>
<title>fatcat/python/fatcat_tools/workers, branch bnewbold-rust-gen-v5</title>
<subtitle>[no description]</subtitle>
<id>https://git.bnewbold.net/fatcat/atom?h=bnewbold-rust-gen-v5</id>
<link rel='self' href='https://git.bnewbold.net/fatcat/atom?h=bnewbold-rust-gen-v5'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/'/>
<updated>2020-04-17T23:29:40Z</updated>
<entry>
<title>more changelog ES fixes</title>
<updated>2020-04-17T23:29:40Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-04-17T23:29:40Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=9e9f7f1da115458d87f8bcdc011b2843e2b31d3b'/>
<id>urn:sha1:9e9f7f1da115458d87f8bcdc011b2843e2b31d3b</id>
<content type='text'>
</content>
</entry>
<entry>
<title>ES changelog worker: fixes for ident; fetch update from API if needed</title>
<updated>2020-04-17T22:32:20Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-04-17T22:32:18Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=026e352f5d99652f088b6bcdc28d43106b8f52d2'/>
<id>urn:sha1:026e352f5d99652f088b6bcdc28d43106b8f52d2</id>
<content type='text'>
Fetching the update from the API may be needed for old changelog entries
in the Kafka feed.
</content>
</entry>
<entry>
<title>Merge branch 'martin-changelog-to-es' into 'master'</title>
<updated>2020-04-17T18:13:14Z</updated>
<author>
<name>bnewbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-04-17T18:13:14Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=963faf6cf6e7e5c6685ffe89e080134c7590957f'/>
<id>urn:sha1:963faf6cf6e7e5c6685ffe89e080134c7590957f</id>
<content type='text'>
derive changelog worker from release worker

See merge request webgroup/fatcat!43</content>
</entry>
<entry>
<title>derive changelog worker from release worker</title>
<updated>2020-04-17T12:43:31Z</updated>
<author>
<name>Martin Czygan</name>
<email>martin.czygan@gmail.com</email>
</author>
<published>2020-04-17T12:30:57Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=89db8df9eef40b92454ed9bd64830ebe5b726b9a'/>
<id>urn:sha1:89db8df9eef40b92454ed9bd64830ebe5b726b9a</id>
<content type='text'>
Early versions of changelog entries may not have all the fields
required for the current transform.
</content>
</entry>
<entry>
<title>changelog: limit types</title>
<updated>2020-04-16T18:54:20Z</updated>
<author>
<name>Martin Czygan</name>
<email>martin.czygan@gmail.com</email>
</author>
<published>2020-04-16T18:54:20Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=e063d2f470951f1735ccca8c6ea4b37029a6fede'/>
<id>urn:sha1:e063d2f470951f1735ccca8c6ea4b37029a6fede</id>
<content type='text'>
Exclude partial docs (e.g. abstract), overly generic types (component,
entry), and HTML blogs.
</content>
</entry>
<entry>
<title>changelog: extend release_types considered documents</title>
<updated>2020-04-15T23:22:57Z</updated>
<author>
<name>Martin Czygan</name>
<email>martin.czygan@gmail.com</email>
</author>
<published>2020-04-15T23:17:45Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=0071b77eb7fc20be4af1bbf9b6c0bfcb4e26816a'/>
<id>urn:sha1:0071b77eb7fc20be4af1bbf9b6c0bfcb4e26816a</id>
<content type='text'>
According to release_rev.release_type, there are 29 distinct values
(including NULL, shown as the blank row with a count of 0):

    fatcat_prod=# select release_type, count(release_type) from release_rev group by release_type;

       release_type    |   count
    -------------------+-----------
     abstract          |      2264
     article           |   6371076
     article-journal   | 101083841
     article-newspaper |     17062
     book              |   1676941
     chapter           |  13914854
     component         |     58990
     dataset           |   6860325
     editorial         |    133573
     entry             |   1628487
     graphic           |   1809471
     interview         |     19898
     legal_case        |      3581
     legislation       |      1626
     letter            |    275119
     paper-conference  |   6074669
     peer_review       |     30581
     post              |    245807
     post-weblog       |       135
     report            |   1010699
     retraction        |      1292
     review-book       |     96219
     software          |       316
     song              |     24027
     speech            |      4263
     standard          |    312364
     stub              |   1036813
     thesis            |    414397
                       |         0
    (29 rows)
</content>
</entry>
<entry>
<title>ingest: more DOI patterns to treat as OA</title>
<updated>2020-03-29T02:57:41Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-29T02:57:35Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=4b75a81cbd0faeefa6a0f04b97ecc6832924ee69'/>
<id>urn:sha1:4b75a81cbd0faeefa6a0f04b97ecc6832924ee69</id>
<content type='text'>
These are journal/publisher patterns which we suspect to actually be OA
based on the large quantity of papers that crawl successfully. The
better long-term solution will be to flag containers in some way as OA
(or "should crawl"), but this is a good short-term solution.
</content>
</entry>
<entry>
<title>ingest: always try some lancet journals</title>
<updated>2020-03-20T04:18:18Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-03-20T04:18:18Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=b694e74bf72b498301e31459dedfcc1f56400c21'/>
<id>urn:sha1:b694e74bf72b498301e31459dedfcc1f56400c21</id>
<content type='text'>
</content>
</entry>
<entry>
<title>entity worker: ingest more releases</title>
<updated>2020-02-22T23:07:33Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-02-22T22:57:44Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=dc4116f8ebd225eb4af1cecfc75f5c1291589694'/>
<id>urn:sha1:dc4116f8ebd225eb4af1cecfc75f5c1291589694</id>
<content type='text'>
If release is a dataset or image, don't do a PDF ingest request.

If release is a datacite DOI, and release_type is a "document", crawl
regardless of is_oa detection. This is mostly to crawl repositories
(institutional or subject).
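
The two rules above can be sketched roughly as follows. This is only an
illustration, not the worker's actual code: the function name, its
parameters, and both type sets are assumptions made for the example
(in particular, which release types count as a "document" or an "image"
is not stated here).

```python
# Hypothetical type sets; the real lists are not given in this message.
NON_PDF_TYPES = frozenset(["dataset", "graphic"])
DOCUMENT_TYPES = frozenset([
    "article-journal",
    "paper-conference",
    "report",
    "thesis",
    "book",
    "chapter",
])


def want_pdf_ingest(release_type, is_datacite_doi, is_oa):
    """Decide whether to emit a PDF ingest request (illustrative only)."""
    # Rule 1: datasets and images never get a PDF ingest request.
    if release_type in NON_PDF_TYPES:
        return False
    # Rule 2: DataCite "document" releases are crawled regardless of is_oa.
    if is_datacite_doi and release_type in DOCUMENT_TYPES:
        return True
    # Otherwise fall back to the existing OA detection.
    return bool(is_oa)
```

Under these assumptions, a DataCite thesis is crawled even when OA
detection fails, while a dataset is skipped even when it is flagged OA.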
</content>
</entry>
<entry>
<title>always crawl researchgate DOIs</title>
<updated>2020-02-19T07:08:58Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@robocracy.org</email>
</author>
<published>2020-02-19T07:08:56Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/fatcat/commit/?id=9cc369e19d7ba82d07be3d6b24c2526339135a0a'/>
<id>urn:sha1:9cc369e19d7ba82d07be3d6b24c2526339135a0a</id>
<content type='text'>
Now that ingest is fixed
</content>
</entry>
</feed>
