<feed xmlns='http://www.w3.org/2005/Atom'>
<title>sandcrawler/please, branch bnewbold-persist-grobid-errors</title>
<subtitle>[no description]</subtitle>
<id>https://git.bnewbold.net/sandcrawler/atom?h=bnewbold-persist-grobid-errors</id>
<link rel='self' href='https://git.bnewbold.net/sandcrawler/atom?h=bnewbold-persist-grobid-errors'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/'/>
<updated>2019-09-26T00:54:56+00:00</updated>
<entry>
<title>point 'please' to python_hadoop</title>
<updated>2019-09-26T00:54:56+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-09-26T00:54:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=353dc0c2954d9f834fcccb49558728e326abca5b'/>
<id>urn:sha1:353dc0c2954d9f834fcccb49558728e326abca5b</id>
<content type='text'>
</content>
</entry>
<entry>
<title>GroupFatcatWorksSubsetJob</title>
<updated>2019-08-26T21:25:34+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-08-26T21:25:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=d2d545675b3c85c3e1b41fd4bda23230d995bf47'/>
<id>urn:sha1:d2d545675b3c85c3e1b41fd4bda23230d995bf47</id>
<content type='text'>
This is a hack-y variant of GroupFatcatWorksJob which allows
setting different left and right sides of the join. The initial
application is to re-run work merging with only longtail-oa works on the
"left", with the goal of hard-merging these releases into existing
releases with actual identifiers (instead of just grouping into works).

As a refactor, the normal GroupFatcatWorksJob could just be this with
the same file passed as both left and right, though that requires twice
as much JSON parsing/filtering.
</content>
</entry>
<entry>
<title>please command for groupworksfatcat</title>
<updated>2019-08-11T03:20:34+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-08-11T02:50:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=952457f9fae1cc25cdeeefc00e19ae20cf86c659'/>
<id>urn:sha1:952457f9fae1cc25cdeeefc00e19ae20cf86c659</id>
<content type='text'>
</content>
</entry>
<entry>
<title>please: add staging config (commented out)</title>
<updated>2019-07-08T02:40:37+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-07-08T02:40:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=4298e1f9c7a092602e1cbe46add13936cb6169e7'/>
<id>urn:sha1:4298e1f9c7a092602e1cbe46add13936cb6169e7</id>
<content type='text'>
</content>
</entry>
<entry>
<title>scalding dump-grobid-status-code job</title>
<updated>2019-04-12T21:19:29+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-04-12T20:48:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=d93ebaa691f8b200a5761850b4533a153cb457ee'/>
<id>urn:sha1:d93ebaa691f8b200a5761850b4533a153cb457ee</id>
<content type='text'>
</content>
</entry>
<entry>
<title>set long timeout on HBaseStatusCountJob</title>
<updated>2019-02-26T19:27:26+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-02-26T19:27:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=939ef856b204c29ab6ab4038ff698a402a934c8f'/>
<id>urn:sha1:939ef856b204c29ab6ab4038ff698a402a934c8f</id>
<content type='text'>
</content>
</entry>
<entry>
<title>longer match-crossref timeout</title>
<updated>2018-12-18T22:36:55+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-12-18T22:36:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=e8ba7a0bc8d4924f6601b4c82ead58e9f69d8aca'/>
<id>urn:sha1:e8ba7a0bc8d4924f6601b4c82ead58e9f69d8aca</id>
<content type='text'>
</content>
</entry>
<entry>
<title>please support DumpGrobidXmlJob</title>
<updated>2018-10-30T21:43:57+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-10-30T21:43:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=ee9230fd0fe96a07006b496d678681bae47cb943'/>
<id>urn:sha1:ee9230fd0fe96a07006b496d678681bae47cb943</id>
<content type='text'>
</content>
</entry>
<entry>
<title>please support for DumpGrobidMetaInsertableJob</title>
<updated>2018-09-23T03:33:17+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-09-23T03:33:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=4a5912e23ae8d58edad64931ed290779c0e1689c'/>
<id>urn:sha1:4a5912e23ae8d58edad64931ed290779c0e1689c</id>
<content type='text'>
</content>
</entry>
<entry>
<title>dumpfilemeta support in please</title>
<updated>2018-09-14T04:14:22+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-09-14T04:14:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=52a88e314329ab5bac6217b2b3f2fcbe99740318'/>
<id>urn:sha1:52a88e314329ab5bac6217b2b3f2fcbe99740318</id>
<content type='text'>
</content>
</entry>
</feed>
