<feed xmlns='http://www.w3.org/2005/Atom'>
<title>sandcrawler/python, branch bnewbold-persist-grobid-errors</title>
<subtitle>[no description]</subtitle>
<id>https://git.bnewbold.net/sandcrawler/atom?h=bnewbold-persist-grobid-errors</id>
<link rel='self' href='https://git.bnewbold.net/sandcrawler/atom?h=bnewbold-persist-grobid-errors'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/'/>
<updated>2020-01-29T03:10:40+00:00</updated>
<entry>
<title>grobid persist: if status_code is not set, default to 0</title>
<updated>2020-01-29T03:10:40+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-29T03:06:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=480dc3fa20102ba0a15013954e76d3b1f826026c'/>
<id>urn:sha1:480dc3fa20102ba0a15013954e76d3b1f826026c</id>
<content type='text'>
We have to set something currently because of a NOT NULL constraint on
the table.

Originally I thought we would just not record rows if there was an
error, and that is still sort of a valid stance. However, when doing
bulk GROBID-ing from the cdx table, there exist some "bad" CDX rows which
cause wayback or petabox errors. We should fix bugs or delete these rows
as a cleanup, but until that happens we should record the error state so
we don't loop forever.

One danger of this commit is that we can clobber existing good rows with
new errors rapidly if there is wayback downtime or something like that.
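
A minimal sketch of the fallback described above (field names are
illustrative, not the exact sandcrawler schema):

```python
# Hedged sketch: default a missing GROBID status_code to 0 so the
# NOT NULL constraint on the table is satisfied.
def prepare_grobid_row(result):
    row = dict(result)
    if row.get('status_code') is None:
        row['status_code'] = 0
    return row
```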
</content>
</entry>
<entry>
<title>workers: yes, poll is necessary</title>
<updated>2020-01-29T03:10:40+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-29T03:09:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=6448a4bb41cd9301bc5c6c7ea0bd8b12c2423e39'/>
<id>urn:sha1:6448a4bb41cd9301bc5c6c7ea0bd8b12c2423e39</id>
<content type='text'>
</content>
</entry>
<entry>
<title>grobid worker: always set a key in response</title>
<updated>2020-01-29T02:55:39+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-29T02:55:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=08377ca3fdb7103ce0e0a98f7ae9e2baa39febf8'/>
<id>urn:sha1:08377ca3fdb7103ce0e0a98f7ae9e2baa39febf8</id>
<content type='text'>
We have key-based compaction enabled for the GROBID output topic. This
means it is an error to publish to that topic without a key set.

Hopefully this change will end these errors, which look like:

  KafkaError{code=INVALID_MSG,val=2,str="Broker: Invalid message"}
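
A hedged sketch of the fix (function and sentinel key are illustrative,
not sandcrawler's exact code):

```python
import json

# Ensure every message published to the compacted topic carries a key.
# Fall back to a sentinel key rather than omitting it entirely, since
# a compacted topic rejects keyless messages.
def grobid_response_message(result):
    key = result.get('key') or 'unknown'
    return key, json.dumps(result)
```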
</content>
</entry>
<entry>
<title>fix kafka worker partition-specific error</title>
<updated>2020-01-28T21:07:16+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-28T21:07:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=e0c2cc4b1a41b5de40c9e3adc9cba36d4dc93ed1'/>
<id>urn:sha1:e0c2cc4b1a41b5de40c9e3adc9cba36d4dc93ed1</id>
<content type='text'>
</content>
</entry>
<entry>
<title>fix WaybackError exception formatting</title>
<updated>2020-01-28T21:04:33+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-28T21:04:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=f22390203f03907f3a49ff9a24fb1a5ec40c65f1'/>
<id>urn:sha1:f22390203f03907f3a49ff9a24fb1a5ec40c65f1</id>
<content type='text'>
</content>
</entry>
<entry>
<title>fix elif syntax error</title>
<updated>2020-01-28T20:55:32+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-28T20:55:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=446c5679c2c4299e6e6766277acf2956779669f1'/>
<id>urn:sha1:446c5679c2c4299e6e6766277acf2956779669f1</id>
<content type='text'>
</content>
</entry>
<entry>
<title>block springer page-one domain</title>
<updated>2020-01-28T20:52:48+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-28T20:52:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=b9237268c61777a28f5d8e512b326337715aab44'/>
<id>urn:sha1:b9237268c61777a28f5d8e512b326337715aab44</id>
<content type='text'>
</content>
</entry>
<entry>
<title>clarify petabox fetch behavior</title>
<updated>2020-01-28T20:52:24+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-28T20:52:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=084807ee51f6b5844b323a1217a70b2f12ee966d'/>
<id>urn:sha1:084807ee51f6b5844b323a1217a70b2f12ee966d</id>
<content type='text'>
</content>
</entry>
<entry>
<title>re-enable figshare and zenodo crawling</title>
<updated>2020-01-21T19:34:39+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-21T19:34:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=2e93f94c9ebba689dde252ca8f5b106765cece88'/>
<id>urn:sha1:2e93f94c9ebba689dde252ca8f5b106765cece88</id>
<content type='text'>
For daily imports
</content>
</entry>
<entry>
<title>persist grobid: actually, status_code is required</title>
<updated>2020-01-21T19:32:51+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-21T19:32:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=20291471b34ea559d2ea5d45f3b05884e54d179a'/>
<id>urn:sha1:20291471b34ea559d2ea5d45f3b05884e54d179a</id>
<content type='text'>
Instead of working around it when missing, force status_code to exist,
but skip it in the database insert section.

Disk mode still needs to check if blank.
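
One way to read this, sketched under the assumption that "skip" means
records still missing a status_code are dropped at insert time (names
are illustrative):

```python
# Hedged sketch: status_code is now expected on every record, but any
# record that still lacks one is skipped when building database inserts.
def rows_to_insert(records):
    for rec in records:
        if rec.get('status_code') is None:
            continue  # skip incomplete records at insert time
        yield rec
```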
</content>
</entry>
</feed>
