<feed xmlns='http://www.w3.org/2005/Atom'>
<title>sandcrawler/sql, branch bnewbold-persist-grobid-errors</title>
<subtitle>[no description]</subtitle>
<id>https://git.bnewbold.net/sandcrawler/atom?h=bnewbold-persist-grobid-errors</id>
<link rel='self' href='https://git.bnewbold.net/sandcrawler/atom?h=bnewbold-persist-grobid-errors'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/'/>
<updated>2020-01-29T03:10:40Z</updated>
<entry>
<title>grobid persist: if status_code is not set, default to 0</title>
<updated>2020-01-29T03:10:40Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-29T03:06:25Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=480dc3fa20102ba0a15013954e76d3b1f826026c'/>
<id>urn:sha1:480dc3fa20102ba0a15013954e76d3b1f826026c</id>
<content type='text'>
We have to set something currently because of a NOT NULL constraint on
the table.

Originally I thought we would just not record rows if there was an
error, and that is still sort of a valid stance. However, when doing
bulk GROBID-ing from the cdx table, there are some "bad" CDX rows which
cause wayback or petabox errors. We should fix the bugs or delete these
rows as a cleanup, but until that happens we should record the error
state so we don't loop forever.

One danger of this commit is that we can clobber existing good rows with
new errors rapidly if there is wayback downtime or something like that.
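
A minimal sketch of the defaulting described above, not the actual
sandcrawler code. The function name grobid_row and the record keys are
hypothetical; the only behavior taken from this commit is that a missing
status_code is stored as 0 so the NOT NULL constraint is satisfied.

```python
def grobid_row(record):
    """Build a row dict for a hypothetical grobid table from a result record."""
    status_code = record.get("status_code")
    if status_code is None:
        # The column is NOT NULL, so store 0 to mean "no HTTP status recorded"
        # (for example, the fetch failed before GROBID was ever called).
        status_code = 0
    return {
        "sha1hex": record["sha1hex"],    # hypothetical key name
        "status": record.get("status"),  # hypothetical key name
        "status_code": status_code,
    }
```

An error record with no status_code gets a row with status_code 0, so it
is not retried forever; a record that already has a status_code keeps it.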
</content>
</entry>
<entry>
<title>sql stats: typo fix</title>
<updated>2020-01-29T03:10:40Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-29T03:10:21Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=9f53880c746b9fd84261e3ab7dbbee81501df394'/>
<id>urn:sha1:9f53880c746b9fd84261e3ab7dbbee81501df394</id>
<content type='text'>
</content>
</entry>
<entry>
<title>sql howto: database dumps</title>
<updated>2020-01-29T03:10:40Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-29T03:09:54Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=46171ed3a86e90d8c519a8ce94f379309936fbeb'/>
<id>urn:sha1:46171ed3a86e90d8c519a8ce94f379309936fbeb</id>
<content type='text'>
</content>
</entry>
<entry>
<title>clarify ingest result schema and semantics</title>
<updated>2020-01-15T21:54:02Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-15T21:54:02Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=d06fd45e3c86cb080ad7724f3fc7575750a9cd69'/>
<id>urn:sha1:d06fd45e3c86cb080ad7724f3fc7575750a9cd69</id>
<content type='text'>
</content>
</entry>
<entry>
<title>database stats</title>
<updated>2020-01-15T07:52:33Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-15T07:52:33Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=9c97db0ffcb2350a7231ab388c643d953d77274f'/>
<id>urn:sha1:9c97db0ffcb2350a7231ab388c643d953d77274f</id>
<content type='text'>
</content>
</entry>
<entry>
<title>sql: more cool random queries</title>
<updated>2020-01-03T02:13:03Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2020-01-03T02:11:24Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=fdff17d2cdf4ac92a0458403bb4ca0b073a7752b'/>
<id>urn:sha1:fdff17d2cdf4ac92a0458403bb4ca0b073a7752b</id>
<content type='text'>
</content>
</entry>
<entry>
<title>SQL docs update for diesel change</title>
<updated>2020-01-03T02:12:58Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-12-25T00:35:45Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=b603a43e45eb3afa01efd0902c5af56f29d979a2'/>
<id>urn:sha1:b603a43e45eb3afa01efd0902c5af56f29d979a2</id>
<content type='text'>
</content>
</entry>
<entry>
<title>move SQL schema to diesel migration pattern</title>
<updated>2020-01-03T02:12:58Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-12-19T06:08:38Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=9beb3caee51c6bb0403c658a71c965dde4c8e55b'/>
<id>urn:sha1:9beb3caee51c6bb0403c658a71c965dde4c8e55b</id>
<content type='text'>
</content>
</entry>
<entry>
<title>add some GROBID metadata schema docs to SQL schema</title>
<updated>2019-12-12T02:20:13Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-12-12T02:20:13Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=91f5f53c90742c80890e3bd44fdc9044555b8209'/>
<id>urn:sha1:91f5f53c90742c80890e3bd44fdc9044555b8209</id>
<content type='text'>
</content>
</entry>
<entry>
<title>add note to CDX backfill script that we should be filtering (oops)</title>
<updated>2019-11-12T21:22:43Z</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-11-12T21:22:43Z</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=9529cbb2660897ce3ffe3986f60eafbf3596495d'/>
<id>urn:sha1:9529cbb2660897ce3ffe3986f60eafbf3596495d</id>
<content type='text'>
</content>
</entry>
</feed>
