<feed xmlns='http://www.w3.org/2005/Atom'>
<title>sandcrawler/python, branch master</title>
<subtitle>[no description]</subtitle>
<id>https://git.bnewbold.net/sandcrawler/atom?h=master</id>
<link rel='self' href='https://git.bnewbold.net/sandcrawler/atom?h=master'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/'/>
<updated>2023-01-05T03:38:16+00:00</updated>
<entry>
<title>pytest: skip warning in gwb</title>
<updated>2023-01-05T03:38:16+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2023-01-05T03:38:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=ebc7b6a74b7cc2297c9c291e74ef5466a4753b25'/>
<id>urn:sha1:ebc7b6a74b7cc2297c9c291e74ef5466a4753b25</id>
<content type='text'>
</content>
</entry>
<entry>
<title>mypy lint fixes</title>
<updated>2023-01-05T03:37:07+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2023-01-05T03:37:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=5f73b6428f4b505880ef02429d57f11dc50d98e5'/>
<id>urn:sha1:5f73b6428f4b505880ef02429d57f11dc50d98e5</id>
<content type='text'>
</content>
</entry>
<entry>
<title>python-specific README file</title>
<updated>2023-01-03T03:10:01+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2023-01-03T03:10:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=e433990172c157707d92452652aefe2f21b6a4a0'/>
<id>urn:sha1:e433990172c157707d92452652aefe2f21b6a4a0</id>
<content type='text'>
</content>
</entry>
<entry>
<title>bump python deps</title>
<updated>2022-12-23T23:54:33+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-12-23T23:54:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=b7e4629f3c84f35af5ad62346a9480bea957c719'/>
<id>urn:sha1:b7e4629f3c84f35af5ad62346a9480bea957c719</id>
<content type='text'>
</content>
</entry>
<entry>
<title>bad pdf hash</title>
<updated>2022-12-16T19:17:24+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-12-16T19:17:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=b42103eb35f8eb55bda03facb8b14a366fd544c2'/>
<id>urn:sha1:b42103eb35f8eb55bda03facb8b14a366fd544c2</id>
<content type='text'>
</content>
</entry>
<entry>
<title>sandcrawler: try to handle weird CDX API response</title>
<updated>2022-11-01T23:43:40+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-11-01T23:43:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=dcc0fe1a61c6816e519cfad95ec12d8abe5ddd29'/>
<id>urn:sha1:dcc0fe1a61c6816e519cfad95ec12d8abe5ddd29</id>
<content type='text'>
Hard to debug this because Sentry is broken.
</content>
</entry>
<entry>
<title>ingest: more generic OJS support, including pre-prints</title>
<updated>2022-10-25T01:35:04+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-10-25T01:35:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=a90b604c189bc5655d4a050a9241dfe0b34dbc5b'/>
<id>urn:sha1:a90b604c189bc5655d4a050a9241dfe0b34dbc5b</id>
<content type='text'>
Some '/article/view/' URL patterns can also appear as, e.g.,
'/preprint/view/'.
</content>
</entry>
<entry>
<title>ingest: more generic PDF fulltext URL patterns</title>
<updated>2022-10-24T21:22:39+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-10-24T21:22:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=d8f82f5836004d394a419574c50f0636369c94d7'/>
<id>urn:sha1:d8f82f5836004d394a419574c50f0636369c94d7</id>
<content type='text'>
</content>
</entry>
<entry>
<title>ingest: another wall pattern, and check for walls in more places</title>
<updated>2022-10-24T21:22:17+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-10-24T21:22:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=5563cb5121c94efcf1819b915e7e7c602215a6e5'/>
<id>urn:sha1:5563cb5121c94efcf1819b915e7e7c602215a6e5</id>
<content type='text'>
</content>
</entry>
<entry>
<title>ingest: don't prefer WARC over SPN so strongly</title>
<updated>2022-10-24T21:17:46+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-10-24T21:17:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=4f0d10f4b38534eda673a8dfe28e3a58af9a8a8a'/>
<id>urn:sha1:4f0d10f4b38534eda673a8dfe28e3a58af9a8a8a</id>
<content type='text'>
We generally prefer an older WARC record over an SPN record, because the
lookup is easier. However, this was causing problems with repeated ingest,
so demote that preference.

We may want to make this more configurable in the future, so things like
HTML sub-resource lookups or bulk ingest won't prefer random new SPN
captures.
</content>
</entry>
</feed>
