<feed xmlns='http://www.w3.org/2005/Atom'>
<title>sandcrawler/python, branch bnewbold-refactor-loggging</title>
<subtitle>[no description]</subtitle>
<id>https://git.bnewbold.net/sandcrawler/atom?h=bnewbold-refactor-loggging</id>
<link rel='self' href='https://git.bnewbold.net/sandcrawler/atom?h=bnewbold-refactor-loggging'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/'/>
<updated>2022-07-12T22:03:29+00:00</updated>
<entry>
<title>WIP: refactor logging calls in ingest pipelines</title>
<updated>2022-07-12T22:03:29+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-07-12T22:03:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=c15432c0ce52c48efabcd7e3221a5d625ef3e9d0'/>
<id>urn:sha1:c15432c0ce52c48efabcd7e3221a5d625ef3e9d0</id>
<content type='text'>
</content>
</entry>
<entry>
<title>ingest: IEEE domain is blocking us</title>
<updated>2022-07-07T20:17:49+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-07-07T20:17:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=695a80a64f02f4c23bb938ecfffeef146344841f'/>
<id>urn:sha1:695a80a64f02f4c23bb938ecfffeef146344841f</id>
<content type='text'>
</content>
</entry>
<entry>
<title>ingest: catch more ConnectionErrors (SPN, replay fetch, GROBID)</title>
<updated>2022-05-16T22:02:02+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-05-16T22:02:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=fcc5a1648d2e49e7002ca569ed668d3318a75584'/>
<id>urn:sha1:fcc5a1648d2e49e7002ca569ed668d3318a75584</id>
<content type='text'>
</content>
</entry>
<entry>
<title>ingest: skip arxiv.org DOIs, we already direct-ingest</title>
<updated>2022-05-11T19:19:48+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-05-11T19:19:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=1534ff4d05c6fca460e82b5707fe3fbdc3504e50'/>
<id>urn:sha1:1534ff4d05c6fca460e82b5707fe3fbdc3504e50</id>
<content type='text'>
</content>
</entry>
<entry>
<title>python make fmt</title>
<updated>2022-05-05T18:21:35+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-05-05T18:21:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=a0214959c10a5ecb794d78b189a767ac01c0af48'/>
<id>urn:sha1:a0214959c10a5ecb794d78b189a767ac01c0af48</id>
<content type='text'>
</content>
</entry>
<entry>
<title>ingest spn2: fix tests</title>
<updated>2022-05-05T18:21:29+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-05-05T18:21:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=21ad5cd9942044939c8203dd076ea080b6d55a61'/>
<id>urn:sha1:21ad5cd9942044939c8203dd076ea080b6d55a61</id>
<content type='text'>
</content>
</entry>
<entry>
<title>ingest: more loginwall patterns</title>
<updated>2022-05-05T18:08:52+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-05-05T18:08:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=1f9ca570bd168154a72adcd2454b992dbc7e8d0a'/>
<id>urn:sha1:1f9ca570bd168154a72adcd2454b992dbc7e8d0a</id>
<content type='text'>
</content>
</entry>
<entry>
<title>ingest_tool: fix arg parsing</title>
<updated>2022-05-04T00:35:52+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-05-04T00:35:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=1ec661af75f37b3ae5031851f6c452039e08503c'/>
<id>urn:sha1:1ec661af75f37b3ae5031851f6c452039e08503c</id>
<content type='text'>
</content>
</entry>
<entry>
<title>switch default kafka-broker host from wbgrp-svc263 to wbgrp-svc350</title>
<updated>2022-05-04T00:12:48+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-05-04T00:12:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=00ae74378413e87f230c88113ff8163a6f969d63'/>
<id>urn:sha1:00ae74378413e87f230c88113ff8163a6f969d63</id>
<content type='text'>
</content>
</entry>
<entry>
<title>SPNv2: several fixes for prod throughput</title>
<updated>2022-04-26T22:25:23+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2022-04-26T22:25:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=8f9240fed272367669b535b1334e280c588a1791'/>
<id>urn:sha1:8f9240fed272367669b535b1334e280c588a1791</id>
<content type='text'>
Most importantly, for some API flags, if the value is not truthy, do
not set the flag at all. Setting any flag was resulting in screenshots
and outlinks actually getting created/captured, which was a huge
slowdown.
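
A minimal sketch of the flag handling described above, not the actual
sandcrawler code. The function name, the flag names, and the "1" value are
assumptions made for illustration; the point is only that a falsy flag is
left out of the request instead of being sent with a false value.

```python
def build_capture_params(url, capture_outlinks=False, capture_screenshot=False):
    """Build request parameters for a hypothetical capture call."""
    params = {"url": url}
    # Hypothetical optional flags; the real API may use different names.
    optional_flags = {
        "capture_outlinks": capture_outlinks,
        "capture_screenshot": capture_screenshot,
    }
    for name, value in optional_flags.items():
        # Only add the key when the value is truthy. A falsy flag is omitted
        # entirely, since its mere presence may be treated as enabled.
        if value:
            params[name] = "1"
    return params


print(build_capture_params("https://example.com/paper.pdf"))
```

With both flags left at their defaults, the resulting parameters contain only
the URL, so nothing in the request hints at screenshots or outlinks.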

Also, check how many per-user SPNv2 slots are available, using the API,
before requesting an actual capture.
</content>
</entry>
</feed>
