<feed xmlns='http://www.w3.org/2005/Atom'>
<title>sandcrawler/mapreduce, branch master</title>
<subtitle>[no description]</subtitle>
<id>https://git.bnewbold.net/sandcrawler/atom?h=master</id>
<link rel='self' href='https://git.bnewbold.net/sandcrawler/atom?h=master'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/'/>
<updated>2018-08-24T19:28:51+00:00</updated>
<entry>
<title>rename ./mapreduce to ./python</title>
<updated>2018-08-24T19:28:51+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-08-24T19:28:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=3782311e29b7e477e1936c89f55ff6483fd02e65'/>
<id>urn:sha1:3782311e29b7e477e1936c89f55ff6483fd02e65</id>
<content type='text'>
</content>
</entry>
<entry>
<title>extraction: do want content, not text</title>
<updated>2018-08-21T23:28:55+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-08-21T23:28:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=6c92ee4c0b137c28abd03ed72190210da8a1e72b'/>
<id>urn:sha1:6c92ee4c0b137c28abd03ed72190210da8a1e72b</id>
<content type='text'>
XML can have non-Unicode characters? Who knew.
</content>
</entry>
<entry>
<title>extraction: status reporting tweaks</title>
<updated>2018-08-21T17:36:09+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-08-21T17:36:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=139ca7e5a90d49c33e23de781b7e4ac21e868fac'/>
<id>urn:sha1:139ca7e5a90d49c33e23de781b7e4ac21e868fac</id>
<content type='text'>
Improvements to how the extraction function in the extraction script
reports status (in output, HBase, and counters).
</content>
</entry>
<entry>
<title>monkey-patch SHA-1 blacklist</title>
<updated>2018-07-05T20:35:07+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-07-05T20:35:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=41c26d2b9ac215f2dacd15f6b7433b909cafb552'/>
<id>urn:sha1:41c26d2b9ac215f2dacd15f6b7433b909cafb552</id>
<content type='text'>
</content>
</entry>
<entry>
<title>doc improvements and fixes to 'please' helper</title>
<updated>2018-06-15T00:41:33+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-06-15T00:41:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=c23ccd1f2d03ad65ee83b8eca8c407d12ecd54e1'/>
<id>urn:sha1:c23ccd1f2d03ad65ee83b8eca8c407d12ecd54e1</id>
<content type='text'>
</content>
</entry>
<entry>
<title>bnewbold-dev &gt; wbgrp-svc263</title>
<updated>2018-06-04T19:42:29+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-05-31T02:40:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=d2dd016aa8da93ad14654237dbb7cfac214f9da8'/>
<id>urn:sha1:d2dd016aa8da93ad14654237dbb7cfac214f9da8</id>
<content type='text'>
This is a new production VM running an HBase-Thrift gateway
</content>
</entry>
<entry>
<title>actually fix oversize inserts</title>
<updated>2018-05-08T03:24:01+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-05-08T03:24:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=0c398392aa298d28694bf5bd37d3e4912de8a2f5'/>
<id>urn:sha1:0c398392aa298d28694bf5bd37d3e4912de8a2f5</id>
<content type='text'>
</content>
</entry>
<entry>
<title>XML size limit</title>
<updated>2018-04-26T18:24:42+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-04-26T18:24:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=ee6ce29e7987f936536a0ef128d3a96cc1df3d86'/>
<id>urn:sha1:ee6ce29e7987f936536a0ef128d3a96cc1df3d86</id>
<content type='text'>
</content>
</entry>
<entry>
<title>force_existing flag for extraction</title>
<updated>2018-04-19T05:15:02+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-04-19T05:14:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=df23b6f45922875f0bf657aea3b8c3fb4451469d'/>
<id>urn:sha1:df23b6f45922875f0bf657aea3b8c3fb4451469d</id>
<content type='text'>
</content>
</entry>
<entry>
<title>NLineInputFormat requires RawProtocol</title>
<updated>2018-04-19T05:15:02+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-04-15T05:44:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=e0d1e381bf536d1c077546526c21eab909444193'/>
<id>urn:sha1:e0d1e381bf536d1c077546526c21eab909444193</id>
<content type='text'>
Should make this a command-line argument or something. Want one in
Hadoop, the other for local/tests/inline/etc.
</content>
</entry>
</feed>
