<feed xmlns='http://www.w3.org/2005/Atom'>
<title>sandcrawler/pig, branch bnewbold-refactor-loggging</title>
<subtitle>[no description]</subtitle>
<id>https://git.bnewbold.net/sandcrawler/atom?h=bnewbold-refactor-loggging</id>
<link rel='self' href='https://git.bnewbold.net/sandcrawler/atom?h=bnewbold-refactor-loggging'/>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/'/>
<updated>2020-01-03T02:12:58+00:00</updated>
<entry>
<title>small (syntax?) changes to pig join script</title>
<updated>2020-01-03T02:12:58+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-12-28T01:21:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=028a0c27a832833e8833e3b3d0e1d6725a48e953'/>
<id>urn:sha1:028a0c27a832833e8833e3b3d0e1d6725a48e953</id>
<content type='text'>
</content>
</entry>
<entry>
<title>pig: first rev of join-cdx-sha1 script</title>
<updated>2019-12-22T22:26:04+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-12-22T22:26:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=28de71e714c1f5d70adcfd3213dc2433a701a430'/>
<id>urn:sha1:28de71e714c1f5d70adcfd3213dc2433a701a430</id>
<content type='text'>
</content>
</entry>
<entry>
<title>pig: move count_lines helper to pighelper.py</title>
<updated>2019-12-22T22:21:25+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-12-22T22:21:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=990f28b283bce4a524b4e32178f45e40214ee0de'/>
<id>urn:sha1:990f28b283bce4a524b4e32178f45e40214ee0de</id>
<content type='text'>
</content>
</entry>
<entry>
<title>new/additional GWB CDX filter scripts</title>
<updated>2019-10-17T16:19:34+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-10-17T16:19:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=54dabe601eaa19d0495d9a102b34e9daa056457d'/>
<id>urn:sha1:54dabe601eaa19d0495d9a102b34e9daa056457d</id>
<content type='text'>
</content>
</entry>
<entry>
<title>add ojs and dspace as in-domain patterns to look for in heuristic CDX PDF filter</title>
<updated>2019-04-12T21:19:29+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2019-04-12T03:14:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=8ac10ab7fe310df55ab5a66d741ea25c24389418'/>
<id>urn:sha1:8ac10ab7fe310df55ab5a66d741ea25c24389418</id>
<content type='text'>
</content>
</entry>
<entry>
<title>rework fetch_hadoop script</title>
<updated>2018-08-24T19:05:41+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-08-24T19:05:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=92584ec4201ecc27af423cbff7b4bc1573edf175'/>
<id>urn:sha1:92584ec4201ecc27af423cbff7b4bc1573edf175</id>
<content type='text'>
Should work on macOS now, and fetches hadoop in addition to pig. Still
requires wget (not installed by default on macOS).
</content>
</entry>
<entry>
<title>commit old tweak to pig script (from cluster)</title>
<updated>2018-07-06T14:48:57+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-07-06T14:48:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=b3149a911df8150056c54f86cd77c3516fc9838c'/>
<id>urn:sha1:b3149a911df8150056c54f86cd77c3516fc9838c</id>
<content type='text'>
</content>
</entry>
<entry>
<title>possibly-broken version of hbase-count-rows.pig</title>
<updated>2018-07-06T14:48:57+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-05-01T22:33:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=fde027fb269214a7e8f71f5717bcf569014b6661'/>
<id>urn:sha1:fde027fb269214a7e8f71f5717bcf569014b6661</id>
<content type='text'>
This just worked a minute ago, but now throws:

org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/util/ByteStringer
</content>
</entry>
<entry>
<title>fix tests post-DISTINCT</title>
<updated>2018-05-08T17:06:20+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-05-08T17:06:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=18a55d37a87d4391bd8161201c523dd7d7f0f1e7'/>
<id>urn:sha1:18a55d37a87d4391bd8161201c523dd7d7f0f1e7</id>
<content type='text'>
Confirms it's working!
</content>
</entry>
<entry>
<title>distinct on SHA1 in cdx scripts</title>
<updated>2018-05-08T16:58:24+00:00</updated>
<author>
<name>Bryan Newbold</name>
<email>bnewbold@archive.org</email>
</author>
<published>2018-05-08T16:58:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.bnewbold.net/sandcrawler/commit/?id=1831a3b4495aee275e4b4b187fa545eba75eb87b'/>
<id>urn:sha1:1831a3b4495aee275e4b4b187fa545eba75eb87b</id>
<content type='text'>
</content>
</entry>
</feed>
