Graph | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | Added repository to find com.hadoop.gplcompression#hadoop-lzo;0.4.16. | Ellen Spertus | 2018-05-22 | 1 | -0/+1 |
* | more tests (failing) | Bryan Newbold | 2018-05-22 | 2 | -1/+56 |
* | update README with invocations | Bryan Newbold | 2018-05-21 | 1 | -0/+13 |
* | point SimpleHBaseSourceExample to actual zookeeper quorum host | Bryan Newbold | 2018-05-21 | 1 | -1/+2 |
* | another attempt at a simple job variation | Bryan Newbold | 2018-05-21 | 1 | -3/+16 |
* | update HBaseRowCountJob based on Simple example | Bryan Newbold | 2018-05-21 | 1 | -10/+11 |
* | spyglass/hbase test examples (from upstream) | Bryan Newbold | 2018-05-21 | 2 | -0/+93 |
* | deps updates: cdh libs, hbase, custom spyglass | Bryan Newbold | 2018-05-21 | 2 | -3/+7 |
* | docs of how to munge around custom spyglass jars | Bryan Newbold | 2018-05-21 | 1 | -0/+19 |
* | add dependencyTree helper plugin | Bryan Newbold | 2018-05-21 | 2 | -1/+2 |
* | building (but nullpointer) spyglass integration | Bryan Newbold | 2018-05-21 | 2 | -3/+27 |
* | more deps locations | Bryan Newbold | 2018-05-21 | 1 | -0/+8 |
* | gitignore for scalding directory | Bryan Newbold | 2018-05-21 | 1 | -0/+3 |
* | fix WordCountJob package; tests; hadoop version | Bryan Newbold | 2018-05-21 | 3 | -2/+43 |
* | WordCount -> WordCountJob | Bryan Newbold | 2018-05-21 | 3 | -13/+13 |
* | success running with com.twitter.scalding.Tool | Bryan Newbold | 2018-05-21 | 2 | -4/+11 |
* | remove main function; class name same as file | Bryan Newbold | 2018-05-21 | 1 | -12/+1 |
* | copy in jvm ecosystem notes | Bryan Newbold | 2018-05-21 | 1 | -0/+46 |
* | copy in scalding learning example | Bryan Newbold | 2018-05-21 | 6 | -0/+93 |
* | jvm/scala/scalding setup notes | Bryan Newbold | 2018-05-17 | 1 | -0/+16 |
* | fix tests post-DISTINCT | Bryan Newbold | 2018-05-08 | 5 | -25/+30 |
* | distinct on SHA1 in cdx scripts | Bryan Newbold | 2018-05-08 | 2 | -6/+18 |
* | pig cdx join improvements | Bryan Newbold | 2018-05-08 | 1 | -1/+1 |
* | how to run pig in production | Bryan Newbold | 2018-05-08 | 1 | -0/+5 |
* | WIP on filter-cdx-join-urls.pig | Bryan Newbold | 2018-05-07 | 1 | -0/+37 |
* | Merge branch 'master' of git.archive.org:webgroup/sandcrawler | Bryan Newbold | 2018-05-08 | 8 | -3/+139 |
|\ | |||||
| * | stale TODO | Bryan Newbold | 2018-05-07 | 1 | -0/+1 |
| * | pig script to filter GWB CDX by SURT regexes | Bryan Newbold | 2018-05-07 | 6 | -0/+127 |
| * | improve pig helper | Bryan Newbold | 2018-05-07 | 1 | -3/+11 |
* | | actually fix oversize inserts | Bryan Newbold | 2018-05-08 | 1 | -7/+7 |
|/ | |||||
* | XML size limit | Bryan Newbold | 2018-04-26 | 1 | -0/+6 |
* | force_existing flag for extraction | Bryan Newbold | 2018-04-19 | 1 | -1/+5 |
* | NLineInputFormat requires RawProtocol | Bryan Newbold | 2018-04-19 | 1 | -1/+2 |
* | local mrjob config | Bryan Newbold | 2018-04-19 | 1 | -0/+6 |
* | switch to new (local) sentry instance | Bryan Newbold | 2018-04-18 | 1 | -1/+1 |
* | notes on attempted vinay setup | Bryan Newbold | 2018-04-18 | 2 | -1/+9 |
* | start adding macOS instructions | Bryan Newbold | 2018-04-16 | 1 | -0/+4 |
* | update Pipfile.lock (new pluggy) | Bryan Newbold | 2018-04-16 | 1 | -59/+64 |
* | use NLineInputFormat so we can control split size | Bryan Newbold | 2018-04-11 | 1 | -0/+1 |
* | revert PYTHONPATH in cmdenv | Bryan Newbold | 2018-04-11 | 1 | -1/+2 |
* | Merge branch 'bnewbold-sentry' | Bryan Newbold | 2018-04-10 | 4 | -19/+31 |
|\ | |||||
| * | prototype sentry integration | Bryan Newbold | 2018-04-10 | 4 | -19/+31 |
* | | don't try to decode GROBID output | Bryan Newbold | 2018-04-11 | 1 | -2/+2 |
|/ | |||||
* | partially lint extraction_cdx_grobid.py | Bryan Newbold | 2018-04-10 | 1 | -8/+6 |
* | yet more test improvements | Bryan Newbold | 2018-04-10 | 2 | -9/+61 |
* | cleanup tests; add one for double-processing | Bryan Newbold | 2018-04-10 | 2 | -20/+43 |
* | TODO updates | Bryan Newbold | 2018-04-10 | 3 | -18/+3 |
* | wayback 404 test | Bryan Newbold | 2018-04-10 | 2 | -5/+49 |
* | extraction test fixes | Bryan Newbold | 2018-04-10 | 2 | -27/+50 |
* | grobid2json test fixes | Bryan Newbold | 2018-04-10 | 2 | -1/+3 |