Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | update version and project name | Bryan Newbold | 2018-05-24 | 3 | -4/+6 | |
| | ||||||
* | cleanup scalding notes/README | Bryan Newbold | 2018-05-24 | 3 | -37/+162 | |
| | ||||||
* | assemblyMergeStrategy deprecation warning | Bryan Newbold | 2018-05-24 | 1 | -2/+2 | |
| | ||||||
* | rename jvm/scalding directories | Bryan Newbold | 2018-05-24 | 14 | -71/+0 | |
| | ||||||
* | fix up HBaseRowCountTest | Bryan Newbold | 2018-05-24 | 2 | -7/+15 | |
| | Again, seems like test fixture must match *exactly* or very obscure errors crop up. | |||||
* | get quorum fields to match, fixing test | Bryan Newbold | 2018-05-24 | 1 | -1/+1 | |
| | Writing this commit message in anger: It seems that the HBaseSource must match exactly between the instantiated Job class and the JobTest. The error when this isn't the case is very obscure: a `None.get()` exception deep in SpyGlass internals. Blech. This may or may not explain other test failure issues. | |||||
* | Added repository to find com.hadoop.gplcompression#hadoop-lzo;0.4.16. | Ellen Spertus | 2018-05-22 | 1 | -0/+1 | |
| | ||||||
* | more tests (failing) | Bryan Newbold | 2018-05-22 | 2 | -1/+56 | |
| | ||||||
* | update README with invocations | Bryan Newbold | 2018-05-21 | 1 | -0/+13 | |
| | ||||||
* | point SimpleHBaseSourceExample to actual zookeeper quorum host | Bryan Newbold | 2018-05-21 | 1 | -1/+2 | |
| | ||||||
* | another attempt at a simple job variation | Bryan Newbold | 2018-05-21 | 1 | -3/+16 | |
| | ||||||
* | update HBaseRowCountJob based on Simple example | Bryan Newbold | 2018-05-21 | 1 | -10/+11 | |
| | ||||||
* | spyglass/hbase test examples (from upstream) | Bryan Newbold | 2018-05-21 | 2 | -0/+93 | |
| | ||||||
* | deps updates: cdh libs, hbase, custom spyglass | Bryan Newbold | 2018-05-21 | 2 | -3/+7 | |
| | ||||||
* | docs of how to munge around custom spyglass jars | Bryan Newbold | 2018-05-21 | 1 | -0/+19 | |
| | ||||||
* | add dependencyTree helper plugin | Bryan Newbold | 2018-05-21 | 2 | -1/+2 | |
| | ||||||
* | building (but nullpointer) spyglass integration | Bryan Newbold | 2018-05-21 | 2 | -3/+27 | |
| | ||||||
* | more deps locations | Bryan Newbold | 2018-05-21 | 1 | -0/+8 | |
| | ||||||
* | gitignore for scalding directory | Bryan Newbold | 2018-05-21 | 1 | -0/+3 | |
| | ||||||
* | fix WordCountJob package; tests; hadoop version | Bryan Newbold | 2018-05-21 | 3 | -2/+43 | |
| | When copying from upstream scalding, forgot to change the path/namespace of the WordCountJob. Production IA cluster is actually running Hadoop 2.5, not 2.6 (I keep forgetting). Pull in more dependencies so test runs (copied from scalding repo, only changed the namespace of the job) | |||||
* | WordCount -> WordCountJob | Bryan Newbold | 2018-05-21 | 3 | -13/+13 | |
| | Also use the exact file from scalding repo | |||||
* | success running with com.twitter.scalding.Tool | Bryan Newbold | 2018-05-21 | 2 | -4/+11 | |
| | ||||||
* | remove main function; class name same as file | Bryan Newbold | 2018-05-21 | 1 | -12/+1 | |
| | ||||||
* | copy in jvm ecosystem notes | Bryan Newbold | 2018-05-21 | 1 | -0/+46 | |
| | ||||||
* | copy in scalding learning example | Bryan Newbold | 2018-05-21 | 6 | -0/+93 | |
| | ||||||
* | jvm/scala/scalding setup notes | Bryan Newbold | 2018-05-17 | 1 | -0/+16 | |
| | ||||||
* | fix tests post-DISTINCT | Bryan Newbold | 2018-05-08 | 5 | -25/+30 | |
| | Confirms it's working! | |||||
* | distinct on SHA1 in cdx scripts | Bryan Newbold | 2018-05-08 | 2 | -6/+18 | |
| | ||||||
* | pig cdx join improvements | Bryan Newbold | 2018-05-08 | 1 | -1/+1 | |
| | ||||||
* | how to run pig in production | Bryan Newbold | 2018-05-08 | 1 | -0/+5 | |
| | ||||||
* | WIP on filter-cdx-join-urls.pig | Bryan Newbold | 2018-05-07 | 1 | -0/+37 | |
| | ||||||
* | Merge branch 'master' of git.archive.org:webgroup/sandcrawler | Bryan Newbold | 2018-05-08 | 8 | -3/+139 | |
|\ | ||||||
| * | stale TODO | Bryan Newbold | 2018-05-07 | 1 | -0/+1 | |
| | | ||||||
| * | pig script to filter GWB CDX by SURT regexes | Bryan Newbold | 2018-05-07 | 6 | -0/+127 | |
| | | ||||||
| * | improve pig helper | Bryan Newbold | 2018-05-07 | 1 | -3/+11 | |
| | | ||||||
* | | actually fix oversize inserts | Bryan Newbold | 2018-05-08 | 1 | -7/+7 | |
|/ | ||||||
* | XML size limit | Bryan Newbold | 2018-04-26 | 1 | -0/+6 | |
| | ||||||
* | force_existing flag for extraction | Bryan Newbold | 2018-04-19 | 1 | -1/+5 | |
| | ||||||
* | NLineInputFormat requires RawProtocol | Bryan Newbold | 2018-04-19 | 1 | -1/+2 | |
| | Should make this a command line argument or something. Want one in HADOOP, the other for local/tests/inline/etc. | |||||
* | local mrjob config | Bryan Newbold | 2018-04-19 | 1 | -0/+6 | |
| | ||||||
* | switch to new (local) sentry instance | Bryan Newbold | 2018-04-18 | 1 | -1/+1 | |
| | ||||||
* | notes on attempted vinay setup | Bryan Newbold | 2018-04-18 | 2 | -1/+9 | |
| | ||||||
* | start adding macOS instructions | Bryan Newbold | 2018-04-16 | 1 | -0/+4 | |
| | ||||||
* | update Pipfile.lock (new pluggy) | Bryan Newbold | 2018-04-16 | 1 | -59/+64 | |
| | ||||||
* | use NLineInputFormat so we can control split size | Bryan Newbold | 2018-04-11 | 1 | -0/+1 | |
| | ||||||
* | revert PYTHONPATH in cmdenv | Bryan Newbold | 2018-04-11 | 1 | -1/+2 | |
| | Seemed to break hadoop jobs for some reason | |||||
* | Merge branch 'bnewbold-sentry' | Bryan Newbold | 2018-04-10 | 4 | -19/+31 | |
|\ | ||||||
| * | prototype sentry integration | Bryan Newbold | 2018-04-10 | 4 | -19/+31 | |
| | | ||||||
* | | don't try to decode GROBID output | Bryan Newbold | 2018-04-11 | 1 | -2/+2 | |
|/ | ||||||
* | partially lint extraction_cdx_grobid.py | Bryan Newbold | 2018-04-10 | 1 | -8/+6 | |
| |