Commit message  Author  Age  Files  Lines
* fetch SpyGlass jar from archive.org (not local)  Bryan Newbold  2018-06-04  2  -19/+7
|
* Provided full path to cascading jar in command line.  Ellen Spertus  2018-05-31  1  -1/+1
|
* Added tip on OutOfMemoryError.  Ellen Spertus  2018-05-31  1  -1/+5
|
* Added debugging info for cascading.tuple.Fields.  Ellen Spertus  2018-05-31  1  -1/+23
|
* switch HBaseRowCountJob to SCAN_ALL  Bryan Newbold  2018-05-29  2  -5/+11
|
* HBaseRowCountJob actually counts rows  Bryan Newbold  2018-05-29  2  -13/+8
|
* update version and project name  Bryan Newbold  2018-05-24  3  -4/+6
|
* cleanup scalding notes/README  Bryan Newbold  2018-05-24  3  -37/+162
|
* assemblyMergeStrategy deprecation warning  Bryan Newbold  2018-05-24  1  -2/+2
|
* rename jvm/scalding directories  Bryan Newbold  2018-05-24  14  -71/+0
|
* fix up HBaseRowCountTest  Bryan Newbold  2018-05-24  2  -7/+15
|     Again, seems like test fixture must match *exactly* or very obscure errors crop up.
* get quorum fields to match, fixing test  Bryan Newbold  2018-05-24  1  -1/+1
|     Writing this commit message in anger: It seems that the HBaseSource must match exactly between the instantiated Job class and the JobTest. The error when this isn't the case is very obscure: a `None.get()` exception deep in SpyGlass internals. Blech. This may or may not explain other test failure issues.
* Added repository to find com.hadoop.gplcompression#hadoop-lzo;0.4.16.  Ellen Spertus  2018-05-22  1  -0/+1
|
* more tests (failing)  Bryan Newbold  2018-05-22  2  -1/+56
|
* update README with invocations  Bryan Newbold  2018-05-21  1  -0/+13
|
* point SimpleHBaseSourceExample to actual zookeeper quorum host  Bryan Newbold  2018-05-21  1  -1/+2
|
* another attempt at a simple job variation  Bryan Newbold  2018-05-21  1  -3/+16
|
* update HBaseRowCountJob based on Simple example  Bryan Newbold  2018-05-21  1  -10/+11
|
* spyglass/hbase test examples (from upstream)  Bryan Newbold  2018-05-21  2  -0/+93
|
* deps updates: cdh libs, hbase, custom spyglass  Bryan Newbold  2018-05-21  2  -3/+7
|
* docs of how to munge around custom spyglass jars  Bryan Newbold  2018-05-21  1  -0/+19
|
* add dependencyTree helper plugin  Bryan Newbold  2018-05-21  2  -1/+2
|
* building (but nullpointer) spyglass integration  Bryan Newbold  2018-05-21  2  -3/+27
|
* more deps locations  Bryan Newbold  2018-05-21  1  -0/+8
|
* gitignore for scalding directory  Bryan Newbold  2018-05-21  1  -0/+3
|
* fix WordCountJob package; tests; hadoop version  Bryan Newbold  2018-05-21  3  -2/+43
|     When copying from upstream scalding, forgot to change the path/namespace of the WordCountJob.
|     Production IA cluster is actually running Hadoop 2.5, not 2.6 (I keep forgetting).
|     Pull in more dependencies so test runs (copied from scalding repo, only changed the namespace of the job)
* WordCount -> WordCountJob  Bryan Newbold  2018-05-21  3  -13/+13
|     Also use the exact file from scalding repo
* success running with com.twitter.scalding.Tool  Bryan Newbold  2018-05-21  2  -4/+11
|
* remove main function; class name same as file  Bryan Newbold  2018-05-21  1  -12/+1
|
* copy in jvm ecosystem notes  Bryan Newbold  2018-05-21  1  -0/+46
|
* copy in scalding learning example  Bryan Newbold  2018-05-21  6  -0/+93
|
* jvm/scala/scalding setup notes  Bryan Newbold  2018-05-17  1  -0/+16
|
* fix tests post-DISTINCT  Bryan Newbold  2018-05-08  5  -25/+30
|     Confirms it's working!
* distinct on SHA1 in cdx scripts  Bryan Newbold  2018-05-08  2  -6/+18
|
* pig cdx join improvements  Bryan Newbold  2018-05-08  1  -1/+1
|
* how to run pig in production  Bryan Newbold  2018-05-08  1  -0/+5
|
* WIP on filter-cdx-join-urls.pig  Bryan Newbold  2018-05-07  1  -0/+37
|
* Merge branch 'master' of git.archive.org:webgroup/sandcrawler  Bryan Newbold  2018-05-08  8  -3/+139
|\
| * stale TODO  Bryan Newbold  2018-05-07  1  -0/+1
| |
| * pig script to filter GWB CDX by SURT regexes  Bryan Newbold  2018-05-07  6  -0/+127
| |
| * improve pig helper  Bryan Newbold  2018-05-07  1  -3/+11
| |
* | actually fix oversize inserts  Bryan Newbold  2018-05-08  1  -7/+7
|/
* XML size limit  Bryan Newbold  2018-04-26  1  -0/+6
|
* force_existing flag for extraction  Bryan Newbold  2018-04-19  1  -1/+5
|
* NLineInputFormat requires RawProtocol  Bryan Newbold  2018-04-19  1  -1/+2
|     Should make this a command line argument or something. Want one in HADOOP, the other for local/tests/inline/etc.
* local mrjob config  Bryan Newbold  2018-04-19  1  -0/+6
|
* switch to new (local) sentry instance  Bryan Newbold  2018-04-18  1  -1/+1
|
* notes on attempted vinay setup  Bryan Newbold  2018-04-18  2  -1/+9
|
* start adding macOS instructions  Bryan Newbold  2018-04-16  1  -0/+4
|
* update Pipfile.lock (new pluggy)  Bryan Newbold  2018-04-16  1  -59/+64
|