Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | gitignore for scalding directory | Bryan Newbold | 2018-05-21 | 1 | -0/+3 |
| | |||||
* | fix WordCountJob package; tests; hadoop version | Bryan Newbold | 2018-05-21 | 3 | -2/+43 |
| | When copying from upstream scalding, forgot to change the path/namespace of the WordCountJob. Production IA cluster is actually running Hadoop 2.5, not 2.6 (I keep forgetting). Pull in more dependencies so test runs (copied from scalding repo, only changed the namespace of the job) | ||||
* | WordCount -> WordCountJob | Bryan Newbold | 2018-05-21 | 3 | -13/+13 |
| | Also use the exact file from scalding repo | ||||
* | success running with com.twitter.scalding.Tool | Bryan Newbold | 2018-05-21 | 2 | -4/+11 |
| | |||||
* | remove main function; class name same as file | Bryan Newbold | 2018-05-21 | 1 | -12/+1 |
| | |||||
* | copy in jvm ecosystem notes | Bryan Newbold | 2018-05-21 | 1 | -0/+46 |
| | |||||
* | copy in scalding learning example | Bryan Newbold | 2018-05-21 | 6 | -0/+93 |
| | |||||
* | jvm/scala/scalding setup notes | Bryan Newbold | 2018-05-17 | 1 | -0/+16 |
| | |||||
* | fix tests post-DISTINCT | Bryan Newbold | 2018-05-08 | 5 | -25/+30 |
| | Confirms it's working! | ||||
* | distinct on SHA1 in cdx scripts | Bryan Newbold | 2018-05-08 | 2 | -6/+18 |
| | |||||
* | pig cdx join improvements | Bryan Newbold | 2018-05-08 | 1 | -1/+1 |
| | |||||
* | how to run pig in production | Bryan Newbold | 2018-05-08 | 1 | -0/+5 |
| | |||||
* | WIP on filter-cdx-join-urls.pig | Bryan Newbold | 2018-05-07 | 1 | -0/+37 |
| | |||||
* | Merge branch 'master' of git.archive.org:webgroup/sandcrawler | Bryan Newbold | 2018-05-08 | 8 | -3/+139 |
|\ | |||||
| * | stale TODO | Bryan Newbold | 2018-05-07 | 1 | -0/+1 |
| | | |||||
| * | pig script to filter GWB CDX by SURT regexes | Bryan Newbold | 2018-05-07 | 6 | -0/+127 |
| | | |||||
| * | improve pig helper | Bryan Newbold | 2018-05-07 | 1 | -3/+11 |
| | | |||||
* | | actually fix oversize inserts | Bryan Newbold | 2018-05-08 | 1 | -7/+7 |
|/ | |||||
* | XML size limit | Bryan Newbold | 2018-04-26 | 1 | -0/+6 |
| | |||||
* | force_existing flag for extraction | Bryan Newbold | 2018-04-19 | 1 | -1/+5 |
| | |||||
* | NLineInputFormat requires RawProtocol | Bryan Newbold | 2018-04-19 | 1 | -1/+2 |
| | Should make this a command line argument or something. Want one in HADOOP, the other for local/tests/inline/etc. | ||||
* | local mrjob config | Bryan Newbold | 2018-04-19 | 1 | -0/+6 |
| | |||||
* | switch to new (local) sentry instance | Bryan Newbold | 2018-04-18 | 1 | -1/+1 |
| | |||||
* | notes on attempted vinay setup | Bryan Newbold | 2018-04-18 | 2 | -1/+9 |
| | |||||
* | start adding macOS instructions | Bryan Newbold | 2018-04-16 | 1 | -0/+4 |
| | |||||
* | update Pipfile.lock (new pluggy) | Bryan Newbold | 2018-04-16 | 1 | -59/+64 |
| | |||||
* | use NLineInputFormat so we can control split size | Bryan Newbold | 2018-04-11 | 1 | -0/+1 |
| | |||||
* | revert PYTHONPATH in cmdenv | Bryan Newbold | 2018-04-11 | 1 | -1/+2 |
| | Seemed to break hadoop jobs for some reason | ||||
* | Merge branch 'bnewbold-sentry' | Bryan Newbold | 2018-04-10 | 4 | -19/+31 |
|\ | |||||
| * | prototype sentry integration | Bryan Newbold | 2018-04-10 | 4 | -19/+31 |
| | | |||||
* | | don't try to decode GROBID output | Bryan Newbold | 2018-04-11 | 1 | -2/+2 |
|/ | |||||
* | partially lint extraction_cdx_grobid.py | Bryan Newbold | 2018-04-10 | 1 | -8/+6 |
| | |||||
* | yet more test improvements | Bryan Newbold | 2018-04-10 | 2 | -9/+61 |
| | |||||
* | cleanup tests; add one for double-processing | Bryan Newbold | 2018-04-10 | 2 | -20/+43 |
| | |||||
* | TODO updates | Bryan Newbold | 2018-04-10 | 3 | -18/+3 |
| | |||||
* | wayback 404 test | Bryan Newbold | 2018-04-10 | 2 | -5/+49 |
| | |||||
* | extraction test fixes | Bryan Newbold | 2018-04-10 | 2 | -27/+50 |
| | |||||
* | grobid2json test fixes | Bryan Newbold | 2018-04-10 | 2 | -1/+3 |
| | |||||
* | failing tests! | Bryan Newbold | 2018-04-10 | 2 | -16/+51 |
| | |||||
* | configs and README updates | Bryan Newbold | 2018-04-07 | 4 | -5/+27 |
| | |||||
* | nits | Bryan Newbold | 2018-04-06 | 2 | -1/+2 |
| | |||||
* | bug fixes | Bryan Newbold | 2018-04-06 | 1 | -7/+14 |
| | |||||
* | updates to running | Bryan Newbold | 2018-04-06 | 1 | -5/+14 |
| | |||||
* | disable pig tests for now | Bryan Newbold | 2018-04-06 | 2 | -7/+10 |
| | |||||
* | try pig env again | Bryan Newbold | 2018-04-06 | 2 | -2/+4 |
| | |||||
* | use IA mirror for pig download | Bryan Newbold | 2018-04-06 | 1 | -1/+2 |
| | |||||
* | lint fixes | Bryan Newbold | 2018-04-06 | 6 | -19/+11 |
| | |||||
* | fetch deps in pig script | Bryan Newbold | 2018-04-06 | 1 | -0/+1 |
| | |||||
* | show coverage | Bryan Newbold | 2018-04-06 | 1 | -1/+1 |
| | |||||
* | renamed do_tei | Bryan Newbold | 2018-04-06 | 1 | -3/+3 |
| |