Commit message  Author  Age  Files  Lines
* more deps locations  Bryan Newbold  2018-05-21  1  -0/+8
|
* gitignore for scalding directory  Bryan Newbold  2018-05-21  1  -0/+3
|
* fix WordCountJob package; tests; hadoop version  Bryan Newbold  2018-05-21  3  -2/+43
|     When copying from upstream scalding, forgot to change the path/namespace of the WordCountJob.
|     Production IA cluster is actually running Hadoop 2.5, not 2.6 (I keep forgetting).
|     Pull in more dependencies so test runs (copied from scalding repo, only changed the namespace of the job)
* WordCount -> WordCountJob  Bryan Newbold  2018-05-21  3  -13/+13
|     Also use the exact file from scalding repo
* success running with com.twitter.scalding.Tool  Bryan Newbold  2018-05-21  2  -4/+11
|
* remove main function; class name same as file  Bryan Newbold  2018-05-21  1  -12/+1
|
* copy in jvm ecosystem notes  Bryan Newbold  2018-05-21  1  -0/+46
|
* copy in scalding learning example  Bryan Newbold  2018-05-21  6  -0/+93
|
* jvm/scala/scalding setup notes  Bryan Newbold  2018-05-17  1  -0/+16
|
* fix tests post-DISTINCT  Bryan Newbold  2018-05-08  5  -25/+30
|     Confirms it's working!
* distinct on SHA1 in cdx scripts  Bryan Newbold  2018-05-08  2  -6/+18
|
* pig cdx join improvements  Bryan Newbold  2018-05-08  1  -1/+1
|
* how to run pig in production  Bryan Newbold  2018-05-08  1  -0/+5
|
* WIP on filter-cdx-join-urls.pig  Bryan Newbold  2018-05-07  1  -0/+37
|
* Merge branch 'master' of git.archive.org:webgroup/sandcrawler  Bryan Newbold  2018-05-08  8  -3/+139
|\
| * stale TODO  Bryan Newbold  2018-05-07  1  -0/+1
| |
| * pig script to filter GWB CDX by SURT regexes  Bryan Newbold  2018-05-07  6  -0/+127
| |
| * improve pig helper  Bryan Newbold  2018-05-07  1  -3/+11
| |
* | actually fix oversize inserts  Bryan Newbold  2018-05-08  1  -7/+7
|/
* XML size limit  Bryan Newbold  2018-04-26  1  -0/+6
|
* force_existing flag for extraction  Bryan Newbold  2018-04-19  1  -1/+5
|
* NLineInputFormat requires RawProtocol  Bryan Newbold  2018-04-19  1  -1/+2
|     Should make this a command line argument or something.
|     Want one in HADOOP, the other for local/tests/inline/etc.
* local mrjob config  Bryan Newbold  2018-04-19  1  -0/+6
|
* switch to new (local) sentry instance  Bryan Newbold  2018-04-18  1  -1/+1
|
* notes on attempted vinay setup  Bryan Newbold  2018-04-18  2  -1/+9
|
* start adding macOS instructions  Bryan Newbold  2018-04-16  1  -0/+4
|
* update Pipfile.lock (new pluggy)  Bryan Newbold  2018-04-16  1  -59/+64
|
* use NLineInputFormat so we can control split size  Bryan Newbold  2018-04-11  1  -0/+1
|
* revert PYTHONPATH in cmdenv  Bryan Newbold  2018-04-11  1  -1/+2
|     Seemed to break hadoop jobs for some reason
* Merge branch 'bnewbold-sentry'  Bryan Newbold  2018-04-10  4  -19/+31
|\
| * prototype sentry integration  Bryan Newbold  2018-04-10  4  -19/+31
| |
* | don't try to decode GROBID output  Bryan Newbold  2018-04-11  1  -2/+2
|/
* partially lint extraction_cdx_grobid.py  Bryan Newbold  2018-04-10  1  -8/+6
|
* yet more test improvements  Bryan Newbold  2018-04-10  2  -9/+61
|
* cleanup tests; add one for double-processing  Bryan Newbold  2018-04-10  2  -20/+43
|
* TODO updates  Bryan Newbold  2018-04-10  3  -18/+3
|
* wayback 404 test  Bryan Newbold  2018-04-10  2  -5/+49
|
* extraction test fixes  Bryan Newbold  2018-04-10  2  -27/+50
|
* grobid2json test fixes  Bryan Newbold  2018-04-10  2  -1/+3
|
* failing tests!  Bryan Newbold  2018-04-10  2  -16/+51
|
* configs and README updates  Bryan Newbold  2018-04-07  4  -5/+27
|
* nits  Bryan Newbold  2018-04-06  2  -1/+2
|
* bug fixes  Bryan Newbold  2018-04-06  1  -7/+14
|
* updates to running  Bryan Newbold  2018-04-06  1  -5/+14
|
* disable pig tests for now  Bryan Newbold  2018-04-06  2  -7/+10
|
* try pig env again  Bryan Newbold  2018-04-06  2  -2/+4
|
* use IA mirror for pig download  Bryan Newbold  2018-04-06  1  -1/+2
|
* lint fixes  Bryan Newbold  2018-04-06  6  -19/+11
|
* fetch deps in pig script  Bryan Newbold  2018-04-06  1  -0/+1
|
* show coverage  Bryan Newbold  2018-04-06  1  -1/+1
|