Commit message [Author, Age, Files, Lines]
* WordCount -> WordCountJob [Bryan Newbold, 2018-05-21, 3, -13/+13]
|     Also use the exact file from scalding repo
* success running with com.twitter.scalding.Tool [Bryan Newbold, 2018-05-21, 2, -4/+11]
|
* remove main function; class name same as file [Bryan Newbold, 2018-05-21, 1, -12/+1]
|
* copy in jvm ecosystem notes [Bryan Newbold, 2018-05-21, 1, -0/+46]
|
* copy in scalding learning example [Bryan Newbold, 2018-05-21, 6, -0/+93]
|
* jvm/scala/scalding setup notes [Bryan Newbold, 2018-05-17, 1, -0/+16]
|
* fix tests post-DISTINCT [Bryan Newbold, 2018-05-08, 5, -25/+30]
|     Confirms it's working!
* distinct on SHA1 in cdx scripts [Bryan Newbold, 2018-05-08, 2, -6/+18]
|
* pig cdx join improvements [Bryan Newbold, 2018-05-08, 1, -1/+1]
|
* how to run pig in production [Bryan Newbold, 2018-05-08, 1, -0/+5]
|
* WIP on filter-cdx-join-urls.pig [Bryan Newbold, 2018-05-07, 1, -0/+37]
|
* Merge branch 'master' of git.archive.org:webgroup/sandcrawler [Bryan Newbold, 2018-05-08, 8, -3/+139]
|\
| * stale TODO [Bryan Newbold, 2018-05-07, 1, -0/+1]
| |
| * pig script to filter GWB CDX by SURT regexes [Bryan Newbold, 2018-05-07, 6, -0/+127]
| |
| * improve pig helper [Bryan Newbold, 2018-05-07, 1, -3/+11]
| |
* | actually fix oversize inserts [Bryan Newbold, 2018-05-08, 1, -7/+7]
|/
* XML size limit [Bryan Newbold, 2018-04-26, 1, -0/+6]
|
* force_existing flag for extraction [Bryan Newbold, 2018-04-19, 1, -1/+5]
|
* NLineInputFormat requires RawProtocol [Bryan Newbold, 2018-04-19, 1, -1/+2]
|     Should make this a command line argument or something. Want one in HADOOP, the other for local/tests/inline/etc.
* local mrjob config [Bryan Newbold, 2018-04-19, 1, -0/+6]
|
* switch to new (local) sentry instance [Bryan Newbold, 2018-04-18, 1, -1/+1]
|
* notes on attempted vinay setup [Bryan Newbold, 2018-04-18, 2, -1/+9]
|
* start adding macOS instructions [Bryan Newbold, 2018-04-16, 1, -0/+4]
|
* update Pipfile.lock (new pluggy) [Bryan Newbold, 2018-04-16, 1, -59/+64]
|
* use NLineInputFormat so we can control split size [Bryan Newbold, 2018-04-11, 1, -0/+1]
|
* revert PYTHONPATH in cmdenv [Bryan Newbold, 2018-04-11, 1, -1/+2]
|     Seemed to break hadoop jobs for some reason
* Merge branch 'bnewbold-sentry' [Bryan Newbold, 2018-04-10, 4, -19/+31]
|\
| * prototype sentry integration [Bryan Newbold, 2018-04-10, 4, -19/+31]
| |
* | don't try to decode GROBID output [Bryan Newbold, 2018-04-11, 1, -2/+2]
|/
* partially lint extraction_cdx_grobid.py [Bryan Newbold, 2018-04-10, 1, -8/+6]
|
* yet more test improvements [Bryan Newbold, 2018-04-10, 2, -9/+61]
|
* cleanup tests; add one for double-processing [Bryan Newbold, 2018-04-10, 2, -20/+43]
|
* TODO updates [Bryan Newbold, 2018-04-10, 3, -18/+3]
|
* wayback 404 test [Bryan Newbold, 2018-04-10, 2, -5/+49]
|
* extraction test fixes [Bryan Newbold, 2018-04-10, 2, -27/+50]
|
* grobid2json test fixes [Bryan Newbold, 2018-04-10, 2, -1/+3]
|
* failing tests! [Bryan Newbold, 2018-04-10, 2, -16/+51]
|
* configs and README updates [Bryan Newbold, 2018-04-07, 4, -5/+27]
|
* nits [Bryan Newbold, 2018-04-06, 2, -1/+2]
|
* bug fixes [Bryan Newbold, 2018-04-06, 1, -7/+14]
|
* updates to running [Bryan Newbold, 2018-04-06, 1, -5/+14]
|
* disable pig tests for now [Bryan Newbold, 2018-04-06, 2, -7/+10]
|
* try pig env again [Bryan Newbold, 2018-04-06, 2, -2/+4]
|
* use IA mirror for pig download [Bryan Newbold, 2018-04-06, 1, -1/+2]
|
* lint fixes [Bryan Newbold, 2018-04-06, 6, -19/+11]
|
* fetch deps in pig script [Bryan Newbold, 2018-04-06, 1, -0/+1]
|
* show coverage [Bryan Newbold, 2018-04-06, 1, -1/+1]
|
* renamed do_tei [Bryan Newbold, 2018-04-06, 1, -3/+3]
|
* switch to newer test image [Bryan Newbold, 2018-04-06, 1, -1/+1]
|
* temporarily skip pylint on extraction [Bryan Newbold, 2018-04-06, 1, -0/+3]
|