Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | distinct on SHA1 in cdx scripts | Bryan Newbold | 2018-05-08 | 2 | -6/+18 | |
| | ||||||
* | pig cdx join improvements | Bryan Newbold | 2018-05-08 | 1 | -1/+1 | |
| | ||||||
* | how to run pig in production | Bryan Newbold | 2018-05-08 | 1 | -0/+5 | |
| | ||||||
* | WIP on filter-cdx-join-urls.pig | Bryan Newbold | 2018-05-07 | 1 | -0/+37 | |
| | ||||||
* | Merge branch 'master' of git.archive.org:webgroup/sandcrawler | Bryan Newbold | 2018-05-08 | 8 | -3/+139 | |
|\ | ||||||
| * | stale TODO | Bryan Newbold | 2018-05-07 | 1 | -0/+1 | |
| | | ||||||
| * | pig script to filter GWB CDX by SURT regexes | Bryan Newbold | 2018-05-07 | 6 | -0/+127 | |
| | | ||||||
| * | improve pig helper | Bryan Newbold | 2018-05-07 | 1 | -3/+11 | |
| | | ||||||
* | | actually fix oversize inserts | Bryan Newbold | 2018-05-08 | 1 | -7/+7 | |
|/ | ||||||
* | XML size limit | Bryan Newbold | 2018-04-26 | 1 | -0/+6 | |
| | ||||||
* | force_existing flag for extraction | Bryan Newbold | 2018-04-19 | 1 | -1/+5 | |
| | ||||||
* | NLineInputFormat requires RawProtocol | Bryan Newbold | 2018-04-19 | 1 | -1/+2 | |
| | Should make this a command line argument or something. Want one in HADOOP, the other for local/tests/inline/etc. | ||||||
* | local mrjob config | Bryan Newbold | 2018-04-19 | 1 | -0/+6 | |
| | ||||||
* | switch to new (local) sentry instance | Bryan Newbold | 2018-04-18 | 1 | -1/+1 | |
| | ||||||
* | notes on attempted vinay setup | Bryan Newbold | 2018-04-18 | 2 | -1/+9 | |
| | ||||||
* | start adding macOS instructions | Bryan Newbold | 2018-04-16 | 1 | -0/+4 | |
| | ||||||
* | update Pipfile.lock (new pluggy) | Bryan Newbold | 2018-04-16 | 1 | -59/+64 | |
| | ||||||
* | use NLineInputFormat so we can control split size | Bryan Newbold | 2018-04-11 | 1 | -0/+1 | |
| | ||||||
* | revert PYTHONPATH in cmdenv | Bryan Newbold | 2018-04-11 | 1 | -1/+2 | |
| | Seemed to break hadoop jobs for some reason | ||||||
* | Merge branch 'bnewbold-sentry' | Bryan Newbold | 2018-04-10 | 4 | -19/+31 | |
|\ | ||||||
| * | prototype sentry integration | Bryan Newbold | 2018-04-10 | 4 | -19/+31 | |
| | | ||||||
* | | don't try to decode GROBID output | Bryan Newbold | 2018-04-11 | 1 | -2/+2 | |
|/ | ||||||
* | partially lint extraction_cdx_grobid.py | Bryan Newbold | 2018-04-10 | 1 | -8/+6 | |
| | ||||||
* | yet more test improvements | Bryan Newbold | 2018-04-10 | 2 | -9/+61 | |
| | ||||||
* | cleanup tests; add one for double-processing | Bryan Newbold | 2018-04-10 | 2 | -20/+43 | |
| | ||||||
* | TODO updates | Bryan Newbold | 2018-04-10 | 3 | -18/+3 | |
| | ||||||
* | wayback 404 test | Bryan Newbold | 2018-04-10 | 2 | -5/+49 | |
| | ||||||
* | extraction test fixes | Bryan Newbold | 2018-04-10 | 2 | -27/+50 | |
| | ||||||
* | grobid2json test fixes | Bryan Newbold | 2018-04-10 | 2 | -1/+3 | |
| | ||||||
* | failing tests! | Bryan Newbold | 2018-04-10 | 2 | -16/+51 | |
| | ||||||
* | configs and README updates | Bryan Newbold | 2018-04-07 | 4 | -5/+27 | |
| | ||||||
* | nits | Bryan Newbold | 2018-04-06 | 2 | -1/+2 | |
| | ||||||
* | bug fixes | Bryan Newbold | 2018-04-06 | 1 | -7/+14 | |
| | ||||||
* | updates to running | Bryan Newbold | 2018-04-06 | 1 | -5/+14 | |
| | ||||||
* | disable pig tests for now | Bryan Newbold | 2018-04-06 | 2 | -7/+10 | |
| | ||||||
* | try pig env again | Bryan Newbold | 2018-04-06 | 2 | -2/+4 | |
| | ||||||
* | use IA mirror for pig download | Bryan Newbold | 2018-04-06 | 1 | -1/+2 | |
| | ||||||
* | lint fixes | Bryan Newbold | 2018-04-06 | 6 | -19/+11 | |
| | ||||||
* | fetch deps in pig script | Bryan Newbold | 2018-04-06 | 1 | -0/+1 | |
| | ||||||
* | show coverage | Bryan Newbold | 2018-04-06 | 1 | -1/+1 | |
| | ||||||
* | renamed do_tei | Bryan Newbold | 2018-04-06 | 1 | -3/+3 | |
| | ||||||
* | switch to newer test image | Bryan Newbold | 2018-04-06 | 1 | -1/+1 | |
| | ||||||
* | temporarily skip pylint on extraction | Bryan Newbold | 2018-04-06 | 1 | -0/+3 | |
| | ||||||
* | add pylint to CI | Bryan Newbold | 2018-04-06 | 5 | -41/+123 | |
| | ||||||
* | iterate gitlab-ci.yml | Bryan Newbold | 2018-04-06 | 1 | -3/+5 | |
| | ||||||
* | add test for grobid2json | Bryan Newbold | 2018-04-06 | 1 | -0/+14 | |
| | ||||||
* | coverage defaults | Bryan Newbold | 2018-04-06 | 1 | -0/+3 | |
| | ||||||
* | gitlab test script | Bryan Newbold | 2018-04-06 | 2 | -2/+20 | |
| | ||||||
* | small grobid2json test | Bryan Newbold | 2018-04-06 | 4 | -2/+164 | |
| | ||||||
* | make happybase mock injection slightly less horrible | Bryan Newbold | 2018-04-05 | 4 | -36/+31 | |
| |