| Commit message | Author | Age | Files | Lines |
* | small (syntax?) changes to pig join script | Bryan Newbold | 2020-01-02 | 1 | -2/+2 |
* | pig: first rev of join-cdx-sha1 script | Bryan Newbold | 2019-12-22 | 3 | -0/+91 |
* | pig: move count_lines helper to pighelper.py | Bryan Newbold | 2019-12-22 | 3 | -7/+6 |
* | new/additional GWB CDX filter scripts | Bryan Newbold | 2019-10-17 | 7 | -0/+142 |
* | add ojs and dspace as in-domain patterns to look for in heuristic CDX PDF filter | Bryan Newbold | 2019-04-12 | 1 | -1/+1 |
* | rework fetch_hadoop script | Bryan Newbold | 2018-08-24 | 2 | -24/+5 |
* | commit old tweak to pig script (from cluster) | Bryan Newbold | 2018-07-06 | 1 | -2/+4 |
* | possibly-broken version of hbase-count-rows.pig | Bryan Newbold | 2018-07-06 | 1 | -0/+13 |
* | fix tests post-DISTINCT | Bryan Newbold | 2018-05-08 | 4 | -25/+25 |
* | distinct on SHA1 in cdx scripts | Bryan Newbold | 2018-05-08 | 2 | -6/+18 |
* | pig cdx join improvements | Bryan Newbold | 2018-05-08 | 1 | -1/+1 |
* | how to run pig in production | Bryan Newbold | 2018-05-08 | 1 | -0/+5 |
* | WIP on filter-cdx-join-urls.pig | Bryan Newbold | 2018-05-07 | 1 | -0/+37 |
* | pig script to filter GWB CDX by SURT regexes | Bryan Newbold | 2018-05-07 | 6 | -0/+127 |
* | improve pig helper | Bryan Newbold | 2018-05-07 | 1 | -3/+11 |
* | try pig env again | Bryan Newbold | 2018-04-06 | 1 | -0/+2 |
* | use IA mirror for pig download | Bryan Newbold | 2018-04-06 | 1 | -1/+2 |
* | shift docs around a bit | Bryan Newbold | 2018-04-03 | 1 | -5/+0 |
* | clean up pig test stuff | Bryan Newbold | 2018-03-30 | 6 | -62/+71 |
* | basically working pig test | Bryan Newbold | 2018-03-29 | 5 | -23/+32 |
* | progress on pig tests | Bryan Newbold | 2018-03-29 | 8 | -10/+127 |
* | import WIP on pig test setup | Bryan Newbold | 2018-03-29 | 6 | -0/+156 |