* small (syntax?) changes to pig join script (Bryan Newbold, 2020-01-02; 1 file, -2/+2)
* pig: first rev of join-cdx-sha1 script (Bryan Newbold, 2019-12-22; 3 files, -0/+91)
* pig: move count_lines helper to pighelper.py (Bryan Newbold, 2019-12-22; 3 files, -7/+6)
* new/additional GWB CDX filter scripts (Bryan Newbold, 2019-10-17; 7 files, -0/+142)
* add ojs and dspace as in-domain patterns to look for in heuristic CDX PDF filter (Bryan Newbold, 2019-04-12; 1 file, -1/+1)
* rework fetch_hadoop script (Bryan Newbold, 2018-08-24; 2 files, -24/+5)
* commit old tweak to pig script (from cluster) (Bryan Newbold, 2018-07-06; 1 file, -2/+4)
* possibly-broken version of hbase-count-rows.pig (Bryan Newbold, 2018-07-06; 1 file, -0/+13)
* fix tests post-DISTINCT (Bryan Newbold, 2018-05-08; 4 files, -25/+25)
* distinct on SHA1 in cdx scripts (Bryan Newbold, 2018-05-08; 2 files, -6/+18)
* pig cdx join improvements (Bryan Newbold, 2018-05-08; 1 file, -1/+1)
* how to run pig in production (Bryan Newbold, 2018-05-08; 1 file, -0/+5)
* WIP on filter-cdx-join-urls.pig (Bryan Newbold, 2018-05-07; 1 file, -0/+37)
* pig script to filter GWB CDX by SURT regexes (Bryan Newbold, 2018-05-07; 6 files, -0/+127)
* improve pig helper (Bryan Newbold, 2018-05-07; 1 file, -3/+11)
* try pig env again (Bryan Newbold, 2018-04-06; 1 file, -0/+2)
* use IA mirror for pig download (Bryan Newbold, 2018-04-06; 1 file, -1/+2)
* shift docs around a bit (Bryan Newbold, 2018-04-03; 1 file, -5/+0)
* clean up pig test stuff (Bryan Newbold, 2018-03-30; 6 files, -62/+71)
* basically working pig test (Bryan Newbold, 2018-03-29; 5 files, -23/+32)
* progress on pig tests (Bryan Newbold, 2018-03-29; 8 files, -10/+127)
* import WIP on pig test setup (Bryan Newbold, 2018-03-29; 6 files, -0/+156)