path: root/pig
Commit message | Author | Age | Files | Lines
* small (syntax?) changes to pig join script | Bryan Newbold | 2020-01-02 | 1 | -2/+2
* pig: first rev of join-cdx-sha1 script | Bryan Newbold | 2019-12-22 | 3 | -0/+91
* pig: move count_lines helper to pighelper.py | Bryan Newbold | 2019-12-22 | 3 | -7/+6
* new/additional GWB CDX filter scripts | Bryan Newbold | 2019-10-17 | 7 | -0/+142
* add ojs and dspace as in-domain patterns to look for in heuristic CDX PDF filter | Bryan Newbold | 2019-04-12 | 1 | -1/+1
* rework fetch_hadoop script | Bryan Newbold | 2018-08-24 | 2 | -24/+5
    Should work on macOS now, and fetches hadoop in addition to pig. Still requires wget (not installed by default on macOS).
* commit old tweak to pig script (from cluster) | Bryan Newbold | 2018-07-06 | 1 | -2/+4
* possibly-broken version of hbase-count-rows.pig | Bryan Newbold | 2018-07-06 | 1 | -0/+13
    This just worked a minute ago, but now throws: org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/util/ByteStringer
* fix tests post-DISTINCT | Bryan Newbold | 2018-05-08 | 4 | -25/+25
    Confirms it's working!
* distinct on SHA1 in cdx scripts | Bryan Newbold | 2018-05-08 | 2 | -6/+18
* pig cdx join improvements | Bryan Newbold | 2018-05-08 | 1 | -1/+1
* how to run pig in production | Bryan Newbold | 2018-05-08 | 1 | -0/+5
* WIP on filter-cdx-join-urls.pig | Bryan Newbold | 2018-05-07 | 1 | -0/+37
* pig script to filter GWB CDX by SURT regexes | Bryan Newbold | 2018-05-07 | 6 | -0/+127
* improve pig helper | Bryan Newbold | 2018-05-07 | 1 | -3/+11
* try pig env again | Bryan Newbold | 2018-04-06 | 1 | -0/+2
* use IA mirror for pig download | Bryan Newbold | 2018-04-06 | 1 | -1/+2
* shift docs around a bit | Bryan Newbold | 2018-04-03 | 1 | -5/+0
* clean up pig test stuff | Bryan Newbold | 2018-03-30 | 6 | -62/+71
* basically working pig test | Bryan Newbold | 2018-03-29 | 5 | -23/+32
* progress on pig tests | Bryan Newbold | 2018-03-29 | 8 | -10/+127
* import WIP on pig test setup | Bryan Newbold | 2018-03-29 | 6 | -0/+156