# Notes

As of April 2021, we run Pig 0.12 and Hadoop 2.6.0-cdh5.14.4.

Pig has a local mode for testing and debugging (`pig -x local script.pig`);
only Pig needs to be installed and `JAVA_HOME` set.

Additional jars can be loaded, e.g.

* `/home/webcrawl/pig-scripts/jars/ia-web-commons-jar-with-dependencies-CDH3.jar`
* `/home/webcrawl/pig-scripts/jars/pigtools.jar`

Ops: Cluster load is high until the end of 04/2021; putting lookups on hold.

----

# Previous Notes (BN)

As of March 2018, the archive runs Pig version 0.12.0, via CDH5.0.1 (Cloudera
Distribution).

"Local mode" unit tests in this folder run with Pig version 0.17.0 (controlled
by `fetch_deps.sh`) due to [dependency/jar issues][pig-bug] in the local mode
of 0.12.0.

[pig-bug]: https://issues.apache.org/jira/browse/PIG-3530

## Development and Testing

To run tests, you need Java installed and `JAVA_HOME` configured.

Fetch dependencies (including Pig) from the top-level directory:

    ./fetch_hadoop.sh

Write `.pig` scripts in this directory, and add a Python wrapper test to
`./tests/` when done. Test vector files (input/output) can go in
`./tests/files/`.

Run the tests with:

    pipenv run pytest

You could also, in theory, use a Docker image ([local-pig][]), but it's pretty
easy to just download Pig.

[local-pig]: https://hub.docker.com/r/chalimartines/local-pig

## Run in Production

    pig -param INPUT="/user/bnewbold/pdfs/global-20171227034923" \
        -param OUTPUT="/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter" \
        filter-cdx-paper-pdfs.pig
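
The `pig -x local` and `pig -param ...` invocations in these notes could be assembled from Python, which may be handy when writing a wrapper test. This is only a minimal sketch, not the existing test harness: the helper names `build_pig_command` and `run_pig`, and the example input/output paths, are hypothetical. Only the `-x local` and `-param NAME=VALUE` flags are taken from the commands shown in this document.

```python
import shlex
import subprocess


def build_pig_command(script, params=None, local=False):
    """Return the argv list for a Pig run; nothing is executed here."""
    cmd = ["pig"]
    if local:
        # Local mode, as in `pig -x local script.pig`.
        cmd += ["-x", "local"]
    for name, value in (params or {}).items():
        # Each parameter becomes `-param NAME=VALUE`.
        cmd += ["-param", f"{name}={value}"]
    cmd.append(script)
    return cmd


def run_pig(script, params=None, local=False):
    """Run Pig; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_pig_command(script, params, local), check=True)


if __name__ == "__main__":
    cmd = build_pig_command(
        "filter-cdx-paper-pdfs.pig",
        {
            "INPUT": "tests/files/example.cdx",  # hypothetical test vector
            "OUTPUT": "/tmp/pig-test-output",    # hypothetical scratch dir
        },
        local=True,
    )
    # Print the shell-quoted command line instead of running it.
    print(shlex.join(cmd))
```

Keeping command construction separate from execution means a test can assert on the argument list without Pig installed, and call `run_pig` only when Pig and `JAVA_HOME` are available.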