Diffstat (limited to 'pig/README.md')
-rw-r--r-- pig/README.md | 52
1 files changed, 51 insertions, 1 deletions
diff --git a/pig/README.md b/pig/README.md
index 7e59600..048c10e 100644
--- a/pig/README.md
+++ b/pig/README.md
@@ -1 +1,51 @@
-# README
+# Notes
+
+As of April 2021, we run Pig 0.12 and Hadoop 2.6.0-cdh5.14.4.
+
+Pig has a local mode for testing and debugging (`pig -x local script.pig`); it
+requires only that Pig be installed and `JAVA_HOME` be set.
+
+Additional jars can be loaded, e.g.
+
+* `/home/webcrawl/pig-scripts/jars/ia-web-commons-jar-with-dependencies-CDH3.jar`
+* `/home/webcrawl/pig-scripts/jars/pigtools.jar`
+
+----
+
+# Previous Notes (BN)
+
+As of March 2018, the archive runs Pig version 0.12.0, via CDH5.0.1 (Cloudera
+Distribution).
+
+"Local mode" unit tests in this folder run with Pig version 0.17.0 (controlled
+by `fetch_deps.sh`) due to [dependency/jar issues][pig-bug] in local mode of
+0.12.0.
+
+[pig-bug]: https://issues.apache.org/jira/browse/PIG-3530
+
+## Development and Testing
+
+To run tests, you need Java installed and `JAVA_HOME` configured.
+
+Fetch dependencies (including Pig) from the top-level directory:
+
+    ./fetch_hadoop.sh
+
+Write `.pig` scripts in this directory, and add a Python wrapper test to
+`./tests/` when done. Test vector files (input/output) can go in
+`./tests/files/`.
+
+Run the tests with:
+
+    pipenv run pytest
+
+A Docker image ([local-pig][]) could also be used in theory, but Pig is easy
+enough to download directly.
+
+[local-pig]: https://hub.docker.com/r/chalimartines/local-pig
+
+## Run in Production
+
+    pig -param INPUT="/user/bnewbold/pdfs/global-20171227034923" \
+        -param OUTPUT="/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter" \
+        filter-cdx-paper-pdfs.pig
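
The README in this diff describes two ways of invoking Pig (`pig -x local script.pig` for testing, and `-param NAME=value` pairs in production) and asks for a Python wrapper test per script. As a rough sketch of how such a wrapper might assemble the command line, the helper below builds the argument list for either mode. The function name `build_pig_command` and its parameters are hypothetical and do not come from the repository; only the `-x local` and `-param` options are taken from the README text.

```python
def build_pig_command(script, params=None, local=False):
    """Return the argument list for a pig invocation.

    Hypothetical helper for illustration; not part of the repository.
    """
    cmd = ["pig"]
    if local:
        # Local mode, as in `pig -x local script.pig`
        cmd += ["-x", "local"]
    # Each parameter becomes a `-param NAME=value` pair, in insertion order
    for name, value in (params or {}).items():
        cmd += ["-param", f"{name}={value}"]
    cmd.append(script)
    return cmd
```

A wrapper test could pass the resulting list to `subprocess.run(cmd, check=True)` and then compare the output directory against the expected files in `./tests/files/`. Passing a list instead of a shell string avoids quoting problems when paths contain unusual characters.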