As of March 2018, the archive runs Pig version 0.12.0, via CDH5.0.1 (Cloudera
Distribution).

"Local mode" unit tests in this folder run with Pig version 0.17.0 (controlled
by `fetch_deps.sh`) due to [dependency/jar issues][pig-bug] in local mode of
0.12.0.

[pig-bug]: https://issues.apache.org/jira/browse/PIG-3530

## Development and Testing

To run tests, you need Java installed and `JAVA_HOME` configured.
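A quick sanity check for the Java prerequisite can save a confusing Pig stack trace later. The helper below is a minimal sketch, not part of this repository: the name `java_home_problem` is hypothetical, and it assumes a Unix-like JDK layout with the executable at `$JAVA_HOME/bin/java`.

```python
import os


def java_home_problem(env=None):
    """Return a short description of what is wrong with JAVA_HOME, or None if it looks usable.

    Hypothetical helper; assumes the java executable lives at $JAVA_HOME/bin/java.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    java_bin = os.path.join(java_home, "bin", "java")
    if not os.path.isfile(java_bin):
        return f"no java executable at {java_bin}"
    return None
```

Calling `java_home_problem()` with no arguments inspects the current environment; passing a dict makes the check easy to exercise without touching real environment variables.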

Fetch dependencies (including Pig) from the top-level directory:

    ./fetch_hadoop.sh

Write `.pig` scripts in this directory, and add a Python wrapper test to
`./tests/` when done.  Test vector files (input/output) can go in
`./tests/files/`.
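As a rough illustration of what such a wrapper might look like, the sketch below builds a local-mode Pig command line and reads back the result files. It is not the helper actually used in `./tests/`: the function names, the `pig_bin` default, and the assumption that every script takes `INPUT` and `OUTPUT` parameters (as the production command does) are all illustrative.

```python
import subprocess
from pathlib import Path


def build_pig_command(script, params, pig_bin="pig"):
    """Build an argv list that runs `script` in Pig local mode.

    `pig_bin` is an assumption; point it at wherever the fetched Pig lives.
    """
    cmd = [pig_bin, "-x", "local"]
    # Sort parameters so the resulting command line is deterministic.
    for name, value in sorted(params.items()):
        cmd += ["-param", f"{name}={value}"]
    cmd.append(str(script))
    return cmd


def run_pig_local(script, input_path, output_dir, pig_bin="pig"):
    """Run a script in local mode and return the lines Pig stored in output_dir.

    Pig refuses to overwrite an existing output directory, so `output_dir`
    should be a path that does not exist yet.
    """
    params = {"INPUT": input_path, "OUTPUT": output_dir}
    subprocess.run(build_pig_command(script, params, pig_bin), check=True)
    # STORE writes results as part-* files inside the output directory.
    lines = []
    for part in sorted(Path(output_dir).glob("part-*")):
        lines.extend(part.read_text().splitlines())
    return lines
```

A test could then call `run_pig_local` with a script from this directory and an input file from `./tests/files/`, and compare the returned lines against the matching expected-output file.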

Run the tests with:

    pipenv run pytest

In theory you could also use a Docker image ([local-pig][]), but Pig is easy
enough to just download.

[local-pig]: https://hub.docker.com/r/chalimartines/local-pig

## Run in Production

    pig -param INPUT="/user/bnewbold/pdfs/global-20171227034923" \
        -param OUTPUT="/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter" \
        filter-cdx-paper-pdfs.pig