# Notes

As of April 2021, we run Pig 0.12 and Hadoop 2.6.0-cdh5.14.4.

Pig has a local mode for testing and debugging (`pig -x local script.pig`). It
only requires that Pig is installed and `JAVA_HOME` is set.
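
The two prerequisites above can be checked before launching a local run. The
following is a minimal sketch, not part of this repository: the function name
`missing_local_mode_prereqs` is hypothetical, and it assumes the `pig`
launcher is expected on `PATH`.

```python
import os
import shutil


def missing_local_mode_prereqs(environ=None, which=shutil.which):
    """Return a list of problems; an empty list means local mode should be runnable.

    `environ` and `which` are injectable so the check can be exercised
    without touching the real environment (an assumption of this sketch).
    """
    environ = os.environ if environ is None else environ
    problems = []
    if not environ.get("JAVA_HOME"):
        problems.append("JAVA_HOME is not set")
    if which("pig") is None:
        problems.append("pig executable not found on PATH")
    return problems


if __name__ == "__main__":
    # Print one line per missing prerequisite; prints nothing if all is well.
    for problem in missing_local_mode_prereqs():
        print(problem)
```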

Additional jars can be loaded, e.g.

* `/home/webcrawl/pig-scripts/jars/ia-web-commons-jar-with-dependencies-CDH3.jar`
* `/home/webcrawl/pig-scripts/jars/pigtools.jar`
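
These notes do not say how the jars are loaded. Two common options are a
`REGISTER` statement inside the script or the `pig.additional.jars` property
on the command line. As a sketch of the second option (the helper name
`additional_jars_option` and the `script.pig` placeholder are assumptions):

```python
def additional_jars_option(jar_paths):
    """Build the -Dpig.additional.jars=... argument from a list of jar paths."""
    if not jar_paths:
        raise ValueError("need at least one jar path")
    # Pig reads this property as a colon-separated list of jar paths.
    return "-Dpig.additional.jars=" + ":".join(jar_paths)


JARS = [
    "/home/webcrawl/pig-scripts/jars/ia-web-commons-jar-with-dependencies-CDH3.jar",
    "/home/webcrawl/pig-scripts/jars/pigtools.jar",
]

# Argument list suitable for subprocess.run(); `script.pig` is a placeholder.
command = ["pig", additional_jars_option(JARS), "script.pig"]
```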

Ops: Cluster load is high until the end of 04/2021, so lookups are on hold.

----

# Previous Notes (BN)

As of March 2018, the archive runs Pig version 0.12.0, via CDH5.0.1 (Cloudera
Distribution).

"Local mode" unit tests in this folder run with Pig version 0.17.0 (controlled
by `fetch_deps.sh`) due to [dependency/jar issues][pig-bug] in local mode of
0.12.0.

[pig-bug]: https://issues.apache.org/jira/browse/PIG-3530

## Development and Testing

To run tests, you need Java installed and `JAVA_HOME` configured.

Fetch dependencies (including Pig) from the top-level directory:

    ./fetch_hadoop.sh

Write `.pig` scripts in this directory, and add a Python wrapper test to
`./tests/` when done. Test vector files (input/output) can go in
`./tests/files/`.

Run the tests with:

    pipenv run pytest

In theory, you could also use a Docker image ([local-pig][]), but it's pretty
easy to just download the dependencies.

[local-pig]: https://hub.docker.com/r/chalimartines/local-pig

## Run in Production

    pig -param INPUT="/user/bnewbold/pdfs/global-20171227034923" \
        -param OUTPUT="/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter" \
        filter-cdx-paper-pdfs.pig