# Pig Notes

As of April 2021, we run Pig 0.12 and Hadoop 2.6.0-cdh5.14.4.

Pig has a local mode for testing and debugging, `pig -x local script.pig`; only Pig needs to be installed and `JAVA_HOME` needs to be set. Additional jars can be loaded, e.g.:

* `/home/webcrawl/pig-scripts/jars/ia-web-commons-jar-with-dependencies-CDH3.jar`
* `/home/webcrawl/pig-scripts/jars/pigtools.jar`

Ops: cluster load was high until the end of 04/2021; lookups were put on hold.

# 05/2021

Prepare the URL list and upload it to HDFS:

```
$ source /home/webcrawl/hadoop-env/prod/setup-env.sh
$ zstdcat -T0 date-2021-05-06.tsv.zst | LC_ALL=C grep ^http > fatcat-refs-urllist-2021-05-06.tsv
$ time gohdfs put fatcat-refs-urllist-2021-05-06.tsv /user/martin  # 36s
```

A test run with a single file:

```
$ pig \
    -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210422171221/part-a-00031.gz \
    -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv \
    -p OUTPUT=/user/martin/fatcat-refs-lookup-0 \
    filter-cdx-join-urls.pig
```

Running against 1/300 of the global CDX index took about 15 hours:

```
2021-05-18 09:15:24,941 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2021-05-18 09:15:30,959 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

real    882m52.192s
user    44m25.048s
sys     5m51.749s
```

How many links did we get, and how many are on the live web?

```
$ gohdfs cat /user/martin/fatcat-refs-lookup-0/part-r-00000 | awk '{ print $3 }' > refs_links_testrun.tsv
$ time cat refs_links_testrun.tsv | clinker -w 128 -verbose > refs_links_liveweb.json

$ wc -l refs_links_liveweb.json
2623 refs_links_liveweb.json

$ jq -rc .status refs_links_liveweb.json | sort | uniq -c | sort -nr 2> /dev/null
   2252 200
    266 403
    154 404
     10 null
```

Running against the full index:
```
$ time pig \
    -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210321055100/part-a-*.gz \
    -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv \
    -p OUTPUT=/user/martin/fatcat-refs-lookup-1 \
    filter-cdx-join-urls.pig
```

* application id was: `application_1611217683160_300026`

The full lookup led to "map spill", since we needed to extract SURTs from the full CDX index; it did not take advantage of zipnum or other possible improvements. Killed the jobs, which required HDFS cleanup afterwards.

----

# Previous Notes (BN)

As of March 2018, the archive runs Pig version 0.12.0, via CDH 5.0.1 (Cloudera Distribution). "Local mode" unit tests in this folder run with Pig version 0.17.0 (controlled by `fetch_deps.sh`) due to [dependency/jar issues][pig-bug] in local mode of 0.12.0.

[pig-bug]: https://issues.apache.org/jira/browse/PIG-3530

## Development and Testing

To run tests, you need Java installed and `JAVA_HOME` configured. Fetch dependencies (including Pig) from the top-level directory:

    ./fetch_hadoop.sh

Write `.pig` scripts in this directory, and add a python wrapper test to `./tests/` when done. Test vector files (input/output) can go in `./tests/files/`. Run the tests with:

    pipenv run pytest

One could also, in theory, use a docker image ([local-pig][]), but it's easy enough to just download Pig.

[local-pig]: https://hub.docker.com/r/chalimartines/local-pig

## Run in Production

    pig -param INPUT="/user/bnewbold/pdfs/global-20171227034923" \
        -param OUTPUT="/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter" \
        filter-cdx-paper-pdfs.pig
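The Pig scripts above are not reproduced here, but the core of a CDX/URL lookup like `filter-cdx-join-urls.pig` is a join between the CDX index (keyed by SURT) and a URL list that must first be converted to SURT form. As an illustration only, here is a minimal Python sketch of that logic; the `surt()` function is a deliberately simplified stand-in for the real canonicalization (which also handles ports, query strings, `www.` stripping, percent-encoding, etc.), and the field layout of the CDX lines is assumed, not taken from the actual index:

```python
from urllib.parse import urlparse


def surt(url: str) -> str:
    """Simplified SURT (Sort-friendly URI Reordering Transform):
    http://example.org/path -> org,example)/path
    Real implementations (e.g. ia-web-commons) canonicalize much more
    aggressively; this sketch only reverses and lowercases the host.
    """
    p = urlparse(url.strip())
    host = ",".join(reversed(p.netloc.lower().split(".")))
    path = p.path or "/"
    return f"{host}){path}"


def join_cdx_with_urls(cdx_lines, url_list):
    """Hash-join sketch: index the (smaller) URL list by SURT key, then
    stream over CDX lines -- whose first whitespace-separated field is
    assumed to be the SURT -- and emit the matching records."""
    wanted = {surt(u) for u in url_list}
    for line in cdx_lines:
        key = line.split(" ", 1)[0]
        if key in wanted:
            yield line
```

In the Hadoop job this is a distributed join rather than an in-memory set lookup, which is why the full-index run above spilled: every CDX record's key had to be materialized before the join could run.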