# Notes

As of April 2021, we run Pig 0.12 and Hadoop 2.6.0-cdh5.14.4.

Pig has a local mode for testing and debugging: `pig -x local script.pig`. It
only requires Pig to be installed and `JAVA_HOME` to be set.

Additional jars can be loaded, e.g.

* `/home/webcrawl/pig-scripts/jars/ia-web-commons-jar-with-dependencies-CDH3.jar`
* `/home/webcrawl/pig-scripts/jars/pigtools.jar`

Ops note: cluster load is high until the end of 04/2021, so lookups are on hold.

# 05/2021 Run

```
$ source /home/webcrawl/hadoop-env/prod/setup-env.sh
$ zstdcat -T0 date-2021-05-06.tsv.zst | LC_ALL=C grep ^http > fatcat-refs-urllist-2021-05-06.tsv
$ time gohdfs put fatcat-refs-urllist-2021-05-06.tsv /user/martin # 36s
```

A test run with a single file.

```
$ pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210422171221/part-a-00031.gz -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv -p OUTPUT=/user/martin/fatcat-refs-lookup-0 filter-cdx-join-urls.pig
```

* http://ia802401.us.archive.org:6988/
* http://ia802401.us.archive.org:6988/cluster/app/application_1611217683160_298042
* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_298042/

Running against a single block (about 1/300) of the global CDX took about 15h.

```
2021-05-18 09:15:24,941 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2021-05-18 09:15:30,959 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

real    882m52.192s
user    44m25.048s
sys     5m51.749s
```

How many links did the test run find, and how many are still on the live web?

```
$ gohdfs cat /user/martin/fatcat-refs-lookup-0/part-r-00000 | awk '{ print $3 }' > refs_links_testrun.tsv
$ time cat refs_links_testrun.tsv | clinker -w 128 -verbose > refs_links_liveweb.json
$ wc -l refs_links_liveweb.json
2623 refs_links_liveweb.json
$ jq -rc .status refs_links_liveweb.json | sort | uniq -c | sort -nr 2> /dev/null
   2252 200
    266 403
    154 404
     10 null
```
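The status tally above can also be done in Python, for example when the counts are needed for further processing. This is a minimal sketch, assuming each line of `refs_links_liveweb.json` is a JSON object with a `status` field (as the `jq` call implies). The function name `tally_status` and the sample lines are made up for illustration.

```python
import json
from collections import Counter


def tally_status(lines):
    """Count values of the "status" field over an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        # A missing or null status is counted under None
        counts[doc.get("status")] += 1
    return counts


# Illustrative sample lines, not real output from the run above
sample = ['{"status": 200}', '{"status": 404}', '{"status": 200}', '{"status": null}']
for status, n in tally_status(sample).most_common():
    print(f"{n:7d} {status}")
```

Note that a JSON `null` status shows up as `None` here, whereas `jq` prints `null`.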

Running against a full index.

```
$ time pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210321055100/part-a-*.gz -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv -p OUTPUT=/user/martin/fatcat-refs-lookup-1 filter-cdx-join-urls.pig
```

* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_300026/


----

# Previous Notes (BN)

As of March 2018, the archive runs Pig version 0.12.0, via CDH5.0.1 (Cloudera
Distribution).

"Local mode" unit tests in this folder run with Pig version 0.17.0 (controlled
by `fetch_deps.sh`) due to [dependency/jar issues][pig-bug] in local mode of
0.12.0.

[pig-bug]: https://issues.apache.org/jira/browse/PIG-3530

## Development and Testing

To run tests, you need Java installed and `JAVA_HOME` configured.

Fetch dependencies (including Pig) from the top-level directory:

    ./fetch_hadoop.sh

Write `.pig` scripts in this directory, and add a Python wrapper test to
`./tests/` when done.  Test vector files (input/output) can go in
`./tests/files/`.
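As a rough illustration of what such a wrapper might look like, here is a minimal sketch that builds and runs a local-mode Pig invocation. It assumes `pig` is on the `PATH` and uses the `-x local` and `-param` flags seen elsewhere in this README. The helper names `build_pig_command` and `run_pig_local` are hypothetical; the existing tests in `./tests/` may be structured differently.

```python
import subprocess


def build_pig_command(script, params):
    """Build a local-mode pig invocation as an argument list."""
    cmd = ["pig", "-x", "local"]
    # Sort parameters so the command line is deterministic
    for key, value in sorted(params.items()):
        cmd += ["-param", f"{key}={value}"]
    cmd.append(script)
    return cmd


def run_pig_local(script, params):
    """Run the script in local mode; raises CalledProcessError on failure."""
    return subprocess.run(build_pig_command(script, params), check=True)
```

A pytest test could call `run_pig_local` with an `INPUT` file from `./tests/files/` and a temporary `OUTPUT` directory, then compare the part files Pig writes there against the expected output.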

Run the tests with:

    pipenv run pytest

In theory, a Docker image ([local-pig][]) could also be used, but it's pretty
easy to just download the dependencies.

[local-pig]: https://hub.docker.com/r/chalimartines/local-pig

## Run in Production

    pig -param INPUT="/user/bnewbold/pdfs/global-20171227034923" \
        -param OUTPUT="/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter" \
        filter-cdx-paper-pdfs.pig