# Notes
As of April 2021, we run Pig 0.12 and Hadoop 2.6.0-cdh5.14.4.
Pig has a local mode for testing and debugging (`pig -x local script.pig`);
it only requires Pig to be installed and `JAVA_HOME` to be set.
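As a quick local-mode smoke test, one can write a throwaway script and run it with `pig -x local`. The script below is a made-up example, not one of the production scripts; the actual invocation is commented out since it assumes `pig` is on `PATH`.

```shell
# Write a tiny throwaway Pig script (contents are illustrative only).
cat > /tmp/smoke.pig <<'EOF'
urls = LOAD 'input.tsv' AS (url:chararray);
ok = FILTER urls BY url MATCHES 'http.*';
DUMP ok;
EOF
# Then run it locally (assumes pig on PATH and JAVA_HOME set):
# pig -x local /tmp/smoke.pig
```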
Additional jars can be loaded, e.g.
* `/home/webcrawl/pig-scripts/jars/ia-web-commons-jar-with-dependencies-CDH3.jar`
* `/home/webcrawl/pig-scripts/jars/pigtools.jar`
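One way to wire the jars above into a run is Pig's standard `pig.additional.jars` property, which takes a colon-separated list; the sketch below only builds that list (the commented-out invocation needs the cluster environment).

```shell
# Build the colon-separated jar list for pig.additional.jars
# (paths taken from the list above; script name is illustrative).
JARS=/home/webcrawl/pig-scripts/jars
ADDITIONAL_JARS="$JARS/ia-web-commons-jar-with-dependencies-CDH3.jar:$JARS/pigtools.jar"
# pig -Dpig.additional.jars="$ADDITIONAL_JARS" script.pig
echo "$ADDITIONAL_JARS"
```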
Ops: Cluster load high until end of 04/2021; putting lookups on hold.
# 05/2021 Run
```
$ source /home/webcrawl/hadoop-env/prod/setup-env.sh
$ zstdcat -T0 date-2021-05-06.tsv.zst | LC_ALL=C grep ^http > fatcat-refs-urllist-2021-05-06.tsv
$ time gohdfs put fatcat-refs-urllist-2021-05-06.tsv /user/martin # 36s
```
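To illustrate what the `LC_ALL=C grep ^http` step above does, here is the same filter on a few made-up rows: it keeps http and https URLs and drops everything else.

```shell
# Toy version of the URL-list filter above (sample rows are invented).
printf '%s\n' \
  'http://example.com/paper.pdf' \
  'ftp://example.org/old.pdf' \
  'https://example.net/ref' \
  | LC_ALL=C grep '^http' > /tmp/urls-filtered.tsv
cat /tmp/urls-filtered.tsv
```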
A test run against a single CDX part file:
```
$ pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210422171221/part-a-00031.gz \
    -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv \
    -p OUTPUT=/user/martin/fatcat-refs-lookup-0 \
    filter-cdx-join-urls.pig
```
Job tracker links for the test run:
* http://ia802401.us.archive.org:6988/
* http://ia802401.us.archive.org:6988/cluster/app/application_1611217683160_298042
* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_298042/
Running against a single part file (roughly 1/300 of the global CDX) took about 15h.
```
2021-05-18 09:15:24,941 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2021-05-18 09:15:30,959 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
real 882m52.192s
user 44m25.048s
sys 5m51.749s
```
How many links did the join produce, and how many of them are on the live web?
```
$ gohdfs cat /user/martin/fatcat-refs-lookup-0/part-r-00000 | awk '{ print $3 }' > refs_links_testrun.tsv
$ time cat refs_links_testrun.tsv | clinker -w 128 -verbose > refs_links_liveweb.json
$ wc -l refs_links_liveweb.json
2623 refs_links_liveweb.json
$ jq -rc .status refs_links_liveweb.json | sort | uniq -c | sort -nr 2> /dev/null
2252 200
266 403
154 404
10 null
```
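The status tally above can be sketched on toy data; here sed stands in for `jq -rc .status` so the example runs without jq installed, assuming one JSON object per line with a simple `"status"` field.

```shell
# Tally HTTP status codes from line-delimited JSON (toy data).
tally=$(printf '%s\n' \
  '{"url":"http://a","status":200}' \
  '{"url":"http://b","status":200}' \
  '{"url":"http://c","status":404}' \
  | sed -n 's/.*"status":\([0-9a-z]*\).*/\1/p' \
  | sort | uniq -c | sort -nr)
echo "$tally"
```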
Running against a full index:
```
$ time pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210321055100/part-a-*.gz \
    -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv \
    -p OUTPUT=/user/martin/fatcat-refs-lookup-1 \
    filter-cdx-join-urls.pig
```
* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_300026/
----
# Previous Notes (BN)
As of March 2018, the archive runs Pig version 0.12.0, via CDH5.0.1 (Cloudera
Distribution).
"Local mode" unit tests in this folder run with Pig version 0.17.0 (controlled
by `fetch_deps.sh`) due to [dependency/jar issues][pig-bug] in local mode of
0.12.0.
[pig-bug]: https://issues.apache.org/jira/browse/PIG-3530
## Development and Testing
To run tests, you need Java installed and `JAVA_HOME` configured.
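A hypothetical pre-flight check for that requirement might look as follows; the function name and messages are illustrative, not part of the repository.

```shell
# Confirm JAVA_HOME is set and points at a java binary before
# attempting local-mode tests (sketch; names are made up).
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  if [ ! -x "${JAVA_HOME}/bin/java" ]; then
    echo "no executable java under ${JAVA_HOME}/bin"
    return 1
  fi
  echo "JAVA_HOME ok: ${JAVA_HOME}"
}
```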
Fetch dependencies (including Pig) from the top-level directory:

```
./fetch_hadoop.sh
```
Write `.pig` scripts in this directory, and add a python wrapper test to
`./tests/` when done. Test vector files (input/output) can go in
`./tests/files/`.
Run the tests with:

```
pipenv run pytest
```
One could also, in theory, use a Docker image ([local-pig][]), but it's easy
enough to just download Pig.
[local-pig]: https://hub.docker.com/r/chalimartines/local-pig
## Run in Production
```
pig -param INPUT="/user/bnewbold/pdfs/global-20171227034923" \
    -param OUTPUT="/user/bnewbold/pdfs/gwb-pdf-20171227034923-surt-filter" \
    filter-cdx-paper-pdfs.pig
```