Diffstat (limited to 'pig/README.md')
-rw-r--r-- | pig/README.md | 23 |
1 files changed, 15 insertions, 8 deletions
diff --git a/pig/README.md b/pig/README.md
index d13c065..bc892fa 100644
--- a/pig/README.md
+++ b/pig/README.md
@@ -1,4 +1,4 @@
-# Notes
+# Pig Notes
 
 In April 2021, we run pig 0.12 and hadoop 2.6.0-cdh5.14.4.
 
@@ -23,13 +23,13 @@ $ time gohdfs put fatcat-refs-urllist-2021-05-06.tsv /user/martin # 36s
 A test run with a single file.
 
 ```
-$ pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210422171221/part-a-00031.gz -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv -p OUTPUT=/user/martin/fatcat-refs-lookup-0
+$ pig \
+    -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210422171221/part-a-00031.gz \
+    -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv \
+    -p OUTPUT=/user/martin/fatcat-refs-lookup-0 \
+    filter-cdx-join-urls.pig
 ```
 
-* http://ia802401.us.archive.org:6988/
-* http://ia802401.us.archive.org:6988/cluster/app/application_1611217683160_298042
-* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_298042/
-
 Running against 1/300 block of global CDX took about 15h.
 
 ```
@@ -58,11 +58,18 @@ $ jq -rc .status refs_links_liveweb.json | sort | uniq -c | sort -nr 2> /dev/nul
 Running against a full index.
 
 ```
-$ time pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210321055100/part-a-*.gz -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv -p OUTPUT=/user/martin/fatcat-refs-lookup-1 filter-cdx-join-urls.pig
+$ time pig \
+    -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210321055100/part-a-*.gz \
+    -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv \
+    -p OUTPUT=/user/martin/fatcat-refs-lookup-1 \
+    filter-cdx-join-urls.pig
 ```
 
-* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_300026/
+* application id was: `application_1611217683160_300026`
 
+The full lookup led to "map spill", since we needed to extract surts from the
+full CDX index. Not taking advantage of zipnum and other possible improvements.
+Killed the jobs; required hdfs cleanup.
 
 ----