From bb0e9b312f8248e882f8650897966ff57117aa17 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Mon, 17 May 2021 21:05:06 +0200
Subject: pig: notes

---
 pig/README.md | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/pig/README.md b/pig/README.md
index 6d41658..5281169 100644
--- a/pig/README.md
+++ b/pig/README.md
@@ -12,6 +12,25 @@ Additional jars can be loaded, e.g.
 
 Ops: Cluster load high until end of 04/2021; putting lookups on hold.
 
+# 05/2021 Run
+
+```
+$ source /home/webcrawl/hadoop-env/prod/setup-env.sh
+$ zstdcat -T0 date-2021-05-06.tsv.zst | LC_ALL=C grep ^http > fatcat-refs-urllist-2021-05-06.tsv
+$ time gohdfs put fatcat-refs-urllist-2021-05-06.tsv /user/martin # 36s
+```
+
+A test run with a single file.
+
+```
+$ pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210422171221/part-a-00031.gz -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv -p OUTPUT=/user/martin/fatcat-refs-lookup-0
+```
+
+* ia802401.us.archive.org:6988/
+* http://ia802401.us.archive.org:6988/cluster/app/application_1611217683160_298042
+* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_298042/
+
+
 ----
 
 # Previous Notes (BN)
-- 
cgit v1.2.3