Diffstat (limited to 'pig')
-rw-r--r-- | pig/README.md | 19 |
1 files changed, 19 insertions, 0 deletions
diff --git a/pig/README.md b/pig/README.md
index 6d41658..5281169 100644
--- a/pig/README.md
+++ b/pig/README.md
@@ -12,6 +12,25 @@ Additional jars can be loaded, e.g.
 
 Ops: Cluster load high until end of 04/2021; putting lookups on hold.
 
+# 05/2021 Run
+
+```
+$ source /home/webcrawl/hadoop-env/prod/setup-env.sh
+$ zstdcat -T0 date-2021-05-06.tsv.zst | LC_ALL=C grep ^http > fatcat-refs-urllist-2021-05-06.tsv
+$ time gohdfs put fatcat-refs-urllist-2021-05-06.tsv /user/martin # 36s
+```
+
+A test run with a single file.
+
+```
+$ pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210422171221/part-a-00031.gz -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv -p OUTPUT=/user/martin/fatcat-refs-lookup-0
+```
+
+* ia802401.us.archive.org:6988/
+* http://ia802401.us.archive.org:6988/cluster/app/application_1611217683160_298042
+* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_298042/
+
+
 ----
 
 # Previous Notes (BN)
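
Note on the URL-list step: the `LC_ALL=C grep ^http` filter in the added README section can be tried on a small input before running it against the full dump. The following is a minimal sketch, assuming a plain-text sample in place of the zstd-compressed file; the sample lines and temporary file names are hypothetical, since the notes do not show the real column layout of `date-2021-05-06.tsv`.

```shell
# Hypothetical sample standing in for the decompressed input;
# the real file contents are not shown in the notes.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'http://example.com/a' \
  'https://example.org/b' \
  'ftp://example.net/c' \
  'doi:10.1000/xyz' > "$tmpdir/sample.tsv"

# LC_ALL=C makes grep match on bytes rather than locale-aware characters,
# which is typically faster on large inputs.
# ^http keeps only lines that start with "http", so https:// lines pass too.
LC_ALL=C grep '^http' "$tmpdir/sample.tsv" > "$tmpdir/urllist.tsv"

# Count the surviving lines as a quick sanity check.
kept=$(grep -c '^http' "$tmpdir/urllist.tsv")
echo "kept $kept lines"
```

Because the pattern is anchored at the start of the line, only rows whose first field begins with `http` survive; rows with a URL in a later column would be dropped.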