author | Martin Czygan <martin.czygan@gmail.com> | 2021-05-17 21:05:06 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-05-17 21:05:06 +0200 |
commit | bb0e9b312f8248e882f8650897966ff57117aa17 (patch) | |
tree | d4278d948712a1da65853d25c6d8786280341325 | |
parent | 7e3a9fdb956f6c75ac281dca637cb862edd91ae6 (diff) | |
download | refcat-bb0e9b312f8248e882f8650897966ff57117aa17.tar.gz refcat-bb0e9b312f8248e882f8650897966ff57117aa17.zip |
pig: notes
-rw-r--r-- | pig/README.md | 19 |
1 files changed, 19 insertions, 0 deletions
diff --git a/pig/README.md b/pig/README.md
index 6d41658..5281169 100644
--- a/pig/README.md
+++ b/pig/README.md
@@ -12,6 +12,25 @@ Additional jars can be loaded, e.g.

 Ops: Cluster load high until end of 04/2021; putting lookups on hold.

+# 05/2021 Run
+
+```
+$ source /home/webcrawl/hadoop-env/prod/setup-env.sh
+$ zstdcat -T0 date-2021-05-06.tsv.zst | LC_ALL=C grep ^http > fatcat-refs-urllist-2021-05-06.tsv
+$ time gohdfs put fatcat-refs-urllist-2021-05-06.tsv /user/martin # 36s
+```
+
+A test run with a single file.
+
+```
+$ pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210422171221/part-a-00031.gz -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv -p OUTPUT=/user/martin/fatcat-refs-lookup-0
+```
+
+* ia802401.us.archive.org:6988/
+* http://ia802401.us.archive.org:6988/cluster/app/application_1611217683160_298042
+* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_298042/
+
+
 ----

 # Previous Notes (BN)