author: Martin Czygan <martin.czygan@gmail.com> 2021-05-17 21:05:06 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-05-17 21:05:06 +0200
commit: bb0e9b312f8248e882f8650897966ff57117aa17
tree: d4278d948712a1da65853d25c6d8786280341325 /pig
parent: 7e3a9fdb956f6c75ac281dca637cb862edd91ae6
pig: notes
Diffstat (limited to 'pig')
-rw-r--r-- pig/README.md 19
1 file changed, 19 insertions, 0 deletions
diff --git a/pig/README.md b/pig/README.md
index 6d41658..5281169 100644
--- a/pig/README.md
+++ b/pig/README.md
@@ -12,6 +12,25 @@ Additional jars can be loaded, e.g.
Ops: Cluster load high until end of 04/2021; putting lookups on hold.
+# 05/2021 Run
+
+```
+$ source /home/webcrawl/hadoop-env/prod/setup-env.sh
+$ zstdcat -T0 date-2021-05-06.tsv.zst | LC_ALL=C grep ^http > fatcat-refs-urllist-2021-05-06.tsv
+$ time gohdfs put fatcat-refs-urllist-2021-05-06.tsv /user/martin # 36s
+```
+
+A test run with a single file.
+
+```
+$ pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210422171221/part-a-00031.gz -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv -p OUTPUT=/user/martin/fatcat-refs-lookup-0
+```
+
+* ia802401.us.archive.org:6988/
+* http://ia802401.us.archive.org:6988/cluster/app/application_1611217683160_298042
+* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_298042/
+
+
----
# Previous Notes (BN)