Diffstat (limited to 'pig')
-rw-r--r-- pig/README.md | 19
1 files changed, 19 insertions, 0 deletions
diff --git a/pig/README.md b/pig/README.md
index 6d41658..5281169 100644
--- a/pig/README.md
+++ b/pig/README.md
@@ -12,6 +12,25 @@ Additional jars can be loaded, e.g.
Ops: Cluster load high until end of 04/2021; putting lookups on hold.
+# 05/2021 Run
+
+```
+$ source /home/webcrawl/hadoop-env/prod/setup-env.sh
+$ zstdcat -T0 date-2021-05-06.tsv.zst | LC_ALL=C grep ^http > fatcat-refs-urllist-2021-05-06.tsv
+$ time gohdfs put fatcat-refs-urllist-2021-05-06.tsv /user/martin # 36s
+```
+
+A test run with a single file.
+
+```
+$ pig -p INPUT_CDX=/user/wmdata2/cdx-all-index/20210422171221/part-a-00031.gz -p INPUT_URLS=/user/martin/fatcat-refs-urllist-2021-05-06.tsv -p OUTPUT=/user/martin/fatcat-refs-lookup-0
+```
+
+* ia802401.us.archive.org:6988/
+* http://ia802401.us.archive.org:6988/cluster/app/application_1611217683160_298042
+* http://ia802401.us.archive.org:6988/proxy/application_1611217683160_298042/
+
+
----
# Previous Notes (BN)