author Bryan Newbold <bnewbold@archive.org> 2019-12-26 21:35:36 -0800
committer Bryan Newbold <bnewbold@archive.org> 2019-12-26 21:35:36 -0800
commit 172c426c4aa3fc3722813e32c08ee557c9b9d0cd
tree ef05ea2fe9db4510a5aa6b789eccce6db178faab
parent 905c116821dbcf0103323fcf8f0b58d2dfa81ddf
update job log with pig runs
-rw-r--r-- notes/job_log.txt 10
1 files changed, 10 insertions, 0 deletions
diff --git a/notes/job_log.txt b/notes/job_log.txt
index 68bef9b..67623ec 100644
--- a/notes/job_log.txt
+++ b/notes/job_log.txt
@@ -173,3 +173,13 @@ extract_chunk.sh:
touch $1.SUCCESS
seems to be working better! tested and if there is a problem with one chunk the others continue
+
+## Pig Joins (around 2019-12-24)
+
+Partial (as a start):
+
+ pig -param INPUT_CDX="/user/bnewbold/pdfs/gwb-pdf-20191005172329" -param INPUT_DIGEST="/user/bnewbold/scihash/shadow.20191222.sha1b32.sorted" -param OUTPUT="/user/bnewbold/scihash/gwb-pdf-20191005172329.shadow.20191222.join.cdx" join-cdx-sha1.pig
+
+Full GWB:
+
+ pig -param INPUT_CDX="/user/bnewbold/pdfs/gwb-pdf-20191005172329" -param INPUT_DIGEST="/user/bnewbold/scihash/shadow.20191222.sha1b32.sorted" -param OUTPUT="/user/bnewbold/scihash/gwb-pdf-20191005172329.shadow.20191222.join.cdx" join-cdx-sha1.pig
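
The two logged commands share a long, hand-typed set of `-param` values, so a small wrapper may help keep them consistent. The following is only a sketch: the function names `join_output_path` and `print_pig_join_cmd` are hypothetical, and the output naming convention (CDX name, then digest name without its `.sha1b32.sorted` suffix, then `.join.cdx`) is an assumption inferred from the paths in the log, not something the repository is known to define. It prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names and the output naming convention are
# assumptions inferred from the logged commands, not part of sandcrawler.

# Derive the OUTPUT path as <out_dir>/<cdx name>.<digest name>.join.cdx,
# dropping the ".sha1b32.sorted" suffix from the digest file name.
join_output_path() {
    local input_cdx="$1" input_digest="$2" out_dir="$3"
    local cdx_name digest_name
    cdx_name="$(basename "$input_cdx")"
    digest_name="$(basename "$input_digest" .sha1b32.sorted)"
    printf '%s/%s.%s.join.cdx\n' "$out_dir" "$cdx_name" "$digest_name"
}

# Print (rather than run) the pig invocation so it can be reviewed first.
print_pig_join_cmd() {
    local input_cdx="$1" input_digest="$2" out_dir="$3"
    local output
    output="$(join_output_path "$input_cdx" "$input_digest" "$out_dir")"
    printf 'pig -param INPUT_CDX="%s" -param INPUT_DIGEST="%s" -param OUTPUT="%s" join-cdx-sha1.pig\n' \
        "$input_cdx" "$input_digest" "$output"
}
```

Calling `print_pig_join_cmd /user/bnewbold/pdfs/gwb-pdf-20191005172329 /user/bnewbold/scihash/shadow.20191222.sha1b32.sorted /user/bnewbold/scihash` should reproduce the command logged above, which makes it easy to spot when two runs were given the same inputs.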