author | Bryan Newbold <bnewbold@archive.org> | 2019-12-26 21:35:36 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2019-12-26 21:35:36 -0800 |
commit | 172c426c4aa3fc3722813e32c08ee557c9b9d0cd (patch) | |
tree | ef05ea2fe9db4510a5aa6b789eccce6db178faab /notes | |
parent | 905c116821dbcf0103323fcf8f0b58d2dfa81ddf (diff) | |
update job log with pig runs
Diffstat (limited to 'notes')
-rw-r--r-- | notes/job_log.txt | 10 |
1 files changed, 10 insertions, 0 deletions
diff --git a/notes/job_log.txt b/notes/job_log.txt
index 68bef9b..67623ec 100644
--- a/notes/job_log.txt
+++ b/notes/job_log.txt
@@ -173,3 +173,13 @@ extract_chunk.sh:
     touch $1.SUCCESS
 
 seems to be working better! tested and if there is a problem with one chunk the others continue
+
+## Pig Joins (around 2019-12-24)
+
+Partial (as a start):
+
+    pig -param INPUT_CDX="/user/bnewbold/pdfs/gwb-pdf-20191005172329" -param INPUT_DIGEST="/user/bnewbold/scihash/shadow.20191222.sha1b32.sorted" -param OUTPUT="/user/bnewbold/scihash/gwb-pdf-20191005172329.shadow.20191222.join.cdx" join-cdx-sha1.pig
+
+Full GWB:
+
+    pig -param INPUT_CDX="/user/bnewbold/pdfs/gwb-pdf-20191005172329" -param INPUT_DIGEST="/user/bnewbold/scihash/shadow.20191222.sha1b32.sorted" -param OUTPUT="/user/bnewbold/scihash/gwb-pdf-20191005172329.shadow.20191222.join.cdx" join-cdx-sha1.pig
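Both logged invocations pass the same three `-param` names (`INPUT_CDX`, `INPUT_DIGEST`, `OUTPUT`) to `join-cdx-sha1.pig`. As a rough sketch only, the command line could be assembled from variables so that the partial and full runs differ only in their arguments. The function name `build_join_cmd` is hypothetical and not part of the repository. It assumes nothing about the Pig script beyond the parameter names shown in the log, and it only prints the command rather than running `pig`.

```shell
# Hypothetical helper (not in the repository): print the pig command line
# for a CDX/SHA-1 join, using the three -param names from the job log.
# Arguments: $1 = CDX input path, $2 = sorted digest list, $3 = output path.
build_join_cmd() {
    printf 'pig -param INPUT_CDX="%s" -param INPUT_DIGEST="%s" -param OUTPUT="%s" join-cdx-sha1.pig\n' \
        "$1" "$2" "$3"
}

# Example using the paths recorded in the log entry.
build_join_cmd \
    "/user/bnewbold/pdfs/gwb-pdf-20191005172329" \
    "/user/bnewbold/scihash/shadow.20191222.sha1b32.sorted" \
    "/user/bnewbold/scihash/gwb-pdf-20191005172329.shadow.20191222.join.cdx"
```

Printing the command first makes it easy to paste the exact invocation into a log like this one before running it.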