author | Bryan Newbold <bnewbold@archive.org> | 2018-08-24 12:40:33 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2018-08-24 12:40:33 -0700 |
commit | 8374e6bb72c095755ab2185e9205e776863cfd5f (patch) | |
tree | 89fae4276dc92ad7265bfce237ed8d55aef2758e /scalding | |
parent | 1dd9e8da5912ef0f190aacf20d27586559a277f5 (diff) | |
download | sandcrawler-8374e6bb72c095755ab2185e9205e776863cfd5f.tar.gz sandcrawler-8374e6bb72c095755ab2185e9205e776863cfd5f.zip |
update scalding README
Diffstat (limited to 'scalding')
-rw-r--r-- | scalding/README.md | 36 |
1 files changed, 23 insertions, 13 deletions
diff --git a/scalding/README.md b/scalding/README.md
index 45b62d0..b09e0e8 100644
--- a/scalding/README.md
+++ b/scalding/README.md
@@ -1,12 +1,13 @@
 This directory contains Hadoop map/reduce jobs written in Scala (compiled to
-the JVM) using the Scalding framework.
+the JVM) using the Scalding framework. Scalding builds on the Java Cascading
+library, which itself builds on the Java Hadoop libraries.
 
 See the other markdown files in this directory for more background and tips.
 
 ## Dependencies
 
-Locally, you need to have the JVM (eg, OpenJDK 1.8), `sbt` build tool, and
-might need (exactly) Scala version 2.11.8.
+To develop locally, you need to have the JVM (eg, OpenJDK 1.8), `sbt` build
+tool, and might need (exactly) Scala version 2.11.8.
 
 On a debian/ubuntu machine:
 
@@ -15,24 +16,32 @@ On a debian/ubuntu machine:
     sudo apt-get update
     sudo apt install scala sbt
 
+It's also helpful to have a local copy of the `hadoop` binary for running
+benchmarks. The `fetch_hadoop.sh` script in the top level directory will fetch
+an appropriate version.
+
 ## Building and Running
 
-Run tests:
+You can run `sbt` commands individually:
 
+    # run all test
     sbt test
 
-Build a jar and upload to a cluster machine (from which to run in production):
-
+    # build a jar (also runs tests)
     sbt assembly
-    scp target/scala-2.11/sandcrawler-assembly-0.2.0-SNAPSHOT.jar devbox:
 
-Run on cluster:
+Or you can start a session and run commands within that, which is *much*
+faster:
+
+    sbt -mem 2048
+
+    sbt> test
+    sbt> assembly
+    sbt> testOnly sandcrawler.SomeTestClassName
 
-    devbox$ touch thing.conf
-    devbox$ hadoop jar sandcrawler-assembly-0.2.0-SNAPSHOT.jar \
-    com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs \
-    --app.conf.path thing.conf \
-    --output hdfs:///user/bnewbold/spyglass_out_test
+On the cluster, you usually use the `please` script to kick off jobs. Be sure
+to build the jars first, or pass `--rebuild` to do it automatically. You need
+`hadoop` on your path for this.
 
 ## Troubleshooting
 
@@ -42,3 +51,4 @@ If your `sbt` task fails with this error:
 
 try restarting `sbt` with more memory (e.g., `sbt -mem 2048`).
 
+See `scalding-debugging.md` or maybe `../notes/` for more.