author: Bryan Newbold <bnewbold@archive.org> 2018-08-24 12:40:33 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2018-08-24 12:40:33 -0700
commit: 8374e6bb72c095755ab2185e9205e776863cfd5f
tree: 89fae4276dc92ad7265bfce237ed8d55aef2758e /scalding
parent: 1dd9e8da5912ef0f190aacf20d27586559a277f5
update scalding README
Diffstat (limited to 'scalding')
-rw-r--r-- scalding/README.md | 36
1 files changed, 23 insertions, 13 deletions
diff --git a/scalding/README.md b/scalding/README.md
index 45b62d0..b09e0e8 100644
--- a/scalding/README.md
+++ b/scalding/README.md
@@ -1,12 +1,13 @@
This directory contains Hadoop map/reduce jobs written in Scala (compiled to
-the JVM) using the Scalding framework.
+the JVM) using the Scalding framework. Scalding builds on the Java Cascading
+library, which itself builds on the Java Hadoop libraries.
See the other markdown files in this directory for more background and tips.
## Dependencies
-Locally, you need to have the JVM (eg, OpenJDK 1.8), `sbt` build tool, and
-might need (exactly) Scala version 2.11.8.
+To develop locally, you need to have the JVM (eg, OpenJDK 1.8), `sbt` build
+tool, and might need (exactly) Scala version 2.11.8.
On a debian/ubuntu machine:
@@ -15,24 +16,32 @@ On a debian/ubuntu machine:
sudo apt-get update
sudo apt install scala sbt
+It's also helpful to have a local copy of the `hadoop` binary for running
+benchmarks. The `fetch_hadoop.sh` script in the top level directory will fetch
+an appropriate version.
+
## Building and Running
-Run tests:
+You can run `sbt` commands individually:
+ # run all tests
sbt test
-Build a jar and upload to a cluster machine (from which to run in production):
-
+ # build a jar (also runs tests)
sbt assembly
- scp target/scala-2.11/sandcrawler-assembly-0.2.0-SNAPSHOT.jar devbox:
-Run on cluster:
+Or you can start a session and run commands within that, which is *much*
+faster:
+
+ sbt -mem 2048
+
+ sbt> test
+ sbt> assembly
+ sbt> testOnly sandcrawler.SomeTestClassName
- devbox$ touch thing.conf
- devbox$ hadoop jar sandcrawler-assembly-0.2.0-SNAPSHOT.jar \
- com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs \
- --app.conf.path thing.conf \
- --output hdfs:///user/bnewbold/spyglass_out_test
+On the cluster, you usually use the `please` script to kick off jobs. Be sure
+to build the jars first, or pass `--rebuild` to do it automatically. You need
+`hadoop` on your path for this.
## Troubleshooting
@@ -42,3 +51,4 @@ If your `sbt` task fails with this error:
try restarting `sbt` with more memory (e.g., `sbt -mem 2048`).
+See `scalding-debugging.md` or maybe `../notes/` for more.