This directory contains Hadoop map/reduce jobs written in Scala (compiled to
the JVM) using the Scalding framework. Scalding builds on the Java Cascading
library, which itself builds on the Java Hadoop libraries.
See the other markdown files in this directory for more background and tips.
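
For orientation, here is a minimal sketch of what a Scalding job looks like
(a generic word count using Scalding's fields-based API; the class name is
hypothetical and this is not one of the jobs in this directory):

    package sandcrawler

    import com.twitter.scalding._

    // Counts whitespace-separated words in a text file and writes
    // (word, count) rows as TSV. Input/output paths come from the
    // --input and --output job arguments.
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }
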
## Dependencies
To develop locally, you need a JVM (e.g., OpenJDK 1.8) and the `sbt` build
tool; you may also need Scala itself, at exactly version 2.11.8.

On a debian/ubuntu machine:

    echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
    sudo apt-get update
    sudo apt install scala sbt

It's also helpful to have a local copy of the `hadoop` binary for running
benchmarks. The `fetch_hadoop.sh` script in the top level directory will fetch
an appropriate version.
## Building and Running
You can run `sbt` commands individually:

    # run all tests
    sbt test

    # build a jar (also runs tests)
    sbt assembly

Or you can start a session and run commands within that, which is *much*
faster:

    sbt -mem 2048
    sbt> test
    sbt> assembly
    sbt> testOnly sandcrawler.SomeTestClassName

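`testOnly` takes a fully-qualified test class name. As a sketch of what such a
suite can look like, Scalding ships a `JobTest` harness that runs a job in
local (non-Hadoop) mode; this hypothetical example exercises the word-count
sketch from the introduction:

    package sandcrawler

    import com.twitter.scalding._
    import org.scalatest.FlatSpec

    class WordCountJobTest extends FlatSpec {
      "WordCountJob" should "count words" in {
        JobTest(new WordCountJob(_))
          .arg("input", "input.txt")
          .arg("output", "output.tsv")
          // TextLine test data is given as (byte offset, line) pairs
          .source(TextLine("input.txt"), List((0, "hello world hello")))
          .sink[(String, Long)](Tsv("output.tsv")) { out =>
            assert(out.toMap.get("hello") == Some(2L))
          }
          .run
          .finish
      }
    }
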
On the cluster, you usually use the `please` script to kick off jobs. Be sure
to build the jars first, or pass `--rebuild` to do it automatically. You need
`hadoop` on your path for this.
## Troubleshooting
If your `sbt` task fails with this error:

    java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Metaspace

try restarting `sbt` with more memory (e.g., `sbt -mem 2048`).
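If that fixes it, you can make the setting persistent: the stock `sbt`
launcher also reads options from a `.sbtopts` file in the project root, so
putting `-mem 2048` on a line by itself there applies it to every invocation.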
See `scalding-debugging.md` or maybe `../notes/` for more.