This directory contains Hadoop map/reduce jobs written in Scala (compiled to
the JVM) using the Scalding framework.
See the other markdown files in this directory for more background and tips.
## Building and Running

To build locally, you need a JVM (e.g., OpenJDK 1.8) and the `sbt` build tool;
you might also need exactly Scala version 2.11.8.

See the section below on building and installing the custom SpyGlass jar.

Run tests:

    sbt test

Build a jar and upload to a cluster machine (from which to run in production):

    sbt assembly
    scp target/scala-2.11/sandcrawler-assembly-0.2.0-SNAPSHOT.jar devbox:

Run on cluster:

    devbox$ touch thing.conf
    devbox$ hadoop jar sandcrawler-assembly-0.2.0-SNAPSHOT.jar \
        com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs \
        --app.conf.path thing.conf \
        --output hdfs:///user/bnewbold/spyglass_out_test

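
To run other job classes with the same argument layout, the command above could be wrapped in a small shell function. This is only a sketch: the function name `run_scalding_job`, its parameter order, and the `DRY_RUN` switch are hypothetical and not part of this repository; the jar name and flags are copied from the command above.

```shell
# Hypothetical helper (not part of this repo): build the `hadoop jar` command
# shown above for an arbitrary job class, config file, and output path.
JAR="sandcrawler-assembly-0.2.0-SNAPSHOT.jar"

run_scalding_job() {
    job_class="$1"   # e.g. sandcrawler.HBaseRowCountJob
    conf_path="$2"   # e.g. thing.conf
    output="$3"      # e.g. hdfs:///user/bnewbold/spyglass_out_test

    # Replace the function's positional parameters with the full command line.
    set -- hadoop jar "$JAR" com.twitter.scalding.Tool "$job_class" --hdfs \
        --app.conf.path "$conf_path" --output "$output"

    if [ -n "${DRY_RUN:-}" ]; then
        # DRY_RUN is an illustrative switch: print the command instead of running it.
        echo "$*"
    else
        "$@"
    fi
}
```

With `DRY_RUN=1` set, calling `run_scalding_job sandcrawler.HBaseRowCountJob thing.conf hdfs:///user/bnewbold/spyglass_out_test` prints the command instead of submitting the job, which is handy for checking arguments before a production run.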
If your `sbt` task fails with this error:

    java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Metaspace

try restarting `sbt` with more memory (e.g., `sbt -mem 2048`).

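
If raising the overall memory setting is not enough, the Metaspace limit can also be set directly. A minimal sketch, assuming `sbt` is started through the standard launcher script (which reads the `SBT_OPTS` environment variable) and a Java 8 or newer JVM; the sizes are illustrative, not values tested on this project:

```shell
# Assumption: the sbt launcher script passes SBT_OPTS through to the JVM.
# -Xmx sets the maximum heap size; -XX:MaxMetaspaceSize caps the Metaspace
# region named in the error message. Both sizes here are illustrative.
export SBT_OPTS="-Xmx2G -XX:MaxMetaspaceSize=512m"

# Then start sbt as usual in the same shell, for example:
#   sbt assembly
```

Exporting the variable in the shell keeps the setting for later `sbt` invocations in that session, whereas `-mem` has to be repeated on each command line.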
## Building SpyGlass Jar

SpyGlass is a "scalding-to-HBase" connector. It is no longer maintained, so we
had to rebuild it to support our versions of HBase, Scalding, etc. From a
checkout of the SpyGlass fork (<https://github.com/bnewbold/SpyGlass>,
`bnewbold-scala2.11` branch):

    cd ~/src/SpyGlass
    git checkout bnewbold-scala2.11

    # This builds the new .jar and installs it in the (laptop local) ~/.m2
    # repository
    mvn clean install -U

    # Copy that .jar (and associated pom.xml) over to where sbt can find it
    mkdir -p ~/.sbt/preloaded/parallelai/
    cp -r ~/.m2/repository/parallelai/parallelai.spyglass ~/.sbt/preloaded/parallelai/

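
Before running `sbt`, it can help to confirm that the copy actually put the artifacts where `sbt` will look. A small sketch, assuming the usual Maven repository layout (version directories containing `.jar` and `.pom` files); the function name `check_preloaded` is hypothetical:

```shell
# Hypothetical check (not part of this repo): list the .jar and .pom files
# under sbt's preloaded directory, or fail if none are found.
check_preloaded() {
    # Default path matches the destination of the `cp -r` command above.
    dir="${1:-$HOME/.sbt/preloaded/parallelai/parallelai.spyglass}"

    found=$(find "$dir" -type f \( -name '*.jar' -o -name '*.pom' \) 2>/dev/null)
    if [ -z "$found" ]; then
        echo "no jar or pom files under $dir" >&2
        return 1
    fi
    echo "$found"
}
```

Running `check_preloaded` with no arguments inspects the default location; a non-zero exit status suggests the `mvn clean install` or `cp` step needs to be repeated.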
The medium-term plan is to publish the custom SpyGlass jar as a static Maven
repository in an archive.org item, and to point `build.sbt` at that location.