1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
|
## Why Scalding
Scalding vs. Java (MapReduce) vs. Java (Cascading) vs. Scoobi vs. Scrunch:
- <https://speakerdeck.com/agemooij/why-hadoop-mapreduce-needs-scala?slide=34>
- <https://github.com/twitter/scalding/wiki/Comparison-to-Scrunch-and-Scoobi>
## Tips/Gotchas
`.scala` file names should match internal classes.
This project used to be called `scald-mvp`; now it's `sandcrawler`.
## Dev Environment
Versions running on Bryan's Debian/Linux laptop:
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~deb9u1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
sbt: 1.1.5
Scala was installed via regular debian (stretch) `apt` repository; `sbt` using
a bintray.com apt repo linked from the sbt website.
## Creating a new project
sbt new scala/scala-seed.g8
# inserted additional deps, tweaked versions
# hadoop 2.5.0 seems to conflict with cascading; sticking with 2.6.0
sbt assembly
scp target/scala-2.11/scald-mvp-assembly-0.1.0-SNAPSHOT.jar devbox:
## Invoking on IA Cluster (old)
This seemed to work (from scalding repo):
yarn jar tutorial/execution-tutorial/target/scala-2.11/execution-tutorial-assembly-0.18.0-SNAPSHOT.jar Tutorial1 --hdfs --input test_cdx --output test_scalding_out1
Or, with actual files on hadoop:
yarn jar tutorial/execution-tutorial/target/scala-2.11/execution-tutorial-assembly-0.18.0-SNAPSHOT.jar Tutorial1 --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
Horray! One issue with this was that building scalding took *forever* (meaning
30+ minutes).
potentially instead:
hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool main.scala.example.WordCountJob --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
Hypothesis: class name should be same as file name. Don't need `main` function
if using Scalding Tool wrapper jar. Don't need scald.rb.
hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.WordCount --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
## Scalding Repo
Got started by compiling and running examples (eg, Tutorial0) from the
`twitter/scalding` upstream repo. That repo has some special magic: a `./sbt`
wrapper script, and a `scripts/scald.rb` ruby script for invoking specific
jobs. Didn't end up being necessary.
Uncommenting this line in scalding:build.sbt sped things way up (don't need to
run *all* the tests):
// Uncomment if you don't want to run all the tests before building assembly
// test in assembly := {},
Also get the following error (in a different context):
bnewbold@orithena$ sbt new typesafehub/scala-sbt
[info] Loading project definition from /home/bnewbold/src/scala-sbt.g8/project/project
[info] Compiling 1 Scala source to /home/bnewbold/src/scala-sbt.g8/project/project/target/scala-2.9.1/sbt-0.11.2/classes...
[error] error while loading CharSequence, class file '/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken
[error] (bad constant pool tag 18 at byte 10)
[error] one error found
[error] {file:/home/bnewbold/src/scala-sbt.g8/project/project/}default-46da7b/compile:compile: Compilation failed
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?
## Resources
Whole bunch of example commands (sbt, maven, gradle) to build scalding:
https://medium.com/@gayani.nan/how-to-run-a-scalding-job-567160fa193
Also looks good:
https://blog.matthewrathbone.com/2015/10/20/scalding-tutorial.html
Possibly related:
http://sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
|