path: root/scalding/scalding-background.md
author: Bryan Newbold <bnewbold@archive.org> 2018-05-23 12:54:08 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2018-05-24 00:02:36 -0700
commit: 84f8937002e99025701d7b4a75fff95f78aedb16
tree: 88a9651773ebfcef196a01a0cd6e67524b759f74 /scalding/scalding-background.md
parent: 74a3a8ea05824e7bc300f6889957b7903a78dee5
cleanup scalding notes/README
Diffstat (limited to 'scalding/scalding-background.md')
-rw-r--r-- scalding/scalding-background.md 90
1 files changed, 90 insertions, 0 deletions
diff --git a/scalding/scalding-background.md b/scalding/scalding-background.md
new file mode 100644
index 0000000..f57022b
--- /dev/null
+++ b/scalding/scalding-background.md
@@ -0,0 +1,90 @@
+
+## Tips/Gotchas
+
+`.scala` file names should match the names of the classes they define.
+
+## Dev Environment
+
+Versions running on Bryan's Debian/Linux laptop:
+
+    openjdk version "1.8.0_171"
+    OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~deb9u1-b11)
+    OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
+
+    Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
+
+    sbt: 1.1.5
+
+Scala was installed from the regular Debian (stretch) `apt` repository; `sbt` from
+a bintray.com apt repo linked from the sbt website.
+
+## Creating a new project
+
+    sbt new scala/scala-seed.g8
+
+    # inserted additional deps, tweaked versions
+    # hadoop 2.5.0 seems to conflict with cascading; sticking with 2.6.0
+
+    sbt assembly
+    scp target/scala-2.11/scald-mvp-assembly-0.1.0-SNAPSHOT.jar devbox:
+
+## Invoking on IA Cluster (old)
+
+This seemed to work (run from the scalding repo):
+
+    yarn jar tutorial/execution-tutorial/target/scala-2.11/execution-tutorial-assembly-0.18.0-SNAPSHOT.jar Tutorial1 --hdfs --input test_cdx --output test_scalding_out1
+
+Or, with actual files on hadoop:
+
+    yarn jar tutorial/execution-tutorial/target/scala-2.11/execution-tutorial-assembly-0.18.0-SNAPSHOT.jar Tutorial1 --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
+
+Hooray! One issue with this was that building scalding took *forever* (meaning
+30+ minutes).
+
+Potentially instead, using the `scald-mvp` assembly jar built above:
+
+    hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool main.scala.example.WordCountJob --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
+
+Hypothesis: the class name should be the same as the file name. No `main` function
+is needed when using the Scalding `Tool` wrapper, and `scald.rb` isn't needed either.
+
+    hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.WordCount --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
+
+
+## Scalding Repo
+
+Got started by compiling and running examples (e.g., Tutorial0) from the
+`twitter/scalding` upstream repo. That repo has some special magic: a `./sbt`
+wrapper script, and a `scripts/scald.rb` Ruby script for invoking specific
+jobs. Neither ended up being necessary.
+
+Uncommenting this line in scalding's `build.sbt` sped things way up (no need to
+run *all* the tests before building the assembly):
+
+    // Uncomment if you don't want to run all the tests before building assembly
+    // test in assembly := {},
+
+Also got the following error (in a different context), likely because the template's old Scala 2.9.1 can't read Java 8 class files:
+
+    bnewbold@orithena$ sbt new typesafehub/scala-sbt
+    [info] Loading project definition from /home/bnewbold/src/scala-sbt.g8/project/project
+    [info] Compiling 1 Scala source to /home/bnewbold/src/scala-sbt.g8/project/project/target/scala-2.9.1/sbt-0.11.2/classes...
+    [error] error while loading CharSequence, class file '/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken
+    [error] (bad constant pool tag 18 at byte 10)
+    [error] one error found
+    [error] {file:/home/bnewbold/src/scala-sbt.g8/project/project/}default-46da7b/compile:compile: Compilation failed
+    Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?
+
+## Resources
+
+A whole bunch of example commands (sbt, maven, gradle) for building scalding:
+
+    https://medium.com/@gayani.nan/how-to-run-a-scalding-job-567160fa193
+
+Also looks good:
+
+    https://blog.matthewrathbone.com/2015/10/20/scalding-tutorial.html
+
+Possibly related:
+
+    http://sujitpal.blogspot.com/2012/08/scalding-for-impatient.html