authorBryan Newbold <bnewbold@archive.org>2018-05-23 12:54:08 -0700
committerBryan Newbold <bnewbold@archive.org>2018-05-24 00:02:36 -0700
commit84f8937002e99025701d7b4a75fff95f78aedb16 (patch)
tree88a9651773ebfcef196a01a0cd6e67524b759f74
parent74a3a8ea05824e7bc300f6889957b7903a78dee5 (diff)
downloadsandcrawler-84f8937002e99025701d7b4a75fff95f78aedb16.tar.gz
sandcrawler-84f8937002e99025701d7b4a75fff95f78aedb16.zip
cleanup scalding notes/README
-rw-r--r--  scalding/README.md               | 62
-rw-r--r--  scalding/scalding-background.md  | 90
-rw-r--r--  scalding/scalding-debugging.md   | 47
3 files changed, 162 insertions, 37 deletions
diff --git a/scalding/README.md b/scalding/README.md
index e41e9ec..7f87fe0 100644
--- a/scalding/README.md
+++ b/scalding/README.md
@@ -1,52 +1,42 @@
-following https://medium.com/@gayani.nan/how-to-run-a-scalding-job-567160fa193
+This directory contains Hadoop map/reduce jobs written in Scala (compiled to
+the JVM) using the Scalding framework.
+See the other markdown files in this directory for more background and tips.
-running on my laptop:
+## Building and Running
- openjdk version "1.8.0_171"
- OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~deb9u1-b11)
- OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
+Locally, you need the JVM (eg, OpenJDK 1.8) and the `sbt` build tool, and
+might need (exactly) Scala version 2.11.8.
- Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
+See the section below on building and installing a custom SpyGlass jar.
- sbt: 1.1.5
+Run tests:
- sbt new scala/scala-seed.g8
+ sbt test
- # inserted additional deps, tweaked versions
- # hadoop 2.5.0 seems to conflict with cascading; sticking with 2.6.0
+Build a jar and upload to a cluster machine (from which to run in production):
sbt assembly
- scp target/scala-2.11/scald-mvp-assembly-0.1.0-SNAPSHOT.jar devbox:
+ scp target/scala-2.11/scald-mvp-assembly-0.1.0-SNAPSHOT.jar devbox:
- # on cluster:
- yarn jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar WordCount --hdfs --input hdfs:///user/bnewbold/dummy.txt
+Run on cluster:
-later, using hadop command instead:
+ devbox$ touch thing.conf
+ devbox$ hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar \
+ com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs \
+ --app.conf.path thing.conf \
+ --output hdfs:///user/bnewbold/spyglass_out_test
- hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.WordCountJob --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out3
+## Building SpyGlass Jar
-helpful for debugging dependency woes:
+SpyGlass is a "scalding-to-HBase" connector. It isn't maintained upstream, so
+we needed to rebuild it to support our versions of HBase/scalding/etc. From
+the SpyGlass fork (<https://github.com/bnewbold/SpyGlass>,
+`bnewbold-scala2.11` branch):
- sbt dependencyTree
-
-testing the spyglass example program (expect a table error):
-
- hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.SimpleHBaseSourceExample --hdfs --output hdfs:///user/bnewbold/spyglass_out_test --app.conf.path thing.conf --debug true
- # org.apache.hadoop.hbase.TableNotFoundException: table_name
-
-running a spyglass job (gives a nullpointer exception):
-
- hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs --output hdfs:///user/bnewbold/spyglass_out_test --app.conf.path thing.conf
-
- # Caused by: java.lang.NullPointerException
- # at parallelai.spyglass.hbase.HBaseSource.<init>(HBaseSource.scala:48)
- # at sandcrawler.HBaseRowCountJob.<init>(HBaseRowCountJob.scala:17)
-
-## Custom build
-
-in SpyGlass repo:
+ cd ~/src/SpyGlass
+ git checkout bnewbold-scala2.11
# This builds the new .jar and installs it in the (laptop local) ~/.m2
# repository
@@ -56,8 +46,6 @@ in SpyGlass repo:
mkdir -p ~/.sbt/preloaded/parallelai/
cp -r ~/.m2/repository/parallelai/parallelai.spyglass ~/.sbt/preloaded/parallelai/
- # then build here
- sbt assembly
-
The medium-term plan here is to push the custom SpyGlass jar as a static maven
repo to an archive.org item, and point build.sbt to that folder.
+
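For that plan, a hypothetical `build.sbt` fragment pointing the resolver at a
static maven-layout folder (the item URL and the version string below are
placeholders, not real coordinates):

```scala
// Hypothetical sketch: resolve the custom SpyGlass jar from a static
// maven-layout repo hosted on an archive.org item. The item name and
// version string are placeholders, not real values.
resolvers += "spyglass-static" at "https://archive.org/download/EXAMPLE_ITEM/maven2"

libraryDependencies += "parallelai" % "parallelai.spyglass" % "VERSION"
```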
diff --git a/scalding/scalding-background.md b/scalding/scalding-background.md
new file mode 100644
index 0000000..f57022b
--- /dev/null
+++ b/scalding/scalding-background.md
@@ -0,0 +1,90 @@
+
+## Tips/Gotchas
+
+`.scala` file names should match internal classes.
+
+## Dev Environment
+
+Versions running on Bryan's Debian/Linux laptop:
+
+ openjdk version "1.8.0_171"
+ OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~deb9u1-b11)
+ OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
+
+ Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
+
+ sbt: 1.1.5
+
+Scala was installed via the regular Debian (stretch) `apt` repository; `sbt`
+via a bintray.com apt repo linked from the sbt website.
+
+## Creating a new project
+
+ sbt new scala/scala-seed.g8
+
+ # inserted additional deps, tweaked versions
+ # hadoop 2.5.0 seems to conflict with cascading; sticking with 2.6.0
+
+ sbt assembly
+ scp target/scala-2.11/scald-mvp-assembly-0.1.0-SNAPSHOT.jar devbox:
+
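The "additional deps, tweaked versions" step might look like the following
`build.sbt` fragment (the scalding version and exact artifact list are
assumptions; only the hadoop 2.6.0 pin comes from the note above):

```scala
// Illustrative dependency block. The scalding version and artifact list
// are assumptions; hadoop is pinned to 2.6.0 per the note above, since
// 2.5.0 seemed to conflict with cascading.
libraryDependencies ++= Seq(
  "com.twitter" %% "scalding-core" % "0.17.2",
  "org.apache.hadoop" % "hadoop-common" % "2.6.0" % "provided",
  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.6.0" % "provided"
)
```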
+## Invoking on IA Cluster (old)
+
+This seemed to work (from scalding repo):
+
+ yarn jar tutorial/execution-tutorial/target/scala-2.11/execution-tutorial-assembly-0.18.0-SNAPSHOT.jar Tutorial1 --hdfs --input test_cdx --output test_scalding_out1
+
+Or, with actual files on hadoop:
+
+ yarn jar tutorial/execution-tutorial/target/scala-2.11/execution-tutorial-assembly-0.18.0-SNAPSHOT.jar Tutorial1 --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
+
+Hooray! One issue with this was that building scalding took *forever* (meaning
+30+ minutes).
+
+Alternatively, using the `hadoop` command instead of `yarn`:
+
+ hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool main.scala.example.WordCountJob --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
+
+Hypothesis: the class name should match the file name. No `main` function is
+needed when using the Scalding `Tool` wrapper jar; `scald.rb` isn't needed either.
+
+ hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.WordCount --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
+
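The core logic of such a WordCount job, sketched with plain Scala collections
(Scalding's pipe operations, like `flatMap` and `groupBy`, are analogous; this
is not the actual job source):

```scala
// Plain-Scala sketch of the word-count logic a Scalding WordCountJob
// performs: tokenize lines, group by word, count occurrences per word.
object WordCountSketch {
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  def countWords(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(tokenize).groupBy(identity).mapValues(_.size).toMap
}
```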
+
+## Scalding Repo
+
+Got started by compiling and running examples (eg, Tutorial0) from the
+`twitter/scalding` upstream repo. That repo has some special magic: a `./sbt`
+wrapper script, and a `scripts/scald.rb` ruby script for invoking specific
+jobs. Neither ended up being necessary.
+
+Uncommenting this line in scalding's `build.sbt` sped things way up (no need
+to run *all* the tests before building the assembly):
+
+ // Uncomment if you don't want to run all the tests before building assembly
+ // test in assembly := {},
+
+Also got the following error (in a different context; the old Scala 2.9-era sbt can't parse Java 8 class files, hence the "bad constant pool tag"):
+
+ bnewbold@orithena$ sbt new typesafehub/scala-sbt
+ [info] Loading project definition from /home/bnewbold/src/scala-sbt.g8/project/project
+ [info] Compiling 1 Scala source to /home/bnewbold/src/scala-sbt.g8/project/project/target/scala-2.9.1/sbt-0.11.2/classes...
+ [error] error while loading CharSequence, class file '/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken
+ [error] (bad constant pool tag 18 at byte 10)
+ [error] one error found
+ [error] {file:/home/bnewbold/src/scala-sbt.g8/project/project/}default-46da7b/compile:compile: Compilation failed
+ Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?
+
+## Resources
+
+Whole bunch of example commands (sbt, maven, gradle) to build scalding:
+
+ https://medium.com/@gayani.nan/how-to-run-a-scalding-job-567160fa193
+
+Also looks good:
+
+ https://blog.matthewrathbone.com/2015/10/20/scalding-tutorial.html
+
+Possibly related:
+
+ http://sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
diff --git a/scalding/scalding-debugging.md b/scalding/scalding-debugging.md
new file mode 100644
index 0000000..2e29fce
--- /dev/null
+++ b/scalding/scalding-debugging.md
@@ -0,0 +1,47 @@
+
+Quick tips for debugging scalding issues...
+
+## Dependencies
+
+Print the dependency graph (using the `sbt-dependency-graph` plugin):
+
+ sbt dependencyTree
+
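The plugin is enabled in `project/plugins.sbt` with something like the
following (the version here is an assumption):

```scala
// project/plugins.sbt: enable sbt-dependency-graph, which provides the
// dependencyTree task. Version is an assumption; check the plugin's README.
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.9.0")
```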
+## Old Errors
+
+At one point, we were getting `NullPointerException` errors when running tests
+or in production, like:
+
+ bnewbold@bnewbold-dev$ hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs --output hdfs:///user/bnewbold/spyglass_out_test
+ Exception in thread "main" java.lang.Throwable: If you know what exactly caused this error, please consider contributing to GitHub via following link.
+ https://github.com/twitter/scalding/wiki/Common-Exceptions-and-possible-reasons#javalangnullpointerexception
+ at com.twitter.scalding.Tool$.main(Tool.scala:152)
+ at com.twitter.scalding.Tool.main(Tool.scala)
+ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+ at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:498)
+ at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
+ Caused by: java.lang.reflect.InvocationTargetException
+ at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
+ at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
+ at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
+ at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
+ at com.twitter.scalding.Job$.apply(Job.scala:44)
+ at com.twitter.scalding.Tool.getJob(Tool.scala:49)
+ at com.twitter.scalding.Tool.run(Tool.scala:68)
+ at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
+ at com.twitter.scalding.Tool$.main(Tool.scala:148)
+ ... 6 more
+ Caused by: java.lang.NullPointerException
+ at parallelai.spyglass.hbase.HBaseSource.<init>(HBaseSource.scala:48)
+ at sandcrawler.HBaseRowCountJob.<init>(HBaseRowCountJob.scala:14)
+ ... 15 more
+
+This was resolved by ensuring that all required parameters were being passed to
+the `HBaseSource` constructor.
+
+Another time, we saw a bunch of `None.get` errors when running tests. These
+were resolved by ensuring that the `HBaseSource` constructors had exactly
+identical names and arguments (eg, table names and zookeeper quorums have to
+be exact matches).