author     Bryan Newbold <bnewbold@archive.org>  2018-05-23 12:54:08 -0700
committer  Bryan Newbold <bnewbold@archive.org>  2018-05-24 00:02:36 -0700
commit     84f8937002e99025701d7b4a75fff95f78aedb16 (patch)
tree       88a9651773ebfcef196a01a0cd6e67524b759f74
parent     74a3a8ea05824e7bc300f6889957b7903a78dee5 (diff)
download   sandcrawler-84f8937002e99025701d7b4a75fff95f78aedb16.tar.gz
           sandcrawler-84f8937002e99025701d7b4a75fff95f78aedb16.zip
cleanup scalding notes/README
-rw-r--r--  scalding/README.md              | 62
-rw-r--r--  scalding/scalding-background.md | 90
-rw-r--r--  scalding/scalding-debugging.md  | 47
3 files changed, 162 insertions, 37 deletions
diff --git a/scalding/README.md b/scalding/README.md
index e41e9ec..7f87fe0 100644
--- a/scalding/README.md
+++ b/scalding/README.md
@@ -1,52 +1,42 @@
-following https://medium.com/@gayani.nan/how-to-run-a-scalding-job-567160fa193
+This directory contains Hadoop map/reduce jobs written in Scala (compiled to
+the JVM) using the Scalding framework.
+
+See the other markdown files in this directory for more background and tips.
 
-running on my laptop:
+## Building and Running
 
-    openjdk version "1.8.0_171"
-    OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~deb9u1-b11)
-    OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
+Locally, you need to have the JVM (eg, OpenJDK 1.8), `sbt` build tool, and
+might need (exactly) Scala version 2.11.8.
 
-    Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
+See section below on building and installing custom SpyGlass jar.
 
-    sbt: 1.1.5
+Run tests:
 
-    sbt new scala/scala-seed.g8
+    sbt test
 
-    # inserted additional deps, tweaked versions
-    # hadoop 2.5.0 seems to conflict with cascading; sticking with 2.6.0
+Build a jar and upload to a cluster machine (from which to run in production):
 
     sbt assembly
-    scp target/scala-2.11/scald-mvp-assembly-0.1.0-SNAPSHOT.jar devbox:
+    scp scp target/scala-2.11/scald-mvp-assembly-0.1.0-SNAPSHOT.jar devbox:
 
-    # on cluster:
-    yarn jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar WordCount --hdfs --input hdfs:///user/bnewbold/dummy.txt
+Run on cluster:
 
-later, using hadop command instead:
+    devbox$ touch thing.conf
+    devbox$ hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar \
+        com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs \
+        --app.conf.path thing.conf \
+        --output hdfs:///user/bnewbold/spyglass_out_test
 
-    hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.WordCountJob --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out3
+## Building SpyGlass Jar
 
-helpful for debugging dependency woes:
+SpyGlass is a "scalding-to-HBase" connector. It isn't maintained, so we needed
+to rebuild to support our versions of HBase/scalding/etc. From SpyGlass fork
+(<https://github.com/bnewbold/SpyGlass>, `bnewbold-scala2.11` branch):
 
-    sbt dependencyTree
-
-testing the spyglass example program (expect a table error):
-
-    hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.SimpleHBaseSourceExample --hdfs --output hdfs:///user/bnewbold/spyglass_out_test --app.conf.path thing.conf --debug true
-    # org.apache.hadoop.hbase.TableNotFoundException: table_name
-
-running a spyglass job (gives a nullpointer exception):
-
-    hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs --output hdfs:///user/bnewbold/spyglass_out_test --app.conf.path thing.conf
-
-    # Caused by: java.lang.NullPointerException
-    #   at parallelai.spyglass.hbase.HBaseSource.<init>(HBaseSource.scala:48)
-    #   at sandcrawler.HBaseRowCountJob.<init>(HBaseRowCountJob.scala:17)
-
-## Custom build
-
-in SpyGlass repo:
+    cd ~/src/SpyGlass
+    git checkout bnewbold-scala2.11
 
     # This builds the new .jar and installs it in the (laptop local) ~/.m2
     # repository
@@ -56,8 +46,6 @@ in SpyGlass repo:
     mkdir -p ~/.sbt/preloaded/parallelai/
     cp -r ~/.m2/repository/parallelai/parallelai.spyglass ~/.sbt/preloaded/parallelai/
 
-    # then build here
-    sbt assembly
-
 The medium-term plan here is to push the custom SpyGlass jar as a static maven
 repo to an archive.org item, and point build.sbt to that folder.
+
diff --git a/scalding/scalding-background.md b/scalding/scalding-background.md
new file mode 100644
index 0000000..f57022b
--- /dev/null
+++ b/scalding/scalding-background.md
@@ -0,0 +1,90 @@
+
+## Tips/Gotchas
+
+`.scala` file names should match internal classes.
+
+## Dev Environment
+
+Versions running on Bryan's Debian/Linux laptop:
+
+    openjdk version "1.8.0_171"
+    OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~deb9u1-b11)
+    OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
+
+    Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
+
+    sbt: 1.1.5
+
+Scala was installed via regular debian (stretch) `apt` repository; `sbt` using
+a bintray.com apt repo linked from the sbt website.
+
+## Creating a new project
+
+    sbt new scala/scala-seed.g8
+
+    # inserted additional deps, tweaked versions
+    # hadoop 2.5.0 seems to conflict with cascading; sticking with 2.6.0
+
+    sbt assembly
+    scp target/scala-2.11/scald-mvp-assembly-0.1.0-SNAPSHOT.jar devbox:
+
+## Invoking on IA Cluster (old)
+
+This seemed to work (from scalding repo):
+
+    yarn jar tutorial/execution-tutorial/target/scala-2.11/execution-tutorial-assembly-0.18.0-SNAPSHOT.jar Tutorial1 --hdfs --input test_cdx --output test_scalding_out1
+
+Or, with actual files on hadoop:
+
+    yarn jar tutorial/execution-tutorial/target/scala-2.11/execution-tutorial-assembly-0.18.0-SNAPSHOT.jar Tutorial1 --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
+
+Horray! One issue with this was that building scalding took *forever* (meaning
+30+ minutes).
+
+potentially instead:
+
+    hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool main.scala.example.WordCountJob --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
+
+Hypothesis: class name should be same as file name. Don't need `main` function
+if using Scalding Tool wrapper jar. Don't need scald.rb.
+
+    hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.WordCount --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
+
+## Scalding Repo
+
+Got started by compiling and running examples (eg, Tutorial0) from the
+`twitter/scalding` upstream repo. That repo has some special magic: a `./sbt`
+wrapper script, and a `scripts/scald.rb` ruby script for invoking specific
+jobs. Didn't end up being necessary.
+
+Uncommenting this line in scalding:build.sbt sped things way up (don't need to
+run *all* the tests):
+
+    // Uncomment if you don't want to run all the tests before building assembly
+    // test in assembly := {},
+
+Also get the following error (in a different context):
+
+    bnewbold@orithena$ sbt new typesafehub/scala-sbt
+    [info] Loading project definition from /home/bnewbold/src/scala-sbt.g8/project/project
+    [info] Compiling 1 Scala source to /home/bnewbold/src/scala-sbt.g8/project/project/target/scala-2.9.1/sbt-0.11.2/classes...
+    [error] error while loading CharSequence, class file '/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken
+    [error] (bad constant pool tag 18 at byte 10)
+    [error] one error found
+    [error] {file:/home/bnewbold/src/scala-sbt.g8/project/project/}default-46da7b/compile:compile: Compilation failed
+    Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?
+
+## Resources
+
+Whole bunch of example commands (sbt, maven, gradle) to build scalding:
+
+    https://medium.com/@gayani.nan/how-to-run-a-scalding-job-567160fa193
+
+Also looks good:
+
+    https://blog.matthewrathbone.com/2015/10/20/scalding-tutorial.html
+
+Possibly related:
+
+    http://sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
diff --git a/scalding/scalding-debugging.md b/scalding/scalding-debugging.md
new file mode 100644
index 0000000..2e29fce
--- /dev/null
+++ b/scalding/scalding-debugging.md
@@ -0,0 +1,47 @@
+
+Quick tips for debugging scalding issues...
+
+## Dependencies
+
+Print the dependency graph (using the `sbt-dependency-graph` plugin):
+
+    sbt dependencyTree
+
+## Old Errors
+
+At one phase, was getting `NullPointerException` errors when running tests or
+in production, like:
+
+    bnewbold@bnewbold-dev$ hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs --output hdfs:///user/bnewbold/spyglass_out_test
+    Exception in thread "main" java.lang.Throwable: If you know what exactly caused this error, please consider contributing to GitHub via following link.
+    https://github.com/twitter/scalding/wiki/Common-Exceptions-and-possible-reasons#javalangnullpointerexception
+        at com.twitter.scalding.Tool$.main(Tool.scala:152)
+        at com.twitter.scalding.Tool.main(Tool.scala)
+        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
+        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+        at java.lang.reflect.Method.invoke(Method.java:498)
+        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
+    Caused by: java.lang.reflect.InvocationTargetException
+        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
+        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
+        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
+        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
+        at com.twitter.scalding.Job$.apply(Job.scala:44)
+        at com.twitter.scalding.Tool.getJob(Tool.scala:49)
+        at com.twitter.scalding.Tool.run(Tool.scala:68)
+        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
+        at com.twitter.scalding.Tool$.main(Tool.scala:148)
+        ... 6 more
+    Caused by: java.lang.NullPointerException
+        at parallelai.spyglass.hbase.HBaseSource.<init>(HBaseSource.scala:48)
+        at sandcrawler.HBaseRowCountJob.<init>(HBaseRowCountJob.scala:14)
+        ... 15 more
+
+This was resolved by ensuring that all required parameters were being passed to
+the `HBaseSource` constructor.
+
+Another time, saw a bunch of `None.get` errors when running tests. These were
+resolved by ensuring that the `HBaseSource` constructors had exactly identical
+names and arguments (eg, table names and zookeeper quorums have to be exact
+matches).
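The "class name should be same as file name / don't need `main`" hypothesis in the new scalding-background.md can be illustrated with a minimal job. This is a sketch (not part of the commit); the `example.WordCountJob` class and field names are illustrative:

```scala
// WordCountJob.scala -- sketch of a minimal Scalding job (fields API).
// The class name matches the file name, and no main() is required: the
// com.twitter.scalding.Tool wrapper instantiates the Job reflectively,
// which is why the invocations above pass the class name as an argument.
package example

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                     // path from the --input flag
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }                // count occurrences per word
    .write(Tsv(args("output")))               // path from the --output flag
}
```

This would be invoked like the commands above, e.g. `hadoop jar ... com.twitter.scalding.Tool example.WordCountJob --hdfs --input ... --output ...`.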