author | Bryan Newbold <bnewbold@archive.org> | 2018-05-23 12:54:08 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2018-05-24 00:02:36 -0700 |
commit | 84f8937002e99025701d7b4a75fff95f78aedb16 (patch) | |
tree | 88a9651773ebfcef196a01a0cd6e67524b759f74 /scalding | |
parent | 74a3a8ea05824e7bc300f6889957b7903a78dee5 (diff) | |
cleanup scalding notes/README
Diffstat (limited to 'scalding')
-rw-r--r-- | scalding/README.md | 62 |
-rw-r--r-- | scalding/scalding-background.md | 90 |
-rw-r--r-- | scalding/scalding-debugging.md | 47 |
3 files changed, 162 insertions, 37 deletions
diff --git a/scalding/README.md b/scalding/README.md
index e41e9ec..7f87fe0 100644
--- a/scalding/README.md
+++ b/scalding/README.md
@@ -1,52 +1,42 @@
-following https://medium.com/@gayani.nan/how-to-run-a-scalding-job-567160fa193
+This directory contains Hadoop map/reduce jobs written in Scala (compiled to
+the JVM) using the Scalding framework.
+See the other markdown files in this directory for more background and tips.

-running on my laptop:
+## Building and Running

-    openjdk version "1.8.0_171"
-    OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~deb9u1-b11)
-    OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
+Locally, you need to have the JVM (eg, OpenJDK 1.8), `sbt` build tool, and
+might need (exactly) Scala version 2.11.8.

-    Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
+See section below on building and installing custom SpyGlass jar.

-    sbt: 1.1.5
+Run tests:

-    sbt new scala/scala-seed.g8
+    sbt test

-    # inserted additional deps, tweaked versions
-    # hadoop 2.5.0 seems to conflict with cascading; sticking with 2.6.0
+Build a jar and upload to a cluster machine (from which to run in production):

     sbt assembly
-    scp target/scala-2.11/scald-mvp-assembly-0.1.0-SNAPSHOT.jar devbox:
+    scp scp target/scala-2.11/scald-mvp-assembly-0.1.0-SNAPSHOT.jar devbox:

-    # on cluster:
-    yarn jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar WordCount --hdfs --input hdfs:///user/bnewbold/dummy.txt
+Run on cluster:

-later, using hadop command instead:
+    devbox$ touch thing.conf
+    devbox$ hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar \
+        com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs \
+        --app.conf.path thing.conf \
+        --output hdfs:///user/bnewbold/spyglass_out_test

-    hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.WordCountJob --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out3
+## Building SpyGlass Jar

-helpful for debugging dependency woes:
+SpyGlass is a "scalding-to-HBase" connector. It isn't maintained, so we needed
+to rebuild to support our versions of HBase/scalding/etc. From SpyGlass fork
+(<https://github.com/bnewbold/SpyGlass>,
+`bnewbold-scala2.11` branch):

-    sbt dependencyTree
-
-testing the spyglass example program (expect a table error):
-
-    hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.SimpleHBaseSourceExample --hdfs --output hdfs:///user/bnewbold/spyglass_out_test --app.conf.path thing.conf --debug true
-    # org.apache.hadoop.hbase.TableNotFoundException: table_name
-
-running a spyglass job (gives a nullpointer exception):
-
-    hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs --output hdfs:///user/bnewbold/spyglass_out_test --app.conf.path thing.conf
-
-    # Caused by: java.lang.NullPointerException
-    # at parallelai.spyglass.hbase.HBaseSource.<init>(HBaseSource.scala:48)
-    # at sandcrawler.HBaseRowCountJob.<init>(HBaseRowCountJob.scala:17)
-
-## Custom build
-
-in SpyGlass repo:
+    cd ~/src/SpyGlass
+    git checkout bnewbold-scala2.11

     # This builds the new .jar and installs it in the (laptop local) ~/.m2
     # repository
@@ -56,8 +46,6 @@ in SpyGlass repo:
     mkdir -p ~/.sbt/preloaded/parallelai/
     cp -r ~/.m2/repository/parallelai/parallelai.spyglass ~/.sbt/preloaded/parallelai/

-    # then build here
-    sbt assembly
-
 The medium-term plan here is to push the custom SpyGlass jar as a static maven
 repo to an archive.org item, and point build.sbt to that folder.
+
diff --git a/scalding/scalding-background.md b/scalding/scalding-background.md
new file mode 100644
index 0000000..f57022b
--- /dev/null
+++ b/scalding/scalding-background.md
@@ -0,0 +1,90 @@
+
+## Tips/Gotchas
+
+`.scala` file names should match internal classes.
+
+## Dev Environment
+
+Versions running on Bryan's Debian/Linux laptop:
+
+    openjdk version "1.8.0_171"
+    OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~deb9u1-b11)
+    OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
+
+    Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
+
+    sbt: 1.1.5
+
+Scala was installed via regular debian (stretch) `apt` repository; `sbt` using
+a bintray.com apt repo linked from the sbt website.
+
+## Creating a new project
+
+    sbt new scala/scala-seed.g8
+
+    # inserted additional deps, tweaked versions
+    # hadoop 2.5.0 seems to conflict with cascading; sticking with 2.6.0
+
+    sbt assembly
+    scp target/scala-2.11/scald-mvp-assembly-0.1.0-SNAPSHOT.jar devbox:
+
+## Invoking on IA Cluster (old)
+
+This seemed to work (from scalding repo):
+
+    yarn jar tutorial/execution-tutorial/target/scala-2.11/execution-tutorial-assembly-0.18.0-SNAPSHOT.jar Tutorial1 --hdfs --input test_cdx --output test_scalding_out1
+
+Or, with actual files on hadoop:
+
+    yarn jar tutorial/execution-tutorial/target/scala-2.11/execution-tutorial-assembly-0.18.0-SNAPSHOT.jar Tutorial1 --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
+
+Horray! One issue with this was that building scalding took *forever* (meaning
+30+ minutes).
+
+potentially instead:
+
+    hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool main.scala.example.WordCountJob --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
+
+Hypothesis: class name should be same as file name. Don't need `main` function
+if using Scalding Tool wrapper jar. Don't need scald.rb.
+
+    hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.WordCount --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out2
+
+
+## Scalding Repo
+
+Got started by compiling and running examples (eg, Tutorial0) from the
+`twitter/scalding` upstream repo. That repo has some special magic: a `./sbt`
+wrapper script, and a `scripts/scald.rb` ruby script for invoking specific
+jobs. Didn't end up being necessary.
+
+Uncommenting this line in scalding:build.sbt sped things way up (don't need to
+run *all* the tests):
+
+    // Uncomment if you don't want to run all the tests before building assembly
+    // test in assembly := {},
+
+Also get the following error (in a different context):
+
+    bnewbold@orithena$ sbt new typesafehub/scala-sbt
+    [info] Loading project definition from /home/bnewbold/src/scala-sbt.g8/project/project
+    [info] Compiling 1 Scala source to /home/bnewbold/src/scala-sbt.g8/project/project/target/scala-2.9.1/sbt-0.11.2/classes...
+    [error] error while loading CharSequence, class file '/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken
+    [error] (bad constant pool tag 18 at byte 10)
+    [error] one error found
+    [error] {file:/home/bnewbold/src/scala-sbt.g8/project/project/}default-46da7b/compile:compile: Compilation failed
+    Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?
+
+## Resources
+
+Whole bunch of example commands (sbt, maven, gradle) to build scalding:
+
+    https://medium.com/@gayani.nan/how-to-run-a-scalding-job-567160fa193
+
+Also looks good:
+
+    https://blog.matthewrathbone.com/2015/10/20/scalding-tutorial.html
+
+Possibly related:
+
+    http://sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
diff --git a/scalding/scalding-debugging.md b/scalding/scalding-debugging.md
new file mode 100644
index 0000000..2e29fce
--- /dev/null
+++ b/scalding/scalding-debugging.md
@@ -0,0 +1,47 @@
+
+Quick tips for debugging scalding issues...
+
+## Dependencies
+
+Print the dependency graph (using the `sbt-dependency-graph` plugin):
+
+    sbt dependencyTree
+
+## Old Errors
+
+At one phase, was getting `NullPointerException` errors when running tests or
+in production, like:
+
+    bnewbold@bnewbold-dev$ hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs --output hdfs:///user/bnewbold/spyglass_out_test
+    Exception in thread "main" java.lang.Throwable: If you know what exactly caused this error, please consider contributing to GitHub via following link.
+    https://github.com/twitter/scalding/wiki/Common-Exceptions-and-possible-reasons#javalangnullpointerexception
+        at com.twitter.scalding.Tool$.main(Tool.scala:152)
+        at com.twitter.scalding.Tool.main(Tool.scala)
+        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
+        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+        at java.lang.reflect.Method.invoke(Method.java:498)
+        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
+    Caused by: java.lang.reflect.InvocationTargetException
+        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
+        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
+        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
+        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
+        at com.twitter.scalding.Job$.apply(Job.scala:44)
+        at com.twitter.scalding.Tool.getJob(Tool.scala:49)
+        at com.twitter.scalding.Tool.run(Tool.scala:68)
+        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
+        at com.twitter.scalding.Tool$.main(Tool.scala:148)
+        ... 6 more
+    Caused by: java.lang.NullPointerException
+        at parallelai.spyglass.hbase.HBaseSource.<init>(HBaseSource.scala:48)
+        at sandcrawler.HBaseRowCountJob.<init>(HBaseRowCountJob.scala:14)
+        ... 15 more
+
+This was resolved by ensuring that all required parameters were being passed to
+the `HBaseSource` constructor.
+
+Another time, saw a bunch of `None.get` errors when running tests. These were
+resolved by ensuring that the `HBaseSource` constructors had exactly identical
+names and arguments (eg, table names and zookeeper quorums have to be exact
+matches).
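
Note on the `None.get` tip in `scalding-debugging.md`: the failure mode can be reproduced in plain Scala, without Scalding or SpyGlass. The sketch below assumes (a guess about the mechanism, not something the notes confirm) that the test harness finds mock data by comparing source objects for equality. `FakeSource`, the table name, and the quorum hosts are all hypothetical and are not part of the SpyGlass API.

```scala
import scala.util.Try

// Hypothetical stand-in for a source description; NOT the SpyGlass HBaseSource API.
final case class FakeSource(table: String, zkQuorum: String)

object NoneGetDemo {
  def main(args: Array[String]): Unit = {
    // What a test might register as mock input (table and host names are made up).
    val registered = FakeSource("test-table", "zk1.example:2181")
    val mocks: Map[FakeSource, List[String]] =
      Map(registered -> List("row1", "row2"))

    // What the job constructs: once with identical arguments, once with a different quorum.
    val sameArgs = FakeSource("test-table", "zk1.example:2181")
    val otherQuorum = FakeSource("test-table", "zk2.example:2181")

    println(mocks.get(sameArgs).map(_.size))            // Some(2)
    println(mocks.get(otherQuorum))                     // None
    // Calling .get on the missing entry throws NoSuchElementException("None.get").
    println(Try(mocks.get(otherQuorum).get).isFailure)  // true
  }
}
```

Case classes compare field by field, so any difference in a table name or quorum string makes the lookup miss; that would be consistent with the fix described in the notes (making the constructor arguments exact matches).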