Diffstat (limited to 'scalding/README.md')
-rw-r--r-- scalding/README.md | 63
1 files changed, 63 insertions, 0 deletions
diff --git a/scalding/README.md b/scalding/README.md
new file mode 100644
index 0000000..e41e9ec
--- /dev/null
+++ b/scalding/README.md
@@ -0,0 +1,63 @@
+
+following https://medium.com/@gayani.nan/how-to-run-a-scalding-job-567160fa193
+
+
+running on my laptop:
+
+ openjdk version "1.8.0_171"
+ OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~deb9u1-b11)
+ OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
+
+ Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
+
+ sbt: 1.1.5
+
+ sbt new scala/scala-seed.g8
+
+ # inserted additional deps, tweaked versions
+ # hadoop 2.5.0 seems to conflict with cascading; sticking with 2.6.0
+
+ sbt assembly
+ scp target/scala-2.11/scald-mvp-assembly-0.1.0-SNAPSHOT.jar devbox:
+
+ # on cluster:
+ yarn jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar WordCount --hdfs --input hdfs:///user/bnewbold/dummy.txt
+
+later, using the hadoop command instead:
+
+ hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.WordCountJob --hdfs --input hdfs:///user/bnewbold/dummy.txt --output hdfs:///user/bnewbold/test_scalding_out3
+
+helpful for debugging dependency woes:
+
+ sbt dependencyTree
+
+testing the SpyGlass example program (expect a TableNotFoundException):
+
+ hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool example.SimpleHBaseSourceExample --hdfs --output hdfs:///user/bnewbold/spyglass_out_test --app.conf.path thing.conf --debug true
+ # org.apache.hadoop.hbase.TableNotFoundException: table_name
+
+running a SpyGlass job (gives a NullPointerException):
+
+ hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs --output hdfs:///user/bnewbold/spyglass_out_test --app.conf.path thing.conf
+
+ # Caused by: java.lang.NullPointerException
+ # at parallelai.spyglass.hbase.HBaseSource.<init>(HBaseSource.scala:48)
+ # at sandcrawler.HBaseRowCountJob.<init>(HBaseRowCountJob.scala:17)
+
+## Custom build
+
+in SpyGlass repo:
+
+ # This builds the new .jar and installs it in the (laptop local) ~/.m2
+ # repository
+ mvn clean install -U
+
+ # Copy that .jar (and associated pom.xml) over to where sbt can find it
+ mkdir -p ~/.sbt/preloaded/parallelai/
+ cp -r ~/.m2/repository/parallelai/parallelai.spyglass ~/.sbt/preloaded/parallelai/
+
+ # then build here
+ sbt assembly
+
+The medium-term plan here is to publish the custom SpyGlass jar as a static
+Maven repository in an archive.org item, and point build.sbt at that location.
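
A minimal sketch of what the build.sbt side of that plan might look like, assuming the custom jar has been uploaded in standard Maven repository layout. The resolver name, the archive.org item URL, and the version string are hypothetical placeholders; only the `parallelai` / `parallelai.spyglass` coordinates are taken from the ~/.m2 path in the README.

```scala
// build.sbt fragment (sketch only; the URL and version are made-up placeholders)

// Hypothetical static Maven repo served as plain files from an archive.org item
resolvers += "spyglass-custom" at "https://archive.org/download/EXAMPLE-ITEM/maven2/"

// groupId and artifactId mirror ~/.m2/repository/parallelai/parallelai.spyglass;
// replace the version with whatever `mvn clean install -U` actually produced.
// Single % because the artifact name carries no Scala version suffix.
libraryDependencies += "parallelai" % "parallelai.spyglass" % "CUSTOM-VERSION"
```

If a resolver along these lines works, the manual copy into ~/.sbt/preloaded would presumably no longer be needed, since sbt could fetch the jar and its pom directly.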