Quick tips for debugging Scalding issues.

## Dependencies

Print the dependency graph (using the `sbt-dependency-graph` plugin):

    sbt dependencyTree
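
For reference, a minimal sketch of enabling the plugin, assuming an sbt 1.x build that loads plugins from `project/plugins.sbt`. The version number is an assumption and may not match what this repository pins:

```scala
// project/plugins.sbt
// Assumed version; pick the release that matches your sbt version.
addSbtPlugin("net.virtualvoid" % "sbt-dependency-graph" % "0.9.2")
```

On sbt 1.4 and later the `dependencyTree` task ships with sbt itself, so the separate plugin line may be unnecessary there.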

## Old Errors

At one point, `NullPointerException` errors showed up when running tests or in
production, like:

    bnewbold@bnewbold-dev$ hadoop jar scald-mvp-assembly-0.1.0-SNAPSHOT.jar com.twitter.scalding.Tool sandcrawler.HBaseRowCountJob --hdfs --output hdfs:///user/bnewbold/spyglass_out_test
    Exception in thread "main" java.lang.Throwable: If you know what exactly caused this error, please consider contributing to GitHub via following link.
    https://github.com/twitter/scalding/wiki/Common-Exceptions-and-possible-reasons#javalangnullpointerexception
            at com.twitter.scalding.Tool$.main(Tool.scala:152)
            at com.twitter.scalding.Tool.main(Tool.scala)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
    Caused by: java.lang.reflect.InvocationTargetException
            at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
            at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
            at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
            at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
            at com.twitter.scalding.Job$.apply(Job.scala:44)
            at com.twitter.scalding.Tool.getJob(Tool.scala:49)
            at com.twitter.scalding.Tool.run(Tool.scala:68)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
            at com.twitter.scalding.Tool$.main(Tool.scala:148)
            ... 6 more
    Caused by: java.lang.NullPointerException
            at parallelai.spyglass.hbase.HBaseSource.<init>(HBaseSource.scala:48)
            at sandcrawler.HBaseRowCountJob.<init>(HBaseRowCountJob.scala:14)
            ... 15 more

This was resolved by ensuring that all required parameters were being passed to
the `HBaseSource` constructor.

Another time, a bunch of `None.get` errors showed up when running tests. These
were resolved by ensuring that the `HBaseSource` constructors (in the job and
in the test) had exactly identical names and arguments (e.g., table names and
ZooKeeper quorums have to be exact matches).

If you get:

    value toTypedPipe is not a member of cascading.pipe.Pipe

You probably need to [import the implicit conversions][tdsl] that add typed-API methods to `Pipe`:

    import com.twitter.scalding.typed.TDsl._

[tdsl]: https://github.com/twitter/scalding/wiki/Type-safe-api-reference#interoperating-between-fields-api-and-type-safe-api
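
A minimal sketch of where the import matters, assuming a Fields-API job that switches to the typed API midway. The job name, field names, and the `--input`/`--output` arguments are hypothetical and not taken from sandcrawler:

```scala
import cascading.pipe.Pipe
import com.twitter.scalding._
import com.twitter.scalding.typed.TDsl._

// Hypothetical job, for illustration only.
class DoubleCountsJob(args: Args) extends Job(args) {
  // Fields API: reading a source yields an untyped cascading Pipe.
  val raw: Pipe = Tsv(args("input"), ('name, 'count)).read

  // With TDsl._ in scope, the Pipe gains toTypedPipe; without the import,
  // this line fails with "value toTypedPipe is not a member of ...Pipe".
  raw
    .toTypedPipe[(String, Int)](('name, 'count))
    .map { case (name, count) => (name, count * 2) }
    .write(TypedTsv[(String, Int)](args("output")))
}
```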

## Running Individual Tests

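From the sbt shell, you can run only the tests whose names match a glob pattern:

    sbt:sandcrawler> testOnly *CdxBackfill*

From a regular shell, quote the whole command so the shell does not expand the
glob: `sbt "testOnly *CdxBackfill*"`.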

## Fields

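Values of type `List[Fields]` print misleadingly: each `Fields` renders as a
bare comma-separated list of names, so the boundaries between list elements
are not visible. Here a two-element list looks like it has three elements: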

    $ scala -cp ~/.m2/repository/cascading/cascading-core/2.6.1/cascading-core-2.6.1.jar
    Welcome to Scala 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_31).
    Type in expressions for evaluation. Or try :help.

    scala> import cascading.tuple.Fields
    import cascading.tuple.Fields

    scala> val fields1 = new Fields("a", "b")
    fields1: cascading.tuple.Fields = 'a', 'b'

    scala> val fields2 = new Fields("c")
    fields2: cascading.tuple.Fields = 'c'

    scala> val allFields = List(fields1, fields2)
    allFields: List[cascading.tuple.Fields] = List('a', 'b', 'c')

    scala> allFields.length
    res0: Int = 2
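
One way to make the element boundaries visible is to bracket each `Fields` value before printing. This is a small sketch for the same REPL session (cascading-core on the classpath); the name `rendered` is just illustrative:

```scala
import cascading.tuple.Fields

val fields1 = new Fields("a", "b")
val fields2 = new Fields("c")
val allFields = List(fields1, fields2)

// Wrap each Fields value in brackets so it is clear where one ends
// and the next begins.
val rendered = allFields.map(f => s"[$f]").mkString("List(", ", ", ")")
println(rendered)
// Given the toString output shown above: List(['a', 'b'], ['c'])
```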

## SpyGlass Column Selection

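Two equivalent ways to specify column families and their columns. Each family
name is paired with the `Fields` at the same position, and a family may be
repeated with its columns split across entries: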

    List("f", "file"),
    List(new Fields("c"), new Fields("size", "mimetype")),

    List("f", "file", "file"),
    List(new Fields("c"), new Fields("size"), new Fields("mimetype")),
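
To see why the two forms line up, the sketch below flattens each into `(family, column)` pairs and compares them. The `expand` helper is hypothetical and not part of SpyGlass; it only illustrates the positional pairing and does not call `HBaseSource`:

```scala
import cascading.tuple.Fields

// Hypothetical helper (not part of SpyGlass): pair each column family with
// every column name in the Fields value at the same list position.
def expand(families: List[String], columns: List[Fields]): List[(String, String)] =
  families.zip(columns).flatMap { case (family, fields) =>
    (0 until fields.size).map(i => (family, fields.get(i).toString))
  }

val grouped = expand(
  List("f", "file"),
  List(new Fields("c"), new Fields("size", "mimetype")))

val split = expand(
  List("f", "file", "file"),
  List(new Fields("c"), new Fields("size"), new Fields("mimetype")))

// Both forms name the same (family, column) pairs, in the same order.
assert(grouped == split)
```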