author | Bryan Newbold <bnewbold@archive.org> | 2018-08-08 12:14:16 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2018-08-08 12:14:21 -0700 |
commit | 71be2e685848a31888811e2e398e769f7e0486c2 (patch) | |
tree | 58026a7a473a7e301db8ba2293970f4d294cd2a0 /scalding | |
parent | c4db53036eac90841eb4f970b77db8c1677ef75b (diff) | |
row-count: require f:c, not file:size
I tried using the empty List() and got a test failure, so it seems like
we do need to specify *some* field here.
file:size gets populated by the extraction job, not the backfill job, so
I had been miscounting table sizes (counting only the number of GROBID
extracted items, not rows in the table).
TODO: count on key or no column, not f:c
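
A minimal sketch of why the column choice changes the count, assuming a column-restricted scan only returns rows that have a cell in one of the requested columns. The in-memory `Table` type, the `countRows` helper, and the sample row keys and values are hypothetical stand-ins for illustration, not the actual HBaseBuilder or HBase API, and the sketch does not model the empty `List()` case that failed in the tests.

```scala
object RowCountSketch {
  // Hypothetical in-memory stand-in for an HBase table:
  // row key -> (column name -> cell value).
  type Table = Map[String, Map[String, String]]

  // Mimics a scan restricted to `columns`: a row is counted only if it has
  // a cell in at least one of the requested columns.
  def countRows(table: Table, columns: List[String]): Int =
    table.count { case (_, cells) => columns.exists(cells.contains) }

  // Every row has f:c (written by the backfill job in this sketch), but only
  // one row has file:size (written by the extraction job).
  val sample: Table = Map(
    "row-a" -> Map("f:c" -> "backfilled", "file:size" -> "1024"),
    "row-b" -> Map("f:c" -> "backfilled"),
    "row-c" -> Map("f:c" -> "backfilled")
  )

  def main(args: Array[String]): Unit = {
    println(countRows(sample, List("file:size"))) // 1: only extracted rows
    println(countRows(sample, List("f:c")))       // 3: every row in the table
  }
}
```

Counting on `f:c` matches the table size only as long as every row carries that column, which is why the TODO above still prefers counting on the key alone.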
Diffstat (limited to 'scalding')
-rw-r--r-- | scalding/src/main/scala/sandcrawler/HBaseRowCountJob.scala | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/scalding/src/main/scala/sandcrawler/HBaseRowCountJob.scala b/scalding/src/main/scala/sandcrawler/HBaseRowCountJob.scala
index 4c3de33..5c7954a 100644
--- a/scalding/src/main/scala/sandcrawler/HBaseRowCountJob.scala
+++ b/scalding/src/main/scala/sandcrawler/HBaseRowCountJob.scala
@@ -30,7 +30,7 @@ object HBaseRowCountJob {
     HBaseBuilder.build(
       hbaseTable,
       zookeeperHosts,
-      List("file:size"),
+      List("f:c"),
       SourceMode.SCAN_ALL)
   }
 }