author Bryan Newbold <bnewbold@archive.org> 2018-08-08 12:14:16 -0700
committer Bryan Newbold <bnewbold@archive.org> 2018-08-08 12:14:21 -0700
commit 71be2e685848a31888811e2e398e769f7e0486c2
tree 58026a7a473a7e301db8ba2293970f4d294cd2a0 /scalding/src/main
parent c4db53036eac90841eb4f970b77db8c1677ef75b
row-count: require f:c, not file:size
I tried using the empty List() and got a test failure, so it seems like we do need to specify *some* field here. file:size gets populated by the extraction job, not the backfill job, so I had been miscounting table sizes (counting only the number of GROBID-extracted items, not rows in the table).
TODO: count on key or no column, not f:c
Diffstat (limited to 'scalding/src/main')
-rw-r--r-- scalding/src/main/scala/sandcrawler/HBaseRowCountJob.scala 2
1 file changed, 1 insertion, 1 deletion
diff --git a/scalding/src/main/scala/sandcrawler/HBaseRowCountJob.scala b/scalding/src/main/scala/sandcrawler/HBaseRowCountJob.scala
index 4c3de33..5c7954a 100644
--- a/scalding/src/main/scala/sandcrawler/HBaseRowCountJob.scala
+++ b/scalding/src/main/scala/sandcrawler/HBaseRowCountJob.scala
@@ -30,7 +30,7 @@ object HBaseRowCountJob {
     HBaseBuilder.build(
       hbaseTable,
       zookeeperHosts,
-      List("file:size"),
+      List("f:c"),
       SourceMode.SCAN_ALL)
   }
 }
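
The miscount described in the commit message can be illustrated with a small in-memory model. This is only a sketch, not the HBase or SpyGlass API: the `RowCountSketch` object, the sample row keys, and the `countRows` helper are all hypothetical, and it assumes that every row carries `f:c` while only rows touched by the extraction job carry `file:size`.

```scala
// Hypothetical in-memory model of an HBase table: row key -> (column -> value).
// This is NOT the HBase/SpyGlass API; it only illustrates the counting behaviour.
object RowCountSketch {
  type Row = Map[String, String]

  // Assumed example data: every row has f:c, but only rows processed
  // by the extraction job have file:size.
  val table: Map[String, Row] = Map(
    "sha1:AAA" -> Map("f:c" -> "{}", "file:size" -> "1024"),
    "sha1:BBB" -> Map("f:c" -> "{}"),
    "sha1:CCC" -> Map("f:c" -> "{}")
  )

  // Count rows that contain every required column, mimicking a scan that
  // only emits rows where the requested columns are present.
  def countRows(t: Map[String, Row], required: List[String]): Int =
    t.values.count(row => required.forall(row.contains))

  def main(args: Array[String]): Unit = {
    println(countRows(table, List("file:size"))) // only rows with extraction output
    println(countRows(table, List("f:c")))       // every row in the sample table
  }
}
```

With the sample data, requiring `file:size` counts 1 row and requiring `f:c` counts 3. An empty column list would count every row in this model; the real job failed its test with `List()`, which the sketch does not try to reproduce.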