From 1ae7fd2f0c5661560b15be86614c2c4d41b21205 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Fri, 24 Aug 2018 13:39:02 -0700
Subject: commit notes from my laptop

---
 notes/backfill_scalding_rewrite.txt | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 notes/backfill_scalding_rewrite.txt

(limited to 'notes/backfill_scalding_rewrite.txt')

diff --git a/notes/backfill_scalding_rewrite.txt b/notes/backfill_scalding_rewrite.txt
new file mode 100644
index 0000000..f5fb1d1
--- /dev/null
+++ b/notes/backfill_scalding_rewrite.txt
@@ -0,0 +1,22 @@
+
+Background context needed:
+- CDX text file format
+- rough arch outline (what runs where)
+- basic hadoop+hbase overview
+- hbase schema
+- quick look at hadoop and hbase web interfaces
+- maybe quick re-profile?
+
+Plan/Steps:
+x together: get *any* JVM map/reduce thing to build and run on cluster
+x together: get something to build that talks to hbase
+x basic JVM test infra; HBase mockup. "shopping"
+    => scalding and/or cascading
+x simple hbase scan report generation (counts/stats)
+x CDX parsing
+- complete backfill script
+
+Spec for CDX backfill script:
+- input is CDX, output to HBase table
+- filter input before anything ("defensive"; only PDF, HTTP 200, size limit)
+- reads HBase before insert; don't overwrite
-- 
cgit v1.2.3
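
Rough sketch of the backfill spec above, in Python for readability (the notes point at scalding/cascading for the real job). Assumptions, none of which come from the notes: CDX lines use the common 11-column space-separated layout; the size limit value; the "sha1:" row key; and a plain dict standing in for the HBase table. The names parse_cdx_line, keep, and backfill are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple, Optional

# Hypothetical threshold; the notes only say "size limit".
MAX_SIZE_BYTES = 256 * 1024 * 1024


class CdxRecord(NamedTuple):
    surt: str
    timestamp: str
    url: str
    mimetype: str
    status: str
    sha1: str
    size: int
    warc: str


def parse_cdx_line(line: str) -> Optional[CdxRecord]:
    """Parse one CDX line; return None for anything malformed.

    Assumes 11 whitespace-separated columns: urlkey, timestamp, url,
    mimetype, status, digest, redirect, meta, size, offset, filename.
    """
    fields = line.split()
    if len(fields) != 11:
        return None
    surt, ts, url, mime, status, sha1, _redir, _meta, size, _offset, warc = fields
    if not size.isdigit():  # size may be "-" or junk; drop such rows
        return None
    return CdxRecord(surt, ts, url, mime, status, sha1, int(size), warc)


def keep(rec: CdxRecord) -> bool:
    """Defensive filter: only PDF, HTTP 200, under the size limit."""
    return (
        rec.mimetype == "application/pdf"
        and rec.status == "200"
        and rec.size <= MAX_SIZE_BYTES
    )


def backfill(lines: Iterable[str], table: Dict[str, CdxRecord]) -> int:
    """Filter first, then insert only rows whose key is not already present."""
    inserted = 0
    for line in lines:
        rec = parse_cdx_line(line)
        if rec is None or not keep(rec):
            continue
        key = "sha1:" + rec.sha1  # hypothetical row key scheme
        if key in table:  # read before insert; don't overwrite
            continue
        table[key] = rec
        inserted += 1
    return inserted
```

The filter runs before any table access, so rejected rows never cost a read; a second run over the same input inserts nothing.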