author: Bryan Newbold <bnewbold@archive.org> 2018-08-24 13:39:02 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2018-08-24 13:39:02 -0700
commit: 1ae7fd2f0c5661560b15be86614c2c4d41b21205
tree: 71ed116cfbc65562bfcbd2d913402c098c23c1df /notes/backfill_scalding_rewrite.txt
parent: f21bf5c66382a475a5127e449d05a75ba41a9a25
commit notes from my laptop
Diffstat (limited to 'notes/backfill_scalding_rewrite.txt')
-rw-r--r-- notes/backfill_scalding_rewrite.txt | 22
1 files changed, 22 insertions, 0 deletions
diff --git a/notes/backfill_scalding_rewrite.txt b/notes/backfill_scalding_rewrite.txt
new file mode 100644
index 0000000..f5fb1d1
--- /dev/null
+++ b/notes/backfill_scalding_rewrite.txt
@@ -0,0 +1,22 @@
+
+Background context needed:
+- CDX text file format
+- rough arch outline (what runs where)
+- basic hadoop+hbase overview
+- hbase schema
+- quick look at hadoop and hbase web interfaces
+- maybe quick re-profile?
+
+Plan/Steps:
+x together: get *any* JVM map/reduce thing to build and run on cluster
+x together: get something to build that talks to hbase
+x basic JVM test infra; HBase mockup. "shopping"
+ => scalding and/or cascading
+x simple hbase scan report generation (counts/stats)
+x CDX parsing
+- complete backfill script
+
+Spec for CDX backfill script:
+- input is CDX, output to HBase table
+- filter input before anything ("defensive"; only PDF, HTTP 200, size limit)
+- reads HBase before insert; don't overwrite
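
The backfill spec above (filter first, then read before insert) can be sketched roughly as follows. This is only an illustration in Python, not the Scalding job the notes plan. It assumes the common 11-column CDX layout (urlkey, timestamp, URL, mimetype, status, SHA-1 digest, redirect, meta, compressed size, offset, WARC filename). The size limit value, the use of the digest as row key, the function names, and the plain dict standing in for the HBase table are all hypothetical.

```python
from typing import Dict, Iterable, Optional

# Assumed limit; the notes say "size limit" but do not give a number.
MAX_SIZE_BYTES = 64 * 1024 * 1024


def parse_cdx_line(line: str) -> Optional[dict]:
    """Parse one space-separated CDX line.

    Returns None if the line is malformed or fails the defensive
    filter (only PDF, HTTP 200, under the size limit).
    """
    fields = line.strip().split(" ")
    if len(fields) != 11:
        return None  # header lines and truncated lines end up here
    (_urlkey, timestamp, url, mime, status, sha1,
     _redirect, _meta, size, offset, warc) = fields
    if mime != "application/pdf" or status != "200":
        return None
    if not size.isdigit() or int(size) > MAX_SIZE_BYTES:
        return None
    return {
        "sha1": sha1,
        "timestamp": timestamp,
        "url": url,
        "size": int(size),
        "offset": offset,
        "warc": warc,
    }


def backfill(lines: Iterable[str], table: Dict[str, dict]) -> int:
    """Insert filtered CDX rows into `table`, never overwriting.

    `table` is a dict standing in for the HBase table; the row key
    (the SHA-1 digest) is an assumption made for this sketch.
    Returns the number of rows inserted.
    """
    inserted = 0
    for line in lines:
        row = parse_cdx_line(line)
        if row is None:
            continue
        if row["sha1"] in table:  # read before insert; don't overwrite
            continue
        table[row["sha1"]] = row
        inserted += 1
    return inserted
```

Filtering before the existence check keeps lookups against the table limited to rows that could actually be inserted, which matches the "filter input before anything" item in the spec.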