author Bryan Newbold <bnewbold@archive.org> 2018-04-04 13:31:59 -0700
committer Bryan Newbold <bnewbold@archive.org> 2018-04-04 13:31:59 -0700
commit b8cf9f6ea726970775ea49a44b243ad158d14a7c (patch)
tree f46314ef9d4dddf92a25ce21b2ea47ddddfa73e9 /mapreduce
parent 7ecb1334506cab470399d9f493e5d8a651c9c2cc (diff)
README/TODO updates
Diffstat (limited to 'mapreduce')
-rw-r--r-- mapreduce/README.md 17
1 files changed, 9 insertions, 8 deletions
diff --git a/mapreduce/README.md b/mapreduce/README.md
index b063fba..3cff9f1 100644
--- a/mapreduce/README.md
+++ b/mapreduce/README.md
@@ -1,14 +1,11 @@
-## Development and Testing
-
-Requires (eg, via `apt`):
+Hadoop streaming map/reduce jobs written in python using the mrjob library.
-- libjpeg-dev
+## Development and Testing
-Install pipenv system-wide if you don't have it:
+System dependencies in addition to `../README.md`:
- # or use apt, homebrew, etc
- sudo pip3 install pipenv
+- `libjpeg-dev` (for wayback libraries)
Run the tests with:
@@ -16,7 +13,11 @@ Run the tests with:
TODO: GROBID and HBase during development?
-## Backfill
+## Extraction Task
+
+TODO:
+
+## Backfill Task
An example actually connecting to HBase from a local machine, with thrift
running on a devbox: