| author | Bryan Newbold <bnewbold@archive.org> | 2018-04-04 13:31:59 -0700 | 
|---|---|---|
| committer | Bryan Newbold <bnewbold@archive.org> | 2018-04-04 13:31:59 -0700 | 
| commit | b8cf9f6ea726970775ea49a44b243ad158d14a7c | |
| tree | f46314ef9d4dddf92a25ce21b2ea47ddddfa73e9 /mapreduce | |
| parent | 7ecb1334506cab470399d9f493e5d8a651c9c2cc | |
README/TODO updates
Diffstat (limited to 'mapreduce')
| -rw-r--r-- | mapreduce/README.md | 17 | 
1 files changed, 9 insertions, 8 deletions

    diff --git a/mapreduce/README.md b/mapreduce/README.md
    index b063fba..3cff9f1 100644
    --- a/mapreduce/README.md
    +++ b/mapreduce/README.md
    @@ -1,14 +1,11 @@
    -## Development and Testing
    -
    -Requires (eg, via `apt`):
    +Hadoop streaming map/reduce jobs written in python using the mrjob library.
     
    -- libjpeg-dev
    +## Development and Testing
     
    -Install pipenv system-wide if you don't have it:
    +System dependencies in addition to `../README.md`:
     
    -    # or use apt, homebrew, etc
    -    sudo pip3 install pipenv
    +- `libjpeg-dev` (for wayback libraries)
     
     Run the tests with:
    @@ -16,7 +13,11 @@ Run the tests with:
     
     TODO: GROBID and HBase during development?
     
    -## Backfill
    +## Extraction Task
    +
    +TODO:
    +
    +## Backfill Task
     
     An example actually connecting to HBase from a local machine, with thrift
     running on a devbox:
