aboutsummaryrefslogtreecommitdiffstats
                                  _                         _           
__________    ___  __ _ _ __   __| | ___ _ __ __ ___      _| | ___ _ __ 
\         |  / __|/ _` | '_ \ / _` |/ __| '__/ _` \ \ /\ / / |/ _ \ '__|
 \        |  \__ \ (_| | | | | (_| | (__| | | (_| |\ V  V /| |  __/ |   
  \ooooooo|  |___/\__,_|_| |_|\__,_|\___|_|  \__,_| \_/\_/ |_|\___|_|

This repo contains hadoop tasks (mapreduce and pig), luigi jobs, and other scripts and code for the internet archive (web group) journal ingest pipeline.

This repository is potentially public.

Archive-specific deployment/production guides and ansible scripts at: journal-infra

Python Setup

Pretty much everything here uses python/pipenv. To setup your environment for this, and python in general:

# libjpeg-dev is for some wayback/pillow stuff
sudo apt install -y python3-dev python3-pip python3-wheel libjpeg-dev build-essentials
pip3 install --user pipenv

On macOS:

brew install libjpeg pipenv

Each directory has it's own environment. Do something like:

cd mapreduce
pipenv install --dev
pipenv shell

Possible Issues with Setup

Bryan had ~/.local/bin in his $PATH, and that seemed to make everything work.