                                      _                         _           
    __________    ___  __ _ _ __   __| | ___ _ __ __ ___      _| | ___ _ __ 
    \         |  / __|/ _` | '_ \ / _` |/ __| '__/ _` \ \ /\ / / |/ _ \ '__|
     \        |  \__ \ (_| | | | | (_| | (__| | | (_| |\ V  V /| |  __/ |   
      \ooooooo|  |___/\__,_|_| |_|\__,_|\___|_|  \__,_| \_/\_/ |_|\___|_|   


This repo contains Hadoop tasks (MapReduce and Pig), Luigi jobs, and other
scripts and code for the Internet Archive (web group) journal ingest pipeline.

This repository is potentially public.

Archive-specific deployment/production guides and Ansible scripts are at:
[journal-infra](https://git.archive.org/bnewbold/journal-infra)

## Python Setup

Pretty much everything here uses python/pipenv. To set up your environment for
this, and python in general:

    # libjpeg-dev is for some wayback/pillow stuff
    sudo apt install -y python3-dev python3-pip python3-wheel libjpeg-dev build-essential
    pip3 install --user pipenv

On macOS:

    brew install libjpeg pipenv

Each directory has its own environment. Do something like:

    cd mapreduce
    pipenv install --dev
    pipenv shell

## Possible Issues with Setup

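Bryan had `~/.local/bin` in his `$PATH`, and that seemed to make everything
work. The likely reason is that `pip3 install --user` puts the `pipenv`
executable in that directory on Linux, so the shell can't find it unless the
directory is on `$PATH`.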
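
If `pipenv` isn't found after the `pip3 install --user` step, something like
the following may help. It is only a sketch, assuming a POSIX-style shell
(bash, zsh, etc.) and that pip put user-level scripts in `~/.local/bin`, which
is the usual location on Linux but is not guaranteed on every system:

```shell
# Prepend ~/.local/bin to PATH for this session, unless it is already there.
# (Assumption: this is where `pip3 install --user` placed the pipenv script.)
case ":$PATH:" in
    *":$HOME/.local/bin:"*) ;;
    *) export PATH="$HOME/.local/bin:$PATH" ;;
esac

# Report where pipenv resolves from, or say that it still can't be found.
command -v pipenv || echo "pipenv still not found on PATH"
```

To make the change permanent, the `export` line could go in your shell's
startup file (for example `~/.profile` or `~/.bashrc`, depending on your setup).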