_ _
__________ ___ __ _ _ __ __| | ___ _ __ __ ___ _| | ___ _ __
\ | / __|/ _` | '_ \ / _` |/ __| '__/ _` \ \ /\ / / |/ _ \ '__|
\ | \__ \ (_| | | | | (_| | (__| | | (_| |\ V V /| | __/ |
\ooooooo| |___/\__,_|_| |_|\__,_|\___|_| \__,_| \_/\_/ |_|\___|_|
This repo contains Hadoop tasks (MapReduce and Pig), Luigi jobs, and other
scripts and code for the Internet Archive (web group) journal ingest pipeline.

This repository is potentially public.

Archive-specific deployment/production guides and Ansible scripts are at:
[journal-infra](https://git.archive.org/bnewbold/journal-infra)

## Python Setup
Pretty much everything here uses python/pipenv. To set up your environment for
this, and python in general:

    # libjpeg-dev is for some wayback/pillow stuff
    sudo apt install -y python3-dev python3-pip python3-wheel libjpeg-dev build-essential
    pip3 install --user pipenv

On macOS:

    brew install libjpeg pipenv

Each directory has its own environment. Do something like:

    cd mapreduce
    pipenv install --dev
    pipenv shell

## Possible Issues with Setup
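On Linux, `pip3 install --user` puts the `pipenv` executable in `~/.local/bin`,
so that directory needs to be in your `$PATH`. Bryan had `~/.local/bin` in his
`$PATH`, and that seemed to make everything work.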
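
If `pipenv` is not found after installing it, a PATH check along these lines may help. This is only a sketch: it assumes a POSIX-style shell and that `pip3 install --user` placed the executable in `~/.local/bin` (the usual location on Linux; other platforms may differ). The `LOCAL_BIN` variable is just an illustrative name.

```shell
# Assumption: pipenv was installed with `pip3 install --user pipenv` on Linux,
# so its executable lives in ~/.local/bin.
LOCAL_BIN="$HOME/.local/bin"

# Prepend the directory only if it is not already on the PATH.
case ":$PATH:" in
    *":$LOCAL_BIN:"*) ;;
    *) export PATH="$LOCAL_BIN:$PATH" ;;
esac

# Report whether the shell can find pipenv now.
if command -v pipenv >/dev/null 2>&1; then
    echo "pipenv found at: $(command -v pipenv)"
else
    echo "pipenv not found; check where pip3 installed it" >&2
fi
```

This only affects the current shell session. To keep the change, the same `export PATH=...` line would typically go in a shell startup file such as `~/.profile` or `~/.bashrc`, depending on your shell.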