# sandcrawler
This repo contains Hadoop jobs, Luigi tasks, and other scripts and code for the
Internet Archive web group's journal ingest pipeline.
Code in this repository is potentially public!

Archive-specific deployment/production guides and ansible scripts live at:
[journal-infra](https://git.archive.org/webgroup/journal-infra)
**./python/** contains Python scripts and utilities for the ingest pipeline.
**./sql/** contains schema, queries, and backfill scripts for a PostgreSQL
database index (eg, file metadata, CDX, and GROBID status tables).
**./minio/** contains docs on how to set up and use a minio S3-compatible blob
store (eg, for GROBID XML output).
**./scalding/** contains Hadoop jobs written in Scala using the Scalding
framework. The intent is to write new non-trivial Hadoop jobs in Scala, which
brings type safety and compiled performance.
**./python_hadoop/** contains Hadoop streaming jobs written in Python using the
`mrjob` library. Considered deprecated!
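The `mrjob` jobs are Hadoop streaming jobs: the framework pipes input records
through a mapper and, after a sort/shuffle by key, a reducer over
stdin/stdout. As a minimal illustration of that model (a word-count sketch in
plain Python, not code from this repo):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every token."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word. Input must be
    sorted by key, as Hadoop streaming guarantees between phases."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation of map -> shuffle/sort -> reduce.
    mapped = sorted(mapper(sys.stdin))
    for word, count in reducer(mapped):
        print(f"{word}\t{count}")
```

On a real cluster the `sorted()` step is done by Hadoop's shuffle; locally the
same pipeline can be exercised with `cat input.txt | python wordcount.py`.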
**./pig/** contains a handful of Pig scripts, as well as some unit tests
implemented in Python.
## Running Hadoop Jobs
The `./please` python3 wrapper script is a helper for running jobs (python or
scalding) on the IA Hadoop cluster. You'll need to run the setup/dependency
tasks first; see README files in subdirectories.
## Running Python Code
You need python3.5 (or python3.6+ and `pyenv`) and `pipenv` to set up the
environment. You may also need the debian packages `libpq-dev` and
`python-dev` to install some dependencies.