sandcrawler
This repo contains Hadoop jobs, Luigi tasks, and other scripts and code for the Internet Archive web group's journal ingest pipeline.
Code in this repository is potentially public!
Archive-specific deployment/production guides and Ansible scripts live in the journal-infra repository.
./python/ contains assorted scripts and utilities.
./sql/ contains schema, queries, and backfill scripts for a PostgreSQL database index (e.g., file metadata, CDX, and GROBID status tables).
./minio/ contains docs on how to set up and use a minio S3-compatible blob store (e.g., for GROBID XML output); a usage sketch follows this list.
./scalding/ contains Hadoop jobs written in Scala using the Scalding framework. The intent is to write new non-trivial Hadoop jobs in Scala, which brings type safety and compiled performance.
./python_hadoop/ contains Hadoop streaming jobs written in Python using the mrjob library. Considered deprecated! A minimal example follows this list.
./pig/ contains a handful of Pig scripts, as well as some unit tests implemented in Python.
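
As a rough illustration of the minio setup mentioned above, here is a sketch
that stores and fetches GROBID TEI-XML blobs through boto3's S3 client pointed
at a local minio endpoint. The endpoint, credentials, bucket name, and key
layout are all illustrative assumptions, not this repo's actual configuration:

```python
# Store/fetch GROBID TEI-XML blobs against a local minio endpoint via
# boto3's S3 client. Endpoint, credentials, bucket, and key layout are
# assumptions for illustration only.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # default minio port (assumption)
    aws_access_key_id="minioadmin",        # example credentials (assumption)
    aws_secret_access_key="minioadmin",
)

BUCKET = "grobid-output"  # hypothetical bucket name

def put_tei_xml(sha1hex, tei_xml):
    """Store one TEI-XML document, keyed by the source file's SHA-1 hex digest."""
    s3.put_object(Bucket=BUCKET, Key="grobid/{}.tei.xml".format(sha1hex), Body=tei_xml)

def get_tei_xml(sha1hex):
    """Fetch a previously stored TEI-XML document as bytes."""
    resp = s3.get_object(Bucket=BUCKET, Key="grobid/{}.tei.xml".format(sha1hex))
    return resp["Body"].read()
```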
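
And for flavor, a minimal mrjob streaming job in the style of ./python_hadoop/.
This is a generic word count, not one of this repo's actual jobs:

```python
# Minimal Hadoop streaming job using mrjob, in the style of the
# (deprecated) ./python_hadoop/ jobs. Generic word count for illustration.
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for each whitespace-separated token.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the per-word counts emitted by all mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Such a job runs locally with `python word_count.py input.txt`, or on a cluster
by passing mrjob's `-r hadoop` runner option.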
Running Hadoop Jobs
The ./please python3 wrapper script is a helper for running jobs (Python or
Scalding) on the IA Hadoop cluster. You'll need to run the setup/dependency
tasks first; see the README files in the subdirectories.
Running Python Code
You need python3.5 (or python3.6+ and pyenv) and pipenv to set up the
environment. You may also need the Debian packages libpq-dev and python-dev
to install some dependencies (libpq-dev, for instance, is needed to build the
psycopg2 Postgres driver from source).
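
As a hedged sketch of querying the Postgres index described above, the snippet
below tallies rows by status in a hypothetical GROBID status table; the
connection string, table name, and column name are assumptions, not this
repo's actual schema:

```python
# Tally rows by status in a hypothetical GROBID status table in the
# Postgres index. Connection string, table name ("grobid"), and column
# name ("status_code") are assumptions for illustration.
import psycopg2

conn = psycopg2.connect("dbname=sandcrawler user=postgres host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("SELECT status_code, COUNT(*) FROM grobid GROUP BY status_code;")
    for status_code, count in cur.fetchall():
        print("{}\t{}".format(status_code, count))
conn.close()
```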