This repo contains Hadoop tasks (MapReduce and Pig), Luigi jobs, and other scripts and code for the journal ingest pipeline.

This repository is potentially public. Maybe we'll rename it "sandcrawler"?

Archive-specific deployment/production guides and Ansible scripts are at: [journal-infra](https://git.archive.org/bnewbold/journal-infra)