From ccc181355d4bf54e9c0018ccc440e302763697cb Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 1 Oct 2020 19:51:04 -0700
Subject: update README (public)

---
 README.md | 44 +++++++++++++++++++++++++++-----------------
 1 file changed, 27 insertions(+), 17 deletions(-)

(limited to 'README.md')

diff --git a/README.md b/README.md
index 737027f..768bd00 100644
--- a/README.md
+++ b/README.md
@@ -6,40 +6,50 @@
 \ooooooo| |___/\__,_|_| |_|\__,_|\___|_| \__,_| \_/\_/ |_|\___|_|
 
-This repo contains hadoop jobs, luigi tasks, and other scripts and code for the
-internet archive web group's journal ingest pipeline.
+This repo contains back-end python workers, scripts, hadoop jobs, luigi tasks,
+and other scripts and code for the Internet Archive web group's journal ingest
+pipeline. This code is of mixed quality and is mostly experimental. The goal
+for most of this is to submit metadata to [fatcat](https://fatcat.wiki), which
+is the more stable, maintained, and public-facing service.
 
-Code in tihs repository is potentially public!
+Code in this repository is potentially public! Not intended to accept public
+contributions for the most part. Much of this will not work outside the IA
+cluster environment.
 
 Archive-specific deployment/production guides and ansible scripts at:
 [journal-infra](https://git.archive.org/webgroup/journal-infra)
 
-**./python/** contains scripts and utilities for
+
+## Repository Layout
+
+**./proposals/** design documentation and change proposals
+
+**./python/** contains scripts and utilities for ingesting content from wayback
+and/or the web (via save-page-now API), and other processing pipelines
 
 **./sql/** contains schema, queries, and backfill scripts for a Postgres SQL
 database index (eg, file metadata, CDX, and GROBID status tables).
 
-**./minio/** contains docs on how to setup and use a minio S3-compatible blob
-store (eg, for GROBID XML output)
+**./pig/** contains a handful of Pig scripts, as well as some unittests
+implemented in python. Only rarely used.
 
 **./scalding/** contains Hadoop jobs written in Scala using the Scalding
 framework. The intent is to write new non-trivial Hadoop jobs in Scala, which
-brings type safety and compiled performance.
+brings type safety and compiled performance. Mostly DEPRECATED.
 
 **./python_hadoop/** contains Hadoop streaming jobs written in python using the
-`mrjob` library. Considered deprecated!
+`mrjob` library. Mostly DEPRECATED.
 
-**./pig/** contains a handful of Pig scripts, as well as some unittests
-implemented in python.
-
-## Running Hadoop Jobs
-
-The `./please` python3 wrapper script is a helper for running jobs (python or
-scalding) on the IA Hadoop cluster. You'll need to run the setup/dependency
-tasks first; see README files in subdirectories.
 
 ## Running Python Code
 
 You need python3.7 (or python3.6+ and `pyenv`) and `pipenv` to set up the
 environment. You may also need the debian packages `libpq-dev` and `
-python-dev` to install some dependencies.
+`python-dev` to install some dependencies.
+
+
+## Running Hadoop Jobs (DEPRECATED)
+
+The `./please` python3 wrapper script is a helper for running jobs (python or
+scalding) on the IA Hadoop cluster. You'll need to run the setup/dependency
+tasks first; see README files in subdirectories.
--
cgit v1.2.3
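
Editorial note (not part of the patch above): the new "Running Python Code" section describes a pipenv-based environment setup. A minimal sketch of that setup on a Debian host might look like the following. Only the `libpq-dev`/`python-dev` packages and the use of `pipenv` come from the README text itself; the `--dev` flag and the worker script path are assumptions added for illustration.

```sh
# Debian packages the README says are needed to build some python dependencies
sudo apt install libpq-dev python-dev

# Create the project virtualenv from the repository's Pipfile
# (--dev is an assumption; drop it if dev/test dependencies are not wanted)
pipenv install --dev

# Run code inside the environment; this script path is hypothetical
pipenv run python python/example_worker.py --help
```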