From 5d3a3d4a4ab0b92da78e8f0dfb50ff27ea88039f Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Fri, 23 Dec 2022 15:48:29 -0800
Subject: update README for Dec 2022

---
 README.md | 60 ++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 36 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index a0eaa98..b29e397 100644
--- a/README.md
+++ b/README.md
@@ -6,15 +6,25 @@
      \ooooooo|  |___/\__,_|_| |_|\__,_|\___|_|  \__,_| \_/\_/ |_|\___|_|
 
-This repo contains back-end python workers, scripts, hadoop jobs, luigi tasks,
-and other scripts and code for the Internet Archive web group's journal ingest
-pipeline. This code is of mixed quality and is mostly experimental. The goal
-for most of this is to submit metadata to [fatcat](https://fatcat.wiki), which
-is the more stable, maintained, and public-facing service.
-
-Code in this repository is potentially public! Not intended to accept public
-contributions for the most part. Much of this will not work outside the IA
-cluster environment.
+This repo contains back-end python workers, scripts, config files, and other
+material related to the Internet Archive web group's scholarly web preservation
+and processing pipeline. It is a complement to [fatcat](https://fatcat.wiki),
+which is an open catalog of research outputs, including preservation metadata.
+
+The sandcrawler part of the project deals with content crawled from the web
+into either web.archive.org or archive.org collections, and with post-processing
+that content: for example, extracting text from PDF files, verifying mimetypes,
+and checking archival status. The resulting metadata gets filtered,
+transformed, and pushed into fatcat itself for public use.
+
+While code in this repository is public, it is mostly IA-specific and may not
+even run outside the IA data centers due to library dependencies and
+authentication needs. Code quality and documentation are generally poor
+compared to fatcat.
+
+As of December 2022, the best document to read for "getting started" with the
+ingest system is `proposals/2019_ingest.md`, followed by the subsequent
+proposals expanding on that foundation.
 
 Archive-specific deployment/production guides and ansible scripts at:
 [journal-infra](https://git.archive.org/webgroup/journal-infra)
@@ -22,33 +32,35 @@ Archive-specific deployment/production guides and ansible scripts at:
 
 ## Repository Layout
 
-**./proposals/** design documentation and change proposals
-
 **./python/** contains scripts and utilities for ingesting content from wayback
-and/or the web (via save-page-now API), and other processing pipelines
+and/or the web (via the save-page-now API), and other processing pipelines. Most
+of the active code is in here. See the included README (`./python/README.md`).
 
 **./sql/** contains schema, queries, and backfill scripts for a Postgres SQL
 database index (e.g., file metadata, CDX, and GROBID status tables).
 
-**./pig/** contains a handful of Pig scripts, as well as some unittests
-implemented in python. Only rarely used.
+**./python_hadoop/** contains Hadoop streaming jobs written in python using the
+`mrjob` library. We still use the HBase backfill code path occasionally.
 
-**./scalding/** contains Hadoop jobs written in Scala using the Scalding
-framework. The intent is to write new non-trivial Hadoop jobs in Scala, which
-brings type safety and compiled performance. Mostly DEPRECATED.
+
+**./proposals/** design documentation and change proposals
 
-**./python_hadoop/** contains Hadoop streaming jobs written in python using the
-`mrjob` library. Mostly DEPRECATED.
 
+**./notes/ingest/** log of bulk crawls and metadata loads
+
+**./extra/docker/** docker-compose setup that may be useful for documentation
+(includes Kafka, PostgreSQL, etc.)
 
-## Running Python Code
+**./.gitlab-ci.yml** current CI setup script, which documents dependencies
 
-You need python3.8 (or python3.6+ and `pyenv`) and `pipenv` to set up the
-environment. You may also need the debian packages `libpq-dev` and
-`python-dev` to install some dependencies.
+**./pig/** contains a handful of Pig scripts, as well as some unit tests
+implemented in python. Only rarely used.
+
+**./scalding/** contains Hadoop jobs written in Scala using the Scalding
+framework. The intent was to write new non-trivial Hadoop jobs in Scala, which
+brings type safety and compiled performance. Mostly DEPRECATED; this code has
+not been run in years.
 
-## Running Hadoop Jobs (DEPRECATED)
+## Running Python Hadoop Jobs
 
 The `./please` python3 wrapper script is a helper for running jobs (python or
 scalding) on the IA Hadoop cluster. You'll need to run the setup/dependency
-- 
cgit v1.2.3
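
For orientation, here is a minimal sketch of the kind of post-processing the
new README text describes: verify a crawled blob's mimetype, then extract text
from PDFs. This is not code from the repository; the function name and result
shape are invented, and it assumes the `python-magic` package and poppler's
`pdftotext` CLI are installed.

```python
import subprocess
import tempfile

import magic  # python-magic (libmagic bindings); an assumed dependency

def process_blob(blob: bytes) -> dict:
    """Verify a crawled blob's mimetype; if it is a PDF, extract its text."""
    mimetype = magic.from_buffer(blob, mime=True)
    if mimetype != "application/pdf":
        return {"status": "wrong-mimetype", "mimetype": mimetype}
    with tempfile.NamedTemporaryFile(suffix=".pdf") as f:
        f.write(blob)
        f.flush()
        # the "-" output argument tells pdftotext to print the text to stdout
        result = subprocess.run(
            ["pdftotext", f.name, "-"],
            capture_output=True,
            timeout=60,
        )
    if result.returncode != 0:
        return {"status": "extract-error", "mimetype": mimetype}
    return {
        "status": "success",
        "mimetype": mimetype,
        "text": result.stdout.decode("utf-8", errors="replace"),
    }
```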
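
The `./sql/` index mentioned above can be queried like any Postgres database.
A sketch using `psycopg2`; the DSN, table, and column names here (`file_meta`,
`sha1hex`, and so on) are assumptions for illustration, and the real schema
lives in `./sql/`.

```python
import psycopg2

def lookup_file_meta(sha1hex: str):
    """Fetch per-file metadata for one file, keyed by SHA-1 hex digest."""
    conn = psycopg2.connect("dbname=sandcrawler")  # DSN is an assumption
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT sha1hex, mimetype, size_bytes FROM file_meta "
                "WHERE sha1hex = %s",
                (sha1hex,),
            )
            return cur.fetchone()
    finally:
        conn.close()
```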
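
Finally, a sketch of the `mrjob` streaming-job pattern that `./python_hadoop/`
uses. The job itself is invented for illustration: it counts files by mimetype,
assuming tab-separated input rows whose second column is a mimetype.

```python
from mrjob.job import MRJob

class CountMimetypes(MRJob):
    """Count input rows by mimetype; mapper/reducer run as streaming tasks."""

    def mapper(self, _, line):
        # raw text input: key is None, value is one line of the input file
        fields = line.split("\t")
        if len(fields) >= 2:
            yield fields[1], 1

    def reducer(self, mimetype, counts):
        yield mimetype, sum(counts)

if __name__ == "__main__":
    CountMimetypes.run()
```

A job like this runs locally with `python count_mimetypes.py input.tsv` for
testing, and against a cluster with mrjob's `-r hadoop` runner.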