aboutsummaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
-rw-r--r--README.md60
1 files changed, 36 insertions, 24 deletions
diff --git a/README.md b/README.md
index a0eaa98..b29e397 100644
--- a/README.md
+++ b/README.md
@@ -6,15 +6,25 @@
\ooooooo| |___/\__,_|_| |_|\__,_|\___|_| \__,_| \_/\_/ |_|\___|_|
-This repo contains back-end python workers, scripts, hadoop jobs, luigi tasks,
-and other scripts and code for the Internet Archive web group's journal ingest
-pipeline. This code is of mixed quality and is mostly experimental. The goal
-for most of this is to submit metadata to [fatcat](https://fatcat.wiki), which
-is the more stable, maintained, and public-facing service.
-
-Code in this repository is potentially public! Not intended to accept public
-contributions for the most part. Much of this will not work outside the IA
-cluster environment.
+This repo contains back-end python workers, scripts, config files, and other
+stuff related to the Internet Archive web group's scholarly web preservation
+and processing pipeline. It is a complement to [fatcat](https://fatcat.wiki),
+which is an open catalog of research outputs, including preservation metadata.
+
+The sandcrawler part of the project deals with content crawled from the web
+into either web.archive.org or archive.org collections, and post-processing
+that content. For example, extracting text from PDF files, verifying mimetypes,
+and checking archival status. The resulting metadata ends up getting filtered,
+transformed, and pushed in to fatcat itself for public use.
+
+While code in this repository is public, it is mostly IA-specific and may not
+even run outside the IA data centers due to library dependencies and
+authentication needs. Code quality and documentation is generally poor compared
+to fatcat.
+
+As of December 2022, the best document to read for "getting started" in
+understanding the ingest system is `proposals/2019_ingest.md`, and then
+subsequent proposals expanding on that foundation.
Archive-specific deployment/production guides and ansible scripts at:
[journal-infra](https://git.archive.org/webgroup/journal-infra)
@@ -22,33 +32,35 @@ Archive-specific deployment/production guides and ansible scripts at:
## Repository Layout
-**./proposals/** design documentation and change proposals
-
**./python/** contains scripts and utilities for ingesting content from wayback
-and/or the web (via save-page-now API), and other processing pipelines
+and/or the web (via save-page-now API), and other processing pipelines. Most of
+the active code is in here. See included README (`./python/README.md`)
**./sql/** contains schema, queries, and backfill scripts for a Postgres SQL
database index (eg, file metadata, CDX, and GROBID status tables).
-**./pig/** contains a handful of Pig scripts, as well as some unittests
-implemented in python. Only rarely used.
+**./python_hadoop/** contains Hadoop streaming jobs written in python using the
+`mrjob` library. Still use the HBase backfill code path occasionally.
-**./scalding/** contains Hadoop jobs written in Scala using the Scalding
-framework. The intent is to write new non-trivial Hadoop jobs in Scala, which
-brings type safety and compiled performance. Mostly DEPRECATED.
+**./proposals/** design documentation and change proposals
-**./python_hadoop/** contains Hadoop streaming jobs written in python using the
-`mrjob` library. Mostly DEPRECATED.
+**./notes/ingest/** log of bulk crawls and metadata loads
+**./extra/docker/** docker-compose setup that may be useful for documentation
+(includes Kafka, PostgreSQL, etc)
-## Running Python Code
+**./.gitlab-ci.yml** current CI setup script, which documents dependencies
-You need python3.8 (or python3.6+ and `pyenv`) and `pipenv` to set up the
-environment. You may also need the debian packages `libpq-dev` and `
-`python-dev` to install some dependencies.
+**./pig/** contains a handful of Pig scripts, as well as some unittests
+implemented in python. Only rarely used.
+
+**./scalding/** contains Hadoop jobs written in Scala using the Scalding
+framework. The intent is to write new non-trivial Hadoop jobs in Scala, which
+brings type safety and compiled performance. Mostly DEPRECATED, this code has
+not been run in years.
-## Running Hadoop Jobs (DEPRECATED)
+## Running Python Hadoop Jobs
The `./please` python3 wrapper script is a helper for running jobs (python or
scalding) on the IA Hadoop cluster. You'll need to run the setup/dependency