update README with new folders

author: Bryan Newbold <bnewbold@archive.org> 2019-09-25 17:55:04 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2019-09-25 17:55:04 -0700
commit: 0e898d2854388f1798a6a6d537b4ec6413762f1b (patch)
tree: 0d3d7e5ae3eeb26c7e3e76f9e53a0d49223c4517
parent: 353dc0c2954d9f834fcccb49558728e326abca5b (diff)
1 files changed, 10 insertions, 4 deletions
diff --git a/README.md b/README.md
index a6eeb5b..386149d 100644
--- a/README.md
+++ b/README.md
@@ -14,15 +14,21 @@ Code in tihs repository is potentially public!
 Archive-specific deployment/production guides and ansible scripts at:
 [journal-infra](https://git.archive.org/webgroup/journal-infra)
 
-**./python/** contains Hadoop streaming jobs written in python using the
-`mrjob` library. Most notably, the **extraction** scripts, which fetch PDF
-files from wayback/petabox, process them with GROBID, and store the result in
-HBase.
+**./python/** contains scripts and utilities for 
+
+**./sql/** contains schema, queries, and backfill scripts for a Postgres SQL
+database index (eg, file metadata, CDX, and GROBID status tables).
+
+**./minio/** contains docs on how to setup and use a minio S3-compatible blob
+store (eg, for GROBID XML output)
 
 **./scalding/** contains Hadoop jobs written in Scala using the Scalding
 framework. The intent is to write new non-trivial Hadoop jobs in Scala, which
 brings type safety and compiled performance.
 
+**./python_hadoop/** contains Hadoop streaming jobs written in python using the
+`mrjob` library. Considered deprecated!
+
 **./pig/** contains a handful of Pig scripts, as well as some unittests
 implemented in python.