author | Bryan Newbold <bnewbold@archive.org> | 2019-09-25 17:55:04 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2019-09-25 17:55:04 -0700 |
commit | 0e898d2854388f1798a6a6d537b4ec6413762f1b | |
tree | 0d3d7e5ae3eeb26c7e3e76f9e53a0d49223c4517 | |
parent | 353dc0c2954d9f834fcccb49558728e326abca5b | |
update README with new folders
-rw-r--r-- | README.md | 14 |
1 files changed, 10 insertions, 4 deletions
@@ -14,15 +14,21 @@ Code in tihs repository is potentially public!
 Archive-specific deployment/production guides and ansible scripts at:
 [journal-infra](https://git.archive.org/webgroup/journal-infra)

-**./python/** contains Hadoop streaming jobs written in python using the
-`mrjob` library. Most notably, the **extraction** scripts, which fetch PDF
-files from wayback/petabox, process them with GROBID, and store the result in
-HBase.
+**./python/** contains scripts and utilities for
+
+**./sql/** contains schema, queries, and backfill scripts for a Postgres SQL
+database index (eg, file metadata, CDX, and GROBID status tables).
+
+**./minio/** contains docs on how to setup and use a minio S3-compatible blob
+store (eg, for GROBID XML output)

 **./scalding/** contains Hadoop jobs written in Scala using the Scalding
 framework. The intent is to write new non-trivial Hadoop jobs in Scala, which
 brings type safety and compiled performance.

+**./python_hadoop/** contains Hadoop streaming jobs written in python using the
+`mrjob` library. Considered deprecated!
+
 **./pig/** contains a handful of Pig scripts, as well as some unittests
 implemented in python.