Diffstat (limited to 'extra/blobs')
-rw-r--r--    extra/blobs/README.md              86
-rw-r--r--    extra/blobs/minio/README.md        74
-rw-r--r--    extra/blobs/minio/minio.conf       14
-rw-r--r--    extra/blobs/seaweedfs/README.md     9
-rw-r--r--    extra/blobs/tasks.md               53
5 files changed, 236 insertions, 0 deletions
diff --git a/extra/blobs/README.md b/extra/blobs/README.md
new file mode 100644
index 0000000..555db92
--- /dev/null
+++ b/extra/blobs/README.md
@@ -0,0 +1,86 @@

This document describes sandcrawler/fatcat use of "blob store" infrastructure
for storing hundreds of millions of small files: for example, GROBID XML
documents and JPEG thumbnails of PDFs.

The basic feature requirements for this system are:

- no need for preservation-grade data resiliency: all of this data is derived
  from primary content, and is usually redundantly stored in Kafka topics (and
  thus can be re-indexed to any server, bounded only by the throughput of the
  object store service; Kafka is usually faster)
- no requirement for SSDs or large amounts of RAM. The ability to accelerate
  performance with additional RAM or by moving indexes to SSD is nice, but we
  will be using spinning disks for primary data storage
- hundreds of millions or billions of objects, fetchable by a key we define
- optional transparent compression (for text and XML)
- typical object (file) size of 5-200 KBytes uncompressed; want to support up
  to several MBytes
- very simple internal API for GET/PUT (S3-API compatible is good)
- ability to proxy reads to public HTTP (eg, HTTP fall-back with no
  authentication), controllable at least at bucket granularity

## Infrastructure

`minio` was used initially, but did not scale well in number of files. We
currently use seaweedfs. Any S3-compatible key/value store should work in
theory. openlibrary.org has used WARCs in petabox items in the past. Actual
cloud object stores tend to be expensive for this kind of use case.

The facebook "haystack" project (and whitepaper) are good background reading,
describing one type of system that works well for this application.


## Bucket / Folder Structure

Currently we run everything off a single server, with no redundancy. There is
no QA/prod distinction.

Setting access control and doing bulk deletions is easiest at the bucket
level, less easy at the folder level, and most difficult at the suffix (file
extension) level.

For files that are derived from PDFs, we use the SHA-1 (in lower-case hex) of
the source PDF to construct keys. We generate nested "directories" from the
hash to limit the number of keys per "directory" (even though in S3/seaweedfs
there are no actual directories involved). The structure looks like:

    <bucket>/<folder>/<byte0>/<byte1>/<sha1hex><suffix>

Eg:

    sandcrawler/grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml

The nesting is sort of a hold-over from minio (where files were actually
on-disk), but seems worth keeping in case we end up switching storage systems
in the future.
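
As an illustration of the key scheme above, here is a minimal Python sketch
that builds a blob key from a PDF's SHA-1 (the bucket, folder, and suffix
names are just the examples used in this document):

    import hashlib

    def blob_key(folder: str, sha1hex: str, suffix: str) -> str:
        """Construct a nested blob-store key from a lower-case hex SHA-1."""
        sha1hex = sha1hex.lower()
        assert len(sha1hex) == 40
        return f"{folder}/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}{suffix}"

    # Eg, the key for the GROBID TEI-XML derived from some PDF bytes
    pdf_bytes = b"%PDF-1.4 example content"
    sha1hex = hashlib.sha1(pdf_bytes).hexdigest()
    print("sandcrawler/" + blob_key("grobid", sha1hex, ".tei.xml"))
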
## Existing Content

sandcrawler: internal/controlled access to PDF derivatives

    grobid: TEI-XML documents
        extension: .tei.xml

    text: raw pdftotext (or other text transform)
        extension: .txt

thumbnail: public bucket for thumbnail images

    pdf: thumbnails from PDF files
        extension: .180px.jpg

## Proxy and URLs

Internal HTTP access via:

    http://wbgrp-svc169.us.archive.org:8333/<bucket>/<key>

Public access via:

    https://blobs.fatcat.wiki/<bucket>/<key>

Eg:

    http://wbgrp-svc169.us.archive.org:8333/testing/small.txt
    http://wbgrp-svc169.us.archive.org:8333/sandcrawler/grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml
    https://blobs.fatcat.wiki/testing/small.txt
    https://blobs.fatcat.wiki/thumbnail/pdf/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.180px.jpg

diff --git a/extra/blobs/minio/README.md b/extra/blobs/minio/README.md
new file mode 100644
index 0000000..d8f1c69
--- /dev/null
+++ b/extra/blobs/minio/README.md
@@ -0,0 +1,74 @@

minio is used as an S3-compatible blob store. The initial use case is GROBID
XML documents, addressed by the SHA-1 of the PDF file the XML was extracted
from.

Note that on the backend minio is just storing objects as files on disk.

## Deploying minio Server

It seems to be important to use a version of minio from at least the December
2019 era for on-disk compression to actually work.

We currently install minio (and `mc`, the minio client) in prod by simply
downloading the binaries and invoking them from systemd.

## Buckets and Directories

Hosts and buckets:

    localhost:sandcrawler-dev
        create locally for development (see below)

    cluster:sandcrawler
        main sandcrawler storage bucket, for GROBID output and other
        derivatives. Note it isn't "sandcrawler-prod", for backwards
        compatibility reasons.

    cluster:sandcrawler-qa
        for, eg, testing on cluster servers

    cluster:unpaywall
        subset of sandcrawler content crawled due to unpaywall URLs;
        potentially made publicly accessible

Directory structure within sandcrawler buckets:

    grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
        SHA-1 (lower-case hex) of the PDF that the XML was extracted from

Create new buckets like:

    mc mb cluster/sandcrawler-qa

## Development

Run a minio server locally, with non-persisted data:

    docker run -p 9000:9000 minio/minio server /data

Credentials are `minioadmin:minioadmin`. Install the `mc` client utility, and
configure it:

    mc config host add localhost http://localhost:9000 minioadmin minioadmin

Then create the dev bucket:

    mc mb --ignore-existing localhost/sandcrawler-dev

A common "gotcha" with the `mc` command is that it will first look for a
local folder/directory with the same name as the configured remote host, so
make sure there isn't a `./localhost` folder.


## Users

Create a new readonly user like:

    mc admin user add sandcrawler unpaywall $RANDOM_SECRET_KEY readonly

Make a prefix within a bucket world-readable like:

    mc policy set download cluster/unpaywall/grobid

## Config

    mc admin config set aitio compression extensions=.txt,.log,.csv,.json,.tsv,.pdf,.xml mime_types=text/csv,text/plain,application/json,application/xml,application/octet-stream,application/tei+xml
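
As a quick sanity check against the local development setup described above,
here is a hedged sketch using the `boto3` S3 client (the endpoint,
credentials, and bucket name are the development defaults from this README;
the production workers are not necessarily written this way):

    import boto3

    # Dev-mode minio from `docker run -p 9000:9000 minio/minio server /data`
    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",
        aws_access_key_id="minioadmin",
        aws_secret_access_key="minioadmin",
    )

    # PUT then GET a small test object, using the key scheme from the blobs README
    key = "grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml"
    s3.put_object(
        Bucket="sandcrawler-dev",
        Key=key,
        Body=b"<TEI>...</TEI>",
        ContentType="application/xml",
    )
    resp = s3.get_object(Bucket="sandcrawler-dev", Key=key)
    print(resp["Body"].read())
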
diff --git a/extra/blobs/minio/minio.conf b/extra/blobs/minio/minio.conf
new file mode 100644
index 0000000..2e93f9a
--- /dev/null
+++ b/extra/blobs/minio/minio.conf
@@ -0,0 +1,14 @@

# Volume to be used for MinIO server.
MINIO_VOLUMES="/sandcrawler-minio/data"
# Use if you want to run MinIO on a custom port.
MINIO_OPTS="--address :9000"
# Access key of the server.
MINIO_ACCESS_KEY=REDACTED
# Secret key of the server.
MINIO_SECRET_KEY=REDACTED

# may need to set these manually using `mc admin config get`, edit the JSON, then `set`
MINIO_COMPRESS="on"
MINIO_COMPRESS_EXTENSIONS=".txt,.log,.csv,.json,.tar,.xml,.bin,.pdf,.tsv"
MINIO_COMPRESS_MIME_TYPES="text/*,application/json,application/xml,application/pdf,application/octet-stream"

diff --git a/extra/blobs/seaweedfs/README.md b/extra/blobs/seaweedfs/README.md
new file mode 100644
index 0000000..d19e9e0
--- /dev/null
+++ b/extra/blobs/seaweedfs/README.md
@@ -0,0 +1,9 @@

## HOWTO: Create new bucket in SeaweedFS

Log in to the seaweedfs VM.

Run `weed shell` to start a shell, then:

    bucket.create -name <bucket>

diff --git a/extra/blobs/tasks.md b/extra/blobs/tasks.md
new file mode 100644
index 0000000..beb765f
--- /dev/null
+++ b/extra/blobs/tasks.md
@@ -0,0 +1,53 @@

## Backfill GROBID XML to Blob Store

Initially ran this when spinning up the new seaweedfs server to replace minio.
At that time the grobid persist worker was in db-only mode, as minio was too
slow to accept uploads. The rough plan is to:

1. run the grobid persist worker from Kafka with a new temporary consumer
   group, from the start of the GROBID output topic
2. when it gets to the end, stop the *regular* consumer group while this one
   is still running. With the temporary worker still running, at that point in
   time the entire topic should be in S3
3. then reconfigure the regular worker to db+s3 mode. Halt the temporary
   worker, restart the regular one with the new config, and run it
   indefinitely

The consumer group isn't a command-line argument, so just edit
`persist_worker.py` and set it to `persist-grobid-seaweedfs`. Also needed to
patch it a bit so that `--s3-only` mode didn't try to connect to postgresql.

Commands:

    ./sandcrawler_worker.py --kafka-hosts wbgrp-svc350.us.archive.org:9092 --env prod --s3-bucket sandcrawler --s3-url wbgrp-svc169.us.archive.org:8333 persist-grobid --s3-only
    => Consuming from kafka topic sandcrawler-prod.grobid-output-pg, group persist-grobid-seaweed
    => run briefly, then kill

On the kafka-broker worker:

    ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --reset-offsets --to-earliest --group persist-grobid-seaweed --topic sandcrawler-prod.grobid-output-pg --dry-run

Then run 2x instances of the worker (same command as above):

    ./sandcrawler_worker.py --kafka-hosts wbgrp-svc350.us.archive.org:9092 --env prod --s3-bucket sandcrawler --s3-url wbgrp-svc169.us.archive.org:8333 persist-grobid --s3-only

At this point we are CPU-limited on this worker by the python processes (only
4 cores on this machine).

Check in weed shell:

    weed shell

    > fs.meta.cat buckets/sandcrawler/grobid/00/00/000068a76ab125389506e8834483c6ba4c73338a.tei.xml
    [...]
        "isGzipped": false
    [...]
        "mime": "application/xml",
    [...]

An open question is whether we should have separate buckets per derivative
type (eg, a GROBID XML bucket separate from a thumbnails bucket), or whether
prefix directories are enough. Basically this comes down to whether we want
things mixed together at the volume level. I think we should keep them
separate.

Do we need to set the mimetype in the upload for gzip compression to apply to
XML?
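
For reference, a rough sketch of what the temporary backfill worker described
above does, written directly against `confluent_kafka` and `boto3` (the `key`
and `tei_xml` message fields are assumptions about the topic's JSON schema,
and the S3 credentials are placeholders; the real logic lives in
`persist_worker.py` / `sandcrawler_worker.py`):

    import json

    import boto3
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "wbgrp-svc350.us.archive.org:9092",
        "group.id": "persist-grobid-seaweedfs",  # temporary consumer group
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["sandcrawler-prod.grobid-output-pg"])

    s3 = boto3.client(
        "s3",
        endpoint_url="http://wbgrp-svc169.us.archive.org:8333",
        aws_access_key_id="REDACTED",      # placeholder credentials
        aws_secret_access_key="REDACTED",
    )

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        sha1hex = record.get("key")        # assumed field: SHA-1 of source PDF
        tei_xml = record.get("tei_xml")    # assumed field: extracted TEI-XML
        if not sha1hex or not tei_xml:
            continue
        s3.put_object(
            Bucket="sandcrawler",
            Key=f"grobid/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}.tei.xml",
            Body=tei_xml.encode("utf-8"),
            # relevant to the open mimetype/gzip question above
            ContentType="application/xml",
        )
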