aboutsummaryrefslogtreecommitdiffstats
path: root/blobs
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@archive.org>2020-05-28 14:27:03 -0700
committerBryan Newbold <bnewbold@archive.org>2020-05-28 14:27:05 -0700
commit5684e2e748e5ddc5962597711af1a63722c4ebde (patch)
tree8cb585b2aab5d606cd1fccf0efa86e3ffc14cc30 /blobs
parent79c3d690c12ad46d7ac7c2bfcded536dbbf5fe20 (diff)
downloadsandcrawler-5684e2e748e5ddc5962597711af1a63722c4ebde.tar.gz
sandcrawler-5684e2e748e5ddc5962597711af1a63722c4ebde.zip
move minio directory to 'blobs'
Part of migration from minio to seaweedfs, should be agnostic about what our actual blobstore (S3 API) is.
Diffstat (limited to 'blobs')
-rw-r--r--blobs/minio/README.md74
-rw-r--r--blobs/minio/minio.conf14
2 files changed, 88 insertions, 0 deletions
diff --git a/blobs/minio/README.md b/blobs/minio/README.md
new file mode 100644
index 0000000..d8f1c69
--- /dev/null
+++ b/blobs/minio/README.md
@@ -0,0 +1,74 @@
+
+minio is used as an S3-compatible blob store. Initial use case is GROBID XML
+documents, addressed by the sha1 of the PDF file the XML was extracted from.
+
+Note that on the backend minio is just storing objects as files on disk.
+
+## Deploying minio Server
+
+It seems to be important to use a version of minio from at least December 2019
+era for on-disk compression to actually work.
+
+Currently install minio (and mc, the minio client) in prod by simply
+downloading the binaries and calling from systemd.
+
+## Buckets and Directories
+
+Hosts and buckets:
+
+ localhost:sandcrawler-dev
+ create locally for development (see below)
+
+ cluster:sandcrawler
+ main sandcrawler storage bucket, for GROBID output and other derivatives.
+ Note it isn't "sandcrawler-prod", for backwards compatibility reasons.
+
+ cluster:sandcrawler-qa
+ for, eg, testing on cluster servers
+
+ cluster:unpaywall
+ subset of sandcrawler content crawled due to unpaywall URLs;
+ potentially made publicly accessible
+
+Directory structure within sandcrawler buckets:
+
+ grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
+ SHA1 (lower-case hex) of PDF that XML was extracted from
+
+Create new buckets like:
+
+ mc mb cluster/sandcrawler-qa
+
+## Development
+
+Run minio server locally, with non-persisted data:
+
+ docker run -p 9000:9000 minio/minio server /data
+
+Credentials are `minioadmin:minioadmin`. Install `mc` client utility, and
+configure:
+
+ mc config host add localhost http://localhost:9000 minioadmin minioadmin
+
+Then create dev bucket:
+
+ mc mb --ignore-existing localhost/sandcrawler-dev
+
+A common "gotcha" with `mc` command is that it will first look for a local
+folder/directory with same name as the configured remote host, so make sure
+there isn't a `./localhost` folder.
+
+
+## Users
+
+Create a new readonly user like:
+
+ mc admin user add sandcrawler unpaywall $RANDOM_SECRET_KEY readonly
+
+Make a prefix within a bucket world-readable like:
+
+ mc policy set download cluster/unpaywall/grobid
+
+## Config
+
+ mc admin config set aitio compression extensions=.txt,.log,.csv,.json,.tsv,.pdf,.xml mime_types=text/csv,text/plain,application/json,application/xml,application/octet-stream,application/tei+xml
diff --git a/blobs/minio/minio.conf b/blobs/minio/minio.conf
new file mode 100644
index 0000000..2e93f9a
--- /dev/null
+++ b/blobs/minio/minio.conf
@@ -0,0 +1,14 @@
+
+# Volume to be used for MinIO server.
+MINIO_VOLUMES="/sandcrawler-minio/data"
+# Use if you want to run MinIO on a custom port.
+MINIO_OPTS="--address :9000"
+# Access Key of the server.
+MINIO_ACCESS_KEY=REDACTED
+# Secret key of the server.
+MINIO_SECRET_KEY=REDACTED
+
+# may need to set these manually using `mc admin config get`, edit the JSON, then `set`
+MINIO_COMPRESS="on"
+MINIO_COMPRESS_EXTENSIONS=".txt,.log,.csv,.json,.tar,.xml,.bin,.pdf,.tsv"
+MINIO_COMPRESS_MIME_TYPES="text/*,application/json,application/xml,application/pdf,application/octet-stream"