author | Bryan Newbold <bnewbold@archive.org> | 2020-05-28 14:27:03 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-05-28 14:27:05 -0700 |
commit | 5684e2e748e5ddc5962597711af1a63722c4ebde (patch) | |
tree | 8cb585b2aab5d606cd1fccf0efa86e3ffc14cc30 /blobs/minio | |
parent | 79c3d690c12ad46d7ac7c2bfcded536dbbf5fe20 (diff) | |
download | sandcrawler-5684e2e748e5ddc5962597711af1a63722c4ebde.tar.gz sandcrawler-5684e2e748e5ddc5962597711af1a63722c4ebde.zip |
move minio directory to 'blobs'
Part of migration from minio to seaweedfs, should be agnostic about what
our actual blobstore (S3 API) is.
Diffstat (limited to 'blobs/minio')
-rw-r--r-- | blobs/minio/README.md | 74 |
-rw-r--r-- | blobs/minio/minio.conf | 14 |
2 files changed, 88 insertions, 0 deletions
diff --git a/blobs/minio/README.md b/blobs/minio/README.md
new file mode 100644
index 0000000..d8f1c69
--- /dev/null
+++ b/blobs/minio/README.md
@@ -0,0 +1,74 @@
+
+minio is used as an S3-compatible blob store. Initial use case is GROBID XML
+documents, addressed by the sha1 of the PDF file the XML was extracted from.
+
+Note that on the backend minio is just storing objects as files on disk.
+
+## Deploying minio Server
+
+It seems to be important to use a version of minio from at least December 2019
+era for on-disk compression to actually work.
+
+Currently install minio (and mc, the minio client) in prod by simply
+downloading the binaries and calling from systemd.
+
+## Buckets and Directories
+
+Hosts and buckets:
+
+    localhost:sandcrawler-dev
+        create locally for development (see below)
+
+    cluster:sandcrawler
+        main sandcrawler storage bucket, for GROBID output and other derivatives.
+        Note it isn't "sandcrawler-prod", for backwards compatibility reasons.
+
+    cluster:sandcrawler-qa
+        for, eg, testing on cluster servers
+
+    cluster:unpaywall
+        subset of sandcrawler content crawled due to unpaywall URLs;
+        potentially made publicly accessible
+
+Directory structure within sandcrawler buckets:
+
+    grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
+        SHA1 (lower-case hex) of PDF that XML was extracted from
+
+Create new buckets like:
+
+    mc mb cluster/sandcrawler-qa
+
+## Development
+
+Run minio server locally, with non-persisted data:
+
+    docker run -p 9000:9000 minio/minio server /data
+
+Credentials are `minioadmin:minioadmin`. Install `mc` client utility, and
+configure:
+
+    mc config host add localhost http://localhost:9000 minioadmin minioadmin
+
+Then create dev bucket:
+
+    mc mb --ignore-existing localhost/sandcrawler-dev
+
+A common "gotcha" with `mc` command is that it will first look for a local
+folder/directory with same name as the configured remote host, so make sure
+there isn't a `./localhost` folder.
+
+
+## Users
+
+Create a new readonly user like:
+
+    mc admin user add sandcrawler unpaywall $RANDOM_SECRET_KEY readonly
+
+Make a prefix within a bucket world-readable like:
+
+    mc policy set download cluster/unpaywall/grobid
+
+## Config
+
+    mc admin config set aitio compression extensions=.txt,.log,.csv,.json,.tsv,.pdf,.xml mime_types=text/csv,text/plain,application/json,application/xml,application/octet-stream,application/tei+xml
diff --git a/blobs/minio/minio.conf b/blobs/minio/minio.conf
new file mode 100644
index 0000000..2e93f9a
--- /dev/null
+++ b/blobs/minio/minio.conf
@@ -0,0 +1,14 @@
+
+# Volume to be used for MinIO server.
+MINIO_VOLUMES="/sandcrawler-minio/data"
+# Use if you want to run MinIO on a custom port.
+MINIO_OPTS="--address :9000"
+# Access Key of the server.
+MINIO_ACCESS_KEY=REDACTED
+# Secret key of the server.
+MINIO_SECRET_KEY=REDACTED
+
+# may need to set these manually using `mc admin config get`, edit the JSON, then `set`
+MINIO_COMPRESS="on"
+MINIO_COMPRESS_EXTENSIONS=".txt,.log,.csv,.json,.tar,.xml,.bin,.pdf,.tsv"
+MINIO_COMPRESS_MIME_TYPES="text/*,application/json,application/xml,application/pdf,application/octet-stream"
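
The README in this diff says that objects are addressed by the lower-case hex SHA1 of the source PDF, stored under a path like `grobid/2c/0d/<sha1>.tei.xml`. A minimal sketch of how such a key could be derived is below. The function names `pdf_sha1hex` and `blob_path` are hypothetical and are not taken from the sandcrawler code. The assumption that the two directory levels are the first two byte-pairs of the digest is inferred from the single example path in the README.

```python
import hashlib
import string


def pdf_sha1hex(pdf_bytes: bytes) -> str:
    """Return the lower-case hex SHA1 digest of a PDF's raw bytes."""
    return hashlib.sha1(pdf_bytes).hexdigest()


def blob_path(sha1hex: str, folder: str = "grobid", extension: str = ".tei.xml") -> str:
    """Build an object key of the form <folder>/<aa>/<bb>/<sha1hex><extension>.

    Hypothetical helper: the two-level prefix (characters 0-1 and 2-3 of the
    digest) is an assumption based on the example path in the README.
    """
    sha1hex = sha1hex.lower()
    if len(sha1hex) != 40 or any(c not in string.hexdigits for c in sha1hex):
        raise ValueError("expected a 40-character hex SHA1 digest")
    return f"{folder}/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}{extension}"


if __name__ == "__main__":
    # Digest taken from the example path in the README.
    print(blob_path("2c0daa9307887a27054d4d1f137514b0fa6c6b2d"))
```

Keeping the key derivation in one small function means the layout stays the same regardless of which S3-compatible blob store sits behind it, which fits the commit's stated goal of being agnostic about the backend.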