author | Bryan Newbold <bnewbold@archive.org> | 2022-12-23 15:52:02 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2022-12-23 15:52:02 -0800 |
commit | f3a721a9dce8801b78f7bc31e88dc912b0ec1dba (patch) | |
tree | fdae9373e78671d0031f83045e6c76de9ad616cf /blobs/minio | |
parent | 8c2c354a74064f2d66644af8f4e44d74bf322e1f (diff) | |
move a bunch of top-level files/directories to ./extra/
Diffstat (limited to 'blobs/minio')
-rw-r--r-- | blobs/minio/README.md | 74 |
-rw-r--r-- | blobs/minio/minio.conf | 14 |
2 files changed, 0 insertions, 88 deletions
diff --git a/blobs/minio/README.md b/blobs/minio/README.md
deleted file mode 100644
index d8f1c69..0000000
--- a/blobs/minio/README.md
+++ /dev/null
@@ -1,74 +0,0 @@
-
-minio is used as an S3-compatible blob store. Initial use case is GROBID XML
-documents, addressed by the sha1 of the PDF file the XML was extracted from.
-
-Note that on the backend minio is just storing objects as files on disk.
-
-## Deploying minio Server
-
-It seems to be important to use a version of minio from at least December 2019
-era for on-disk compression to actually work.
-
-Currently install minio (and mc, the minio client) in prod by simply
-downloading the binaries and calling from systemd.
-
-## Buckets and Directories
-
-Hosts and buckets:
-
-    localhost:sandcrawler-dev
-        create locally for development (see below)
-
-    cluster:sandcrawler
-        main sandcrawler storage bucket, for GROBID output and other derivatives.
-        Note it isn't "sandcrawler-prod", for backwards compatibility reasons.
-
-    cluster:sandcrawler-qa
-        for, eg, testing on cluster servers
-
-    cluster:unpaywall
-        subset of sandcrawler content crawled due to unpaywall URLs;
-        potentially made publicly accessible
-
-Directory structure within sandcrawler buckets:
-
-    grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
-        SHA1 (lower-case hex) of PDF that XML was extracted from
-
-Create new buckets like:
-
-    mc mb cluster/sandcrawler-qa
-
-## Development
-
-Run minio server locally, with non-persisted data:
-
-    docker run -p 9000:9000 minio/minio server /data
-
-Credentials are `minioadmin:minioadmin`. Install `mc` client utility, and
-configure:
-
-    mc config host add localhost http://localhost:9000 minioadmin minioadmin
-
-Then create dev bucket:
-
-    mc mb --ignore-existing localhost/sandcrawler-dev
-
-A common "gotcha" with `mc` command is that it will first look for a local
-folder/directory with same name as the configured remote host, so make sure
-there isn't a `./localhost` folder.
-
-
-## Users
-
-Create a new readonly user like:
-
-    mc admin user add sandcrawler unpaywall $RANDOM_SECRET_KEY readonly
-
-Make a prefix within a bucket world-readable like:
-
-    mc policy set download cluster/unpaywall/grobid
-
-## Config
-
-    mc admin config set aitio compression extensions=.txt,.log,.csv,.json,.tsv,.pdf,.xml mime_types=text/csv,text/plain,application/json,application/xml,application/octet-stream,application/tei+xml
diff --git a/blobs/minio/minio.conf b/blobs/minio/minio.conf
deleted file mode 100644
index 2e93f9a..0000000
--- a/blobs/minio/minio.conf
+++ /dev/null
@@ -1,14 +0,0 @@
-
-# Volume to be used for MinIO server.
-MINIO_VOLUMES="/sandcrawler-minio/data"
-# Use if you want to run MinIO on a custom port.
-MINIO_OPTS="--address :9000"
-# Access Key of the server.
-MINIO_ACCESS_KEY=REDACTED
-# Secret key of the server.
-MINIO_SECRET_KEY=REDACTED
-
-# may need to set these manually using `mc admin config get`, edit the JSON, then `set`
-MINIO_COMPRESS="on"
-MINIO_COMPRESS_EXTENSIONS=".txt,.log,.csv,.json,.tar,.xml,.bin,.pdf,.tsv"
-MINIO_COMPRESS_MIME_TYPES="text/*,application/json,application/xml,application/pdf,application/octet-stream"
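
The removed README gives one example object path, `grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml`, keyed by the lower-case hex SHA1 of the source PDF. The following is a minimal sketch of how such a key could be derived. It assumes the two directory levels are the first and second pairs of hex characters of the SHA1, which is inferred from that single example and not stated by the README. The names `grobid_blob_key` and `sha1_of_pdf_bytes` are hypothetical and are not taken from the sandcrawler code.

```python
import hashlib
import re

# A SHA1 digest in lower-case hex is exactly 40 characters from [0-9a-f].
SHA1_HEX_RE = re.compile(r"[0-9a-f]{40}")


def sha1_of_pdf_bytes(pdf_bytes: bytes) -> str:
    """Return the lower-case hex SHA1 of the raw PDF bytes."""
    return hashlib.sha1(pdf_bytes).hexdigest()


def grobid_blob_key(sha1hex: str, prefix: str = "grobid", suffix: str = ".tei.xml") -> str:
    """Build an object key shaped like the README example.

    Assumption: the two nested directories are characters 0-1 and 2-3
    of the SHA1 hex string (inferred from the one documented path).
    """
    sha1hex = sha1hex.strip().lower()
    if not SHA1_HEX_RE.fullmatch(sha1hex):
        raise ValueError(f"not a 40-character hex SHA1: {sha1hex!r}")
    return f"{prefix}/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}{suffix}"


if __name__ == "__main__":
    # Reproduces the path shown in the README.
    print(grobid_blob_key("2c0daa9307887a27054d4d1f137514b0fa6c6b2d"))
```

Under the same assumption, a key built this way could be appended to a bucket alias such as `cluster/sandcrawler/` when addressing an object with `mc`.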