From 9fda5323046cb3f87f0c7c7e07eca283ca52ce99 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 26 Dec 2019 21:14:20 -0800
Subject: update minio README

---
 minio/README.md | 52 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 10 deletions(-)

(limited to 'minio')

diff --git a/minio/README.md b/minio/README.md
index 3ce0f95..fd914f0 100644
--- a/minio/README.md
+++ b/minio/README.md
@@ -4,21 +4,52 @@ documents, addressed by the sha1 of the PDF file the XML was extracted from.
 
 Note that on the backend minio is just storing objects as files on disk.
 
-## Buckets
+## Buckets and Directories
 
-Notable buckets, and structure/naming convention:
+Hosts and buckets:
 
-    grobid/
-        2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
-        SHA1 (lower-case hex) of PDF that XML was extracted from
-    unpaywall/grobid/
-        2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
+    localhost:sandcrawler-dev
+        create locally for development (see below)
+
+    cluster:sandcrawler
+        main sandcrawler storage bucket, for GROBID output and other derivatives.
+        Note it isn't "sandcrawler-prod", for backwards compatibility reasons.
+
+    cluster:sandcrawler-qa
+        for, eg, testing on cluster servers
+
+    cluster:unpaywall
+        subset of sandcrawler content crawled due to unpaywall URLs;
+        potentially made publicly accessible
+
+Directory structure within sandcrawler buckets:
+
+    grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
         SHA1 (lower-case hex) of PDF that XML was extracted from
-        (mirror of /grobid/ for which we crawled for unpaywall and made publicly accessible)
 
 Create new buckets like:
 
-    mc mb sandcrawler/grobid
+    mc mb cluster/sandcrawler-qa
+
+## Development
+
+Run minio server locally, with non-persisted data:
+
+    docker run -p 9000:9000 minio/minio server /data
+
+Credentials are `minioadmin:minioadmin`. Install `mc` client utility, and
+configure:
+
+    mc config host add localhost http://localhost:9000 minioadmin minioadmin
+
+Then create dev bucket:
+
+    mc mb --ignore-existing localhost/sandcrawler-dev
+
+A common "gotcha" with `mc` command is that it will first look for a local
+folder/directory with same name as the configured remote host, so make sure
+there isn't a `./localhost` folder.
+
 
 ## Users
 
@@ -28,4 +59,5 @@ Create a new readonly user like:
 
 Make a prefix within a bucket world-readable like:
 
-    mc policy set download sandcrawler/unpaywall/grobid
+    mc policy set download cluster/unpaywall/grobid
+
-- 
cgit v1.2.3
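
Note on the naming convention in this patch: the README gives a single example path, `grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml`, keyed by the lower-case hex SHA1 of the source PDF. A minimal Python sketch of how such a key could be derived is below. The function names `grobid_object_key` and `sha1_of_pdf` are hypothetical and are not taken from the sandcrawler code. The layout (two directory levels built from the first four hex characters, then the full digest plus `.tei.xml`) is an assumption inferred from that one example.

```python
import hashlib
import re

# A SHA1 digest is 20 bytes, i.e. 40 lower-case hex characters.
_SHA1_HEX = re.compile(r"[0-9a-f]{40}")


def sha1_of_pdf(pdf_bytes: bytes) -> str:
    """Return the lower-case hex SHA1 of the raw PDF bytes."""
    return hashlib.sha1(pdf_bytes).hexdigest()


def grobid_object_key(sha1hex: str) -> str:
    """Build the object key for a GROBID TEI-XML document.

    Hypothetical helper: the layout is inferred from the README example,
    not copied from sandcrawler itself.
    """
    sha1hex = sha1hex.lower()  # the README specifies lower-case hex
    if not _SHA1_HEX.fullmatch(sha1hex):
        raise ValueError(f"not a 40-character hex SHA1: {sha1hex!r}")
    # Two levels of two-character prefixes, then the full digest.
    return f"grobid/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}.tei.xml"
```

Under these assumptions, `grobid_object_key("2c0daa9307887a27054d4d1f137514b0fa6c6b2d")` reproduces the example path from the README. The bucket itself (for example `sandcrawler` or `sandcrawler-dev`) is chosen separately and is not part of the key.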