author Bryan Newbold <bnewbold@archive.org> 2019-12-26 21:14:20 -0800
committer Bryan Newbold <bnewbold@archive.org> 2020-01-02 18:12:58 -0800
commit 9fda5323046cb3f87f0c7c7e07eca283ca52ce99
tree 37b8d53777ca320db52cf67af15170f7c571cac2
parent 33f6744b56a9ca7b01cb4ed7b80bdf70a972ffa8
update minio README
-rw-r--r-- minio/README.md 52
1 files changed, 42 insertions, 10 deletions
diff --git a/minio/README.md b/minio/README.md
index 3ce0f95..fd914f0 100644
--- a/minio/README.md
+++ b/minio/README.md
@@ -4,21 +4,52 @@ documents, addressed by the sha1 of the PDF file the XML was extracted from.
Note that on the backend minio is just storing objects as files on disk.
-## Buckets
+## Buckets and Directories
-Notable buckets, and structure/naming convention:
+Hosts and buckets:
- grobid/
- 2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
- SHA1 (lower-case hex) of PDF that XML was extracted from
- unpaywall/grobid/
- 2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
+ localhost:sandcrawler-dev
+ create locally for development (see below)
+
+ cluster:sandcrawler
+ main sandcrawler storage bucket, for GROBID output and other derivatives.
+ Note it isn't "sandcrawler-prod", for backwards compatibility reasons.
+
+ cluster:sandcrawler-qa
+ for, e.g., testing on cluster servers
+
+ cluster:unpaywall
+ subset of sandcrawler content crawled due to unpaywall URLs;
+ potentially made publicly accessible
+
+Directory structure within sandcrawler buckets:
+
+ grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
SHA1 (lower-case hex) of PDF that XML was extracted from
- (mirror of /grobid/ for which we crawled for unpaywall and made publicly accessible)
Create new buckets like:
- mc mb sandcrawler/grobid
+ mc mb cluster/sandcrawler-qa
+
+## Development
+
+Run a minio server locally, with non-persisted data:
+
+ docker run -p 9000:9000 minio/minio server /data
+
+Credentials are `minioadmin:minioadmin`. Install the `mc` client utility and
+configure it:
+
+ mc config host add localhost http://localhost:9000 minioadmin minioadmin
+
+Then create the dev bucket:
+
+ mc mb --ignore-existing localhost/sandcrawler-dev
+
+A common "gotcha" with the `mc` command is that it first looks for a local
+folder/directory with the same name as the configured remote host, so make sure
+there isn't a `./localhost` folder.
+
## Users
@@ -28,4 +59,5 @@ Create a new readonly user like:
Make a prefix within a bucket world-readable like:
- mc policy set download sandcrawler/unpaywall/grobid
+ mc policy set download cluster/unpaywall/grobid
+
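
The directory structure described in this change (`grobid/2c/0d/2c0daa...6b2d.tei.xml`) nests each object under the first two and next two hex characters of the PDF's lower-case SHA1. The sketch below shows one way to derive such a key in POSIX shell. The function name `grobid_key` is hypothetical and is not part of sandcrawler or `mc`; the 40-character validation is an added assumption, not something the README requires.

```shell
# Hypothetical helper (not part of sandcrawler): build the object key for a
# GROBID TEI-XML document from the SHA1 of the source PDF.
grobid_key() {
    # Normalize to lower-case hex, as the README specifies.
    sha1=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    # Assumption: reject anything that is not exactly 40 hex characters.
    if ! printf '%s' "$sha1" | grep -Eq '^[0-9a-f]{40}$'; then
        echo "not a SHA1 hex digest: $1" >&2
        return 1
    fi
    # Characters 1-2 and 3-4 of the digest form the two directory levels.
    p1=$(printf '%s' "$sha1" | cut -c1-2)
    p2=$(printf '%s' "$sha1" | cut -c3-4)
    printf 'grobid/%s/%s/%s.tei.xml\n' "$p1" "$p2" "$sha1"
}

grobid_key 2c0daa9307887a27054d4d1f137514b0fa6c6b2d
# prints: grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
```

With a host alias configured as in the README, the resulting key could then be appended to a bucket path such as `cluster/sandcrawler/` when fetching an object with `mc`.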