author | Bryan Newbold <bnewbold@archive.org> | 2019-12-26 21:14:20 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-01-02 18:12:58 -0800 |
commit | 9fda5323046cb3f87f0c7c7e07eca283ca52ce99 (patch) | |
tree | 37b8d53777ca320db52cf67af15170f7c571cac2 /minio | |
parent | 33f6744b56a9ca7b01cb4ed7b80bdf70a972ffa8 (diff) | |
download | sandcrawler-9fda5323046cb3f87f0c7c7e07eca283ca52ce99.tar.gz sandcrawler-9fda5323046cb3f87f0c7c7e07eca283ca52ce99.zip |
update minio README
Diffstat (limited to 'minio')
-rw-r--r-- | minio/README.md | 52 |
1 files changed, 42 insertions, 10 deletions
diff --git a/minio/README.md b/minio/README.md
index 3ce0f95..fd914f0 100644
--- a/minio/README.md
+++ b/minio/README.md
@@ -4,21 +4,52 @@ documents, addressed by the sha1 of the PDF file the XML was extracted from.
 
 Note that on the backend minio is just storing objects as files on disk.
 
-## Buckets
+## Buckets and Directories
 
-Notable buckets, and structure/naming convention:
+Hosts and buckets:
 
-    grobid/
-        2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
-        SHA1 (lower-case hex) of PDF that XML was extracted from
-    unpaywall/grobid/
-        2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
+    localhost:sandcrawler-dev
+        create locally for development (see below)
+
+    cluster:sandcrawler
+        main sandcrawler storage bucket, for GROBID output and other derivatives.
+        Note it isn't "sandcrawler-prod", for backwards compatibility reasons.
+
+    cluster:sandcrawler-qa
+        for, eg, testing on cluster servers
+
+    cluster:unpaywall
+        subset of sandcrawler content crawled due to unpaywall URLs;
+        potentially made publicly accessible
+
+Directory structure within sandcrawler buckets:
+
+    grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
         SHA1 (lower-case hex) of PDF that XML was extracted from
-        (mirror of /grobid/ for which we crawled for unpaywall and made publicly accessible)
 
 Create new buckets like:
 
-    mc mb sandcrawler/grobid
+    mc mb cluster/sandcrawler-qa
+
+## Development
+
+Run minio server locally, with non-persisted data:
+
+    docker run -p 9000:9000 minio/minio server /data
+
+Credentials are `minioadmin:minioadmin`. Install `mc` client utility, and
+configure:
+
+    mc config host add localhost http://localhost:9000 minioadmin minioadmin
+
+Then create dev bucket:
+
+    mc mb --ignore-existing localhost/sandcrawler-dev
+
+A common "gotcha" with `mc` command is that it will first look for a local
+folder/directory with same name as the configured remote host, so make sure
+there isn't a `./localhost` folder.
+
 
 ## Users
 
@@ -28,4 +59,5 @@ Create a new readonly user like:
 
 Make a prefix within a bucket world-readable like:
 
-    mc policy set download sandcrawler/unpaywall/grobid
+    mc policy set download cluster/unpaywall/grobid
+
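
The object path in this README change (`grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml`) appears to shard objects by the first two byte pairs of the PDF's lower-case SHA1 hex digest. The following is a minimal Python sketch of that mapping, inferred only from the single example path in the diff. The function names `grobid_key_from_sha1` and `grobid_key_from_pdf` are hypothetical and are not taken from the sandcrawler code.

```python
import hashlib

# Assumption: both directory levels are the first two hex-character pairs of
# the SHA1, as suggested by the one example path in the README.
HEX_DIGITS = set("0123456789abcdef")


def grobid_key_from_sha1(sha1hex: str) -> str:
    """Build the object key for a GROBID TEI-XML file from a SHA1 hex digest."""
    # The README specifies lower-case hex, so normalize before validating.
    sha1hex = sha1hex.strip().lower()
    if len(sha1hex) != 40 or not set(sha1hex) <= HEX_DIGITS:
        raise ValueError(f"not a SHA1 hex digest: {sha1hex!r}")
    return f"grobid/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}.tei.xml"


def grobid_key_from_pdf(pdf_bytes: bytes) -> str:
    """Hash the raw PDF bytes and build the corresponding object key."""
    # hashlib's hexdigest() already returns lower-case hex.
    return grobid_key_from_sha1(hashlib.sha1(pdf_bytes).hexdigest())


if __name__ == "__main__":
    # Reproduces the example path from the README.
    print(grobid_key_from_sha1("2c0daa9307887a27054d4d1f137514b0fa6c6b2d"))
```

The key is relative to a bucket, so with the host aliases described in the README the full `mc` target would presumably look like `cluster/sandcrawler/<key>` or `localhost/sandcrawler-dev/<key>`.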