minio is used as an S3-compatible blob store. The initial use case is storing
GROBID XML documents, each addressed by the SHA1 of the PDF file it was
extracted from.

Note that on the backend minio is just storing objects as files on disk.

## Deploying minio Server

On-disk compression only seems to work with minio releases from around
December 2019 or later, so use a version at least that recent.

In prod, minio (and mc, the minio client) are currently installed by simply
downloading the binaries and calling them from systemd.

## Buckets and Directories

Hosts and buckets:

    localhost:sandcrawler-dev
        create locally for development (see below)

    cluster:sandcrawler
        main sandcrawler storage bucket, for GROBID output and other derivatives.
        Note it isn't "sandcrawler-prod", for backwards compatibility reasons.

    cluster:sandcrawler-qa
        e.g. for testing on cluster servers

    cluster:unpaywall
        subset of sandcrawler content crawled due to unpaywall URLs;
        potentially made publicly accessible

Directory structure within sandcrawler buckets:

    grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
        SHA1 (lower-case hex) of PDF that XML was extracted from
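
The key layout above can be derived mechanically from a PDF's SHA1. The
following is a minimal sketch, not part of sandcrawler itself: `grobid_key` is
a hypothetical helper name, and the two-level prefix is inferred from the
single example path shown above.

```shell
# Hypothetical helper (not part of sandcrawler): print the object key for a
# GROBID TEI-XML document, given the SHA1 hex digest of the source PDF.
grobid_key() {
    # Normalize to lower-case hex, matching the bucket layout.
    sha1=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    # The first two and the next two hex characters form the directory prefix.
    p1=$(printf '%s' "$sha1" | cut -c1-2)
    p2=$(printf '%s' "$sha1" | cut -c3-4)
    printf 'grobid/%s/%s/%s.tei.xml\n' "$p1" "$p2" "$sha1"
}
```

For example, `grobid_key "$(sha1sum paper.pdf | cut -d' ' -f1)"` prints the key
for a local `paper.pdf` (a placeholder file name), which can then be appended
to a bucket path such as `cluster/sandcrawler/`.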

Create new buckets like:

    mc mb cluster/sandcrawler-qa

## Development

Run minio server locally, with non-persisted data:

    docker run -p 9000:9000 minio/minio server /data

Credentials are `minioadmin:minioadmin`. Install the `mc` client utility and
configure it:

    mc config host add localhost http://localhost:9000 minioadmin minioadmin

Then create the dev bucket:

    mc mb --ignore-existing localhost/sandcrawler-dev

A common "gotcha" with the `mc` command is that it first looks for a local
folder/directory with the same name as the configured remote host, so make
sure there isn't a `./localhost` folder.
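
One way to guard against this in scripts is to check the working directory
before calling `mc`. This is a minimal sketch; `check_mc_alias` is a
hypothetical helper name, and it only tests for a path in the current
directory, which is the case described above.

```shell
# Hypothetical helper: warn if a local path would shadow an mc host alias.
# Returns 0 when the alias name is safe to use, 1 when ./<alias> exists.
check_mc_alias() {
    alias_name=$1
    if [ -e "./$alias_name" ]; then
        printf 'warning: ./%s exists and may shadow the mc host "%s"\n' \
            "$alias_name" "$alias_name" >&2
        return 1
    fi
    return 0
}
```

A script could then run `check_mc_alias localhost && mc mb --ignore-existing
localhost/sandcrawler-dev`, so the bucket is only created when no local
`./localhost` path is in the way.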


## Users

Create a new readonly user like:

    mc admin user add sandcrawler unpaywall $RANDOM_SECRET_KEY readonly

Make a prefix within a bucket world-readable like:

    mc policy set download cluster/unpaywall/grobid

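## Config

Enable on-disk compression for selected file extensions and MIME types (the
first argument after `set`, here `aitio`, is the configured host alias):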

    mc admin config set aitio compression extensions=.txt,.log,.csv,.json,.tsv,.pdf,.xml mime_types=text/csv,text/plain,application/json,application/xml,application/octet-stream,application/tei+xml