minio is used as an S3-compatible blob store. The initial use case is GROBID XML
documents, addressed by the SHA1 of the PDF file the XML was extracted from.

Note that on the backend, minio simply stores objects as files on disk.

## Deploying minio Server

On-disk compression only seems to work with a minio release from December 2019
or later.

In prod, minio (and mc, the minio client) are currently installed by simply
downloading the binaries and launching them from systemd.

## Buckets and Directories

Hosts and buckets:

    localhost:sandcrawler-dev
        create locally for development (see below)

    cluster:sandcrawler
        main sandcrawler storage bucket, for GROBID output and other derivatives.
        Note it isn't "sandcrawler-prod", for backwards compatibility reasons.

    cluster:sandcrawler-qa
        for, eg, testing on cluster servers

    cluster:unpaywall
        subset of sandcrawler content crawled due to unpaywall URLs;
        potentially made publicly accessible

Directory structure within sandcrawler buckets:

    grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
        SHA1 (lower-case hex) of PDF that XML was extracted from
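
The key layout above can be sketched as a small shell helper. This is only a sketch: it assumes the two directory levels are the first and second pairs of hex characters of the SHA1, as the example path suggests, and `grobid_key` is a hypothetical name, not part of sandcrawler or `mc`.

```shell
# Hypothetical helper (not part of sandcrawler or mc): derive the object key
# for a GROBID TEI-XML document from the SHA1 of the source PDF.
# Assumption: the directory levels are hex characters 1-2 and 3-4 of the SHA1.
grobid_key() {
    # Normalize to lower-case hex, matching the documented convention
    sha1=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    level1=$(printf '%s' "$sha1" | cut -c1-2)
    level2=$(printf '%s' "$sha1" | cut -c3-4)
    printf 'grobid/%s/%s/%s.tei.xml\n' "$level1" "$level2" "$sha1"
}

# Prints: grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
grobid_key 2C0DAA9307887A27054D4D1F137514B0FA6C6B2D
```
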

Create new buckets like:

    mc mb cluster/sandcrawler-qa

## Development

Run minio server locally, with non-persisted data:

    docker run -p 9000:9000 minio/minio server /data

Credentials are `minioadmin:minioadmin`. Install the `mc` client utility and
configure it:

    mc config host add localhost http://localhost:9000 minioadmin minioadmin

Then create the dev bucket:

    mc mb --ignore-existing localhost/sandcrawler-dev

A common "gotcha" with the `mc` command is that it first looks for a local
folder/directory with the same name as the configured remote host, so make sure
there isn't a `./localhost` folder.
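
The check described above can be scripted before invoking `mc`. This is a minimal sketch, assuming the host name is passed as the first argument; `alias_is_shadowed` is a hypothetical helper, not an `mc` feature.

```shell
# Hypothetical guard (not an mc feature): succeeds if a file or directory in
# the current working directory has the same name as the given mc host name.
alias_is_shadowed() {
    [ -e "./$1" ]
}

if alias_is_shadowed localhost; then
    echo "warning: ./localhost exists and may be used by mc instead of the remote host" >&2
fi
```
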
## Users

Create a new readonly user like:

    mc admin user add sandcrawler unpaywall $RANDOM_SECRET_KEY readonly

Make a prefix within a bucket world-readable like:

    mc policy set download cluster/unpaywall/grobid

## Config

Set the file extensions and MIME types to which on-disk compression applies:

    mc admin config set aitio compression extensions=.txt,.log,.csv,.json,.tsv,.pdf,.xml mime_types=text/csv,text/plain,application/json,application/xml,application/octet-stream,application/tei+xml