Diffstat (limited to 'extra/blobs')
-rw-r--r--    extra/blobs/README.md              86
-rw-r--r--    extra/blobs/minio/README.md        74
-rw-r--r--    extra/blobs/minio/minio.conf       14
-rw-r--r--    extra/blobs/seaweedfs/README.md     9
-rw-r--r--    extra/blobs/tasks.md               53
5 files changed, 236 insertions, 0 deletions
diff --git a/extra/blobs/README.md b/extra/blobs/README.md
new file mode 100644
index 0000000..555db92
--- /dev/null
+++ b/extra/blobs/README.md
@@ -0,0 +1,86 @@

This document describes sandcrawler/fatcat use of "blob store" infrastructure
for storing hundreds of millions of small files: for example, GROBID XML
documents and JPEG thumbnails of PDFs.

The basic feature requirements for this system are:

- no need for preservation-grade data resiliency: all of this data is derived
  from primary content, and is usually redundantly stored in Kafka topics (and
  thus can be re-indexed to any server, bounded only by the throughput of the
  object store service; Kafka is usually faster)
- no requirement for SSDs or large amounts of RAM. The ability to accelerate
  performance with additional RAM or by moving indexes to SSD is nice, but we
  will be using spinning disks for primary data storage
- hundreds of millions or billions of objects, fetchable by a key we define
- optional transparent compression (for text and XML)
- typical object (file) size of 5-200 KBytes uncompressed; want to support up
  to several MBytes
- very simple internal API for GET/PUT (S3-API compatible is good)
- ability to proxy reads to public HTTP (eg, HTTP fall-back with no
  authentication), controllable at least at bucket granularity

## Infrastructure

`minio` was used initially, but did not scale well in number of files. We
currently use seaweedfs. Any S3-compatible key/value store should work in
theory. openlibrary.org has used WARCs in petabox items in the past. Actual
cloud object stores tend to be expensive for this kind of use case.

The facebook "haystack" project (and whitepaper) are good background reading,
describing one type of system that works well for this application.


## Bucket / Folder Structure

Currently we run everything off a single server, with no redundancy. There is
no QA/prod distinction.

Setting access control and doing bulk deletions is easiest at the bucket
level, less easy at the folder level, and most difficult at the suffix (file
extension) level.

For files that are derived from PDFs, we use the SHA-1 (in lower-case hex) of
the source PDF to construct keys. We generate nested "directories" from the
hash to limit the number of keys per "directory" (even though in S3/seaweedfs
there are no actual directories involved). The structure looks like:

    <bucket>/<folder>/<byte0>/<byte1>/<sha1hex><suffix>

Eg:

    sandcrawler/grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml

The nesting is sort of a hold-over from minio (where files were actually
on-disk), but seems worth keeping in case we end up switching storage systems
in the future.
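
As an illustration of the key scheme above, here is a minimal Python sketch
that builds a blob key from a PDF's SHA-1 (the bucket, folder, and suffix
names are just the examples used in this document):

    import hashlib

    def blob_key(folder: str, sha1hex: str, suffix: str) -> str:
        """Construct a nested blob-store key from a lower-case hex SHA-1."""
        sha1hex = sha1hex.lower()
        assert len(sha1hex) == 40
        return f"{folder}/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}{suffix}"

    # Eg, the key for the GROBID TEI-XML derived from some PDF bytes
    pdf_bytes = b"%PDF-1.4 example content"
    sha1hex = hashlib.sha1(pdf_bytes).hexdigest()
    print("sandcrawler/" + blob_key("grobid", sha1hex, ".tei.xml"))
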
## Existing Content

sandcrawler: internal/controlled access to PDF derivatives

    grobid: TEI-XML documents
        extension: .tei.xml

    text: raw pdftotext (or other text transform)
        extension: .txt

thumbnail: public bucket for thumbnail images

    pdf: thumbnails from PDF files
        extension: .180px.jpg

## Proxy and URLs

Internal HTTP access via:

    http://wbgrp-svc169.us.archive.org:8333/<bucket>/<key>

Public access via:

    https://blobs.fatcat.wiki/<bucket>/<key>

Eg:

    http://wbgrp-svc169.us.archive.org:8333/testing/small.txt
    http://wbgrp-svc169.us.archive.org:8333/sandcrawler/grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml
    https://blobs.fatcat.wiki/testing/small.txt
    https://blobs.fatcat.wiki/thumbnail/pdf/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.180px.jpg

diff --git a/extra/blobs/minio/README.md b/extra/blobs/minio/README.md
new file mode 100644
index 0000000..d8f1c69
--- /dev/null
+++ b/extra/blobs/minio/README.md
@@ -0,0 +1,74 @@

minio is used as an S3-compatible blob store. The initial use case is GROBID
XML documents, addressed by the SHA-1 of the PDF file the XML was extracted
from.

Note that on the backend minio is just storing objects as files on disk.

## Deploying minio Server

It seems to be important to use a version of minio from at least the December
2019 era for on-disk compression to actually work.

We currently install minio (and `mc`, the minio client) in prod by simply
downloading the binaries and invoking them from systemd.

## Buckets and Directories

Hosts and buckets:

    localhost:sandcrawler-dev
        create locally for development (see below)

    cluster:sandcrawler
        main sandcrawler storage bucket, for GROBID output and other
        derivatives. Note it isn't "sandcrawler-prod", for backwards
        compatibility reasons.

    cluster:sandcrawler-qa
        for, eg, testing on cluster servers

    cluster:unpaywall
        subset of sandcrawler content crawled due to unpaywall URLs;
        potentially made publicly accessible

Directory structure within sandcrawler buckets:

    grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
        SHA-1 (lower-case hex) of the PDF that the XML was extracted from

Create new buckets like:

    mc mb cluster/sandcrawler-qa

## Development

Run a minio server locally, with non-persisted data:

    docker run -p 9000:9000 minio/minio server /data

Credentials are `minioadmin:minioadmin`. Install the `mc` client utility, and
configure it:

    mc config host add localhost http://localhost:9000 minioadmin minioadmin

Then create the dev bucket:

    mc mb --ignore-existing localhost/sandcrawler-dev

A common "gotcha" with the `mc` command is that it will first look for a
local folder/directory with the same name as the configured remote host, so
make sure there isn't a `./localhost` folder.


## Users

Create a new readonly user like:

    mc admin user add sandcrawler unpaywall $RANDOM_SECRET_KEY readonly

Make a prefix within a bucket world-readable like:

    mc policy set download cluster/unpaywall/grobid

## Config

    mc admin config set aitio compression extensions=.txt,.log,.csv,.json,.tsv,.pdf,.xml mime_types=text/csv,text/plain,application/json,application/xml,application/octet-stream,application/tei+xml
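
As a quick sanity check against the local development setup described above,
here is a hedged sketch using the `boto3` S3 client (the endpoint,
credentials, and bucket name are the development defaults from this README;
the production workers are not necessarily written this way):

    import boto3

    # Dev-mode minio from `docker run -p 9000:9000 minio/minio server /data`
    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",
        aws_access_key_id="minioadmin",
        aws_secret_access_key="minioadmin",
    )

    # PUT then GET a small test object, using the key scheme from the blobs README
    key = "grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml"
    s3.put_object(
        Bucket="sandcrawler-dev",
        Key=key,
        Body=b"<TEI>...</TEI>",
        ContentType="application/xml",
    )
    resp = s3.get_object(Bucket="sandcrawler-dev", Key=key)
    print(resp["Body"].read())
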
diff --git a/extra/blobs/minio/minio.conf b/extra/blobs/minio/minio.conf
new file mode 100644
index 0000000..2e93f9a
--- /dev/null
+++ b/extra/blobs/minio/minio.conf
@@ -0,0 +1,14 @@

# Volume to be used for MinIO server.
MINIO_VOLUMES="/sandcrawler-minio/data"
# Use if you want to run MinIO on a custom port.
MINIO_OPTS="--address :9000"
# Access key of the server.
MINIO_ACCESS_KEY=REDACTED
# Secret key of the server.
MINIO_SECRET_KEY=REDACTED

# may need to set these manually using `mc admin config get`, edit the JSON, then `set`
MINIO_COMPRESS="on"
MINIO_COMPRESS_EXTENSIONS=".txt,.log,.csv,.json,.tar,.xml,.bin,.pdf,.tsv"
MINIO_COMPRESS_MIME_TYPES="text/*,application/json,application/xml,application/pdf,application/octet-stream"

diff --git a/extra/blobs/seaweedfs/README.md b/extra/blobs/seaweedfs/README.md
new file mode 100644
index 0000000..d19e9e0
--- /dev/null
+++ b/extra/blobs/seaweedfs/README.md
@@ -0,0 +1,9 @@

## HOWTO: Create new bucket in SeaweedFS

Log in to the seaweedfs VM.

Run `weed shell` to start a shell, then:

    bucket.create -name <bucket>

diff --git a/extra/blobs/tasks.md b/extra/blobs/tasks.md
new file mode 100644
index 0000000..beb765f
--- /dev/null
+++ b/extra/blobs/tasks.md
@@ -0,0 +1,53 @@

## Backfill GROBID XML to Blob Store

Initially ran this when spinning up the new seaweedfs server to replace minio.
At that time the grobid persist worker was in db-only mode, as minio was too
slow to accept uploads. The rough plan is to:

1. run the grobid persist worker from Kafka with a new temporary consumer
   group, from the start of the GROBID output topic
2. when it gets to the end, stop the *regular* consumer group while this one
   is still running. With the temporary worker still running, at that point in
   time the entire topic should be in S3
3. then reconfigure the regular worker to db+s3 mode. Halt the temporary
   worker, restart the regular one with the new config, and run it
   indefinitely

The consumer group isn't a command-line argument, so just edit
`persist_worker.py` and set it to `persist-grobid-seaweedfs`. Also needed to
patch it a bit so that `--s3-only` mode didn't try to connect to postgresql.

Commands:

    ./sandcrawler_worker.py --kafka-hosts wbgrp-svc350.us.archive.org:9092 --env prod --s3-bucket sandcrawler --s3-url wbgrp-svc169.us.archive.org:8333 persist-grobid --s3-only
    => Consuming from kafka topic sandcrawler-prod.grobid-output-pg, group persist-grobid-seaweed
    => run briefly, then kill

On the kafka-broker worker:

    ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --reset-offsets --to-earliest --group persist-grobid-seaweed --topic sandcrawler-prod.grobid-output-pg --dry-run

Then run 2x instances of the worker (same command as above):

    ./sandcrawler_worker.py --kafka-hosts wbgrp-svc350.us.archive.org:9092 --env prod --s3-bucket sandcrawler --s3-url wbgrp-svc169.us.archive.org:8333 persist-grobid --s3-only

At this point we are CPU-limited on this worker by the python processes (only
4 cores on this machine).

Check in weed shell:

    weed shell

    > fs.meta.cat buckets/sandcrawler/grobid/00/00/000068a76ab125389506e8834483c6ba4c73338a.tei.xml
    [...]
        "isGzipped": false
    [...]
        "mime": "application/xml",
    [...]

An open question is whether we should have separate buckets per derivative
type (eg, a GROBID XML bucket separate from a thumbnails bucket), or whether
prefix directories are enough. Basically this comes down to whether we want
things mixed together at the volume level. I think we should keep them
separate.

Do we need to set the mimetype in the upload for gzip compression to apply to
XML?
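
For reference, a rough sketch of what the temporary backfill worker described
above does, written directly against `confluent_kafka` and `boto3` (the `key`
and `tei_xml` message fields are assumptions about the topic's JSON schema,
and the S3 credentials are placeholders; the real logic lives in
`persist_worker.py` / `sandcrawler_worker.py`):

    import json

    import boto3
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "wbgrp-svc350.us.archive.org:9092",
        "group.id": "persist-grobid-seaweedfs",  # temporary consumer group
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["sandcrawler-prod.grobid-output-pg"])

    s3 = boto3.client(
        "s3",
        endpoint_url="http://wbgrp-svc169.us.archive.org:8333",
        aws_access_key_id="REDACTED",      # placeholder credentials
        aws_secret_access_key="REDACTED",
    )

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        sha1hex = record.get("key")        # assumed field: SHA-1 of source PDF
        tei_xml = record.get("tei_xml")    # assumed field: extracted TEI-XML
        if not sha1hex or not tei_xml:
            continue
        s3.put_object(
            Bucket="sandcrawler",
            Key=f"grobid/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}.tei.xml",
            Body=tei_xml.encode("utf-8"),
            # relevant to the open mimetype/gzip question above
            ContentType="application/xml",
        )
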