Diffstat (limited to 'blobs')
-rw-r--r-- | blobs/README.md | 86
-rw-r--r-- | blobs/minio/README.md | 74
-rw-r--r-- | blobs/minio/minio.conf | 14
-rw-r--r-- | blobs/seaweedfs/README.md | 9
-rw-r--r-- | blobs/tasks.md | 53
5 files changed, 0 insertions, 236 deletions
diff --git a/blobs/README.md b/blobs/README.md
deleted file mode 100644
index 555db92..0000000
--- a/blobs/README.md
+++ /dev/null
@@ -1,86 +0,0 @@
-
-This document describes sandcrawler/fatcat use of "blob store" infrastructure
-for storing hundreds of millions of small files. For example, GROBID XML
-documents, jpeg thumbnails of PDFs.
-
-The basic feature requirements for this system are:
-
-- don't need preservation data resiliency: all this data is derived from
-  primary content, and is usually redundantly stored in Kafka topics (and thus
-  can be re-indexed to any server bounded only by throughput of the object
-  store service; Kafka is usually faster)
-- don't require SSDs or large amounts of RAM. Ability to accelerate performance
-  with additional RAM or moving indexes to SSD is nice, but we will be using
-  spinning disks for primary data storage
-- hundreds of millions or billions of objects, fetchable by a key we define
-- optional transparent compression (for text and XML)
-- typical object (file) size of 5-200 KBytes uncompressed, want to support up
-  to several MBytes
-- very simple internal API for GET/PUT (S3 API compatible is good)
-- ability to proxy to HTTP publicly for reads (eg, HTTP fall-back with no
-  authentication), controllable by at least bucket granularity
-
-## Infrastructure
-
-`minio` was used initially, but did not scale well in number of files. We
-currently use seaweedfs. Any S3-compatible key/value store should work in
-theory. openlibrary.org has used WARCs in petabox items in the past. Actual
-cloud object stores tend to be expensive for this kind of use case.
-
-The facebook "haystack" project (and whitepaper) are good background reading
-describing one type of system that works well for this application.
-
-
-## Bucket / Folder Structure
-
-Currently we run everything off a single server, with no redundancy. There is
-no QA/prod distinction.
-
-Setting access control and doing bulk deletions is easiest at the bucket level,
-less easy at the folder level, most difficult at the suffix (file extension)
-level.
-
-For files that are derived from PDFs, we use the SHA-1 (in lower-case hex) of
-the source PDF to construct keys. We generate nested "directories" from the hash
-to limit the number of keys per "directory" (even though in S3/seaweedfs there
-are no actual directories involved). The structure looks like:
-
-    <bucket>/<folder>/<byte0>/<byte1>/<sha1hex><suffix>
-
-Eg:
-
-    sandcrawler/grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml
-
-The nesting is sort of a hold-over from minio (where files were actually
-on-disk), but seems worth keeping in case we end up switching storage systems
-in the future.
-
-## Existing Content
-
-sandcrawler: internal/controlled access to PDF derivatives
-    grobid: TEI-XML documents
-        extension: .tei.xml
-    text: raw pdftotext (or other text transform)
-        extension: .txt
-
-thumbnail: public bucket for thumbnail images
-    pdf: thumbnails from PDF files
-        extension: .180px.jpg
-
-## Proxy and URLs
-
-Internal HTTP access via:
-
-    http://wbgrp-svc169.us.archive.org:8333/<bucket>/<key>
-
-Public access via:
-
-    https://blobs.fatcat.wiki/<bucket>/<key>
-
-Eg:
-
-    http://wbgrp-svc169.us.archive.org:8333/testing/small.txt
-    http://wbgrp-svc169.us.archive.org:8333/sandcrawler/grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml
-    https://blobs.fatcat.wiki/testing/small.txt
-    https://blobs.fatcat.wiki/thumbnail/pdf/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.180px.jpg
-
diff --git a/blobs/minio/README.md b/blobs/minio/README.md
deleted file mode 100644
index d8f1c69..0000000
--- a/blobs/minio/README.md
+++ /dev/null
@@ -1,74 +0,0 @@
-
-minio is used as an S3-compatible blob store. Initial use case is GROBID XML
-documents, addressed by the sha1 of the PDF file the XML was extracted from.
-
-Note that on the backend minio is just storing objects as files on disk.
-
-## Deploying minio Server
-
-It seems to be important to use a version of minio from at least December 2019
-era for on-disk compression to actually work.
-
-Currently install minio (and mc, the minio client) in prod by simply
-downloading the binaries and calling from systemd.
-
-## Buckets and Directories
-
-Hosts and buckets:
-
-    localhost:sandcrawler-dev
-        create locally for development (see below)
-
-    cluster:sandcrawler
-        main sandcrawler storage bucket, for GROBID output and other derivatives.
-        Note it isn't "sandcrawler-prod", for backwards compatibility reasons.
-
-    cluster:sandcrawler-qa
-        for, eg, testing on cluster servers
-
-    cluster:unpaywall
-        subset of sandcrawler content crawled due to unpaywall URLs;
-        potentially made publicly accessible
-
-Directory structure within sandcrawler buckets:
-
-    grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
-        SHA1 (lower-case hex) of PDF that XML was extracted from
-
-Create new buckets like:
-
-    mc mb cluster/sandcrawler-qa
-
-## Development
-
-Run minio server locally, with non-persisted data:
-
-    docker run -p 9000:9000 minio/minio server /data
-
-Credentials are `minioadmin:minioadmin`. Install `mc` client utility, and
-configure:
-
-    mc config host add localhost http://localhost:9000 minioadmin minioadmin
-
-Then create dev bucket:
-
-    mc mb --ignore-existing localhost/sandcrawler-dev
-
-A common "gotcha" with `mc` command is that it will first look for a local
-folder/directory with same name as the configured remote host, so make sure
-there isn't a `./localhost` folder.
-
-
-## Users
-
-Create a new readonly user like:
-
-    mc admin user add sandcrawler unpaywall $RANDOM_SECRET_KEY readonly
-
-Make a prefix within a bucket world-readable like:
-
-    mc policy set download cluster/unpaywall/grobid
-
-## Config
-
-    mc admin config set aitio compression extensions=.txt,.log,.csv,.json,.tsv,.pdf,.xml mime_types=text/csv,text/plain,application/json,application/xml,application/octet-stream,application/tei+xml
diff --git a/blobs/minio/minio.conf b/blobs/minio/minio.conf
deleted file mode 100644
index 2e93f9a..0000000
--- a/blobs/minio/minio.conf
+++ /dev/null
@@ -1,14 +0,0 @@
-
-# Volume to be used for MinIO server.
-MINIO_VOLUMES="/sandcrawler-minio/data"
-# Use if you want to run MinIO on a custom port.
-MINIO_OPTS="--address :9000"
-# Access Key of the server.
-MINIO_ACCESS_KEY=REDACTED
-# Secret key of the server.
-MINIO_SECRET_KEY=REDACTED
-
-# may need to set these manually using `mc admin config get`, edit the JSON, then `set`
-MINIO_COMPRESS="on"
-MINIO_COMPRESS_EXTENSIONS=".txt,.log,.csv,.json,.tar,.xml,.bin,.pdf,.tsv"
-MINIO_COMPRESS_MIME_TYPES="text/*,application/json,application/xml,application/pdf,application/octet-stream"
diff --git a/blobs/seaweedfs/README.md b/blobs/seaweedfs/README.md
deleted file mode 100644
index d19e9e0..0000000
--- a/blobs/seaweedfs/README.md
+++ /dev/null
@@ -1,9 +0,0 @@
-
-## HOWTO: Create new bucket in SeaweedFS
-
-Log in to the seaweedfs VM.
-
-Run `weed shell` to start a shell, then:
-
-    bucket.create -name <bucket>
-
diff --git a/blobs/tasks.md b/blobs/tasks.md
deleted file mode 100644
index beb765f..0000000
--- a/blobs/tasks.md
+++ /dev/null
@@ -1,53 +0,0 @@
-
-## Backfill GROBID XML to Blob Store
-
-Initially ran this when spinning up new seaweedfs server to replace minio. At
-this time grobid persist worker was in db-only mode, as minio was too slow to
-accept uploads. Rough plan is to:
-
-1. run grobid persist worker from Kafka with a new temporary consumer group,
-   from the start of the GROBID output topic
-2. when it gets to end, stop the *regular* consumer group while this one is
-   still running. with temporary worker still running, at that point in time
-   entire topic should be in S3
-3. then reconfigure regular worker to db+s3 mode. halt the temporary worker,
-   restart the regular one with new config, run it indefinitely
-
-Consumer group isn't an arg, so just edit `persist_worker.py` and set it to
-`persist-grobid-seaweedfs`. Also needed to patch a bit so `--s3-only` mode
-didn't try to connect to postgresql.
-
-Commands:
-
-    ./sandcrawler_worker.py --kafka-hosts wbgrp-svc350.us.archive.org:9092 --env prod --s3-bucket sandcrawler --s3-url wbgrp-svc169.us.archive.org:8333 persist-grobid --s3-only
-    => Consuming from kafka topic sandcrawler-prod.grobid-output-pg, group persist-grobid-seaweed
-    => run briefly, then kill
-
-On kafka-broker worker:
-
-    ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --reset-offsets --to-earliest --group persist-grobid-seaweed --topic sandcrawler-prod.grobid-output-pg --dry-run
-
-Then run 2x instances of worker (same command as above):
-
-    ./sandcrawler_worker.py --kafka-hosts wbgrp-svc350.us.archive.org:9092 --env prod --s3-bucket sandcrawler --s3-url wbgrp-svc169.us.archive.org:8333 persist-grobid --s3-only
-
-At this point CPU-limited on this worker by the python processes (only 4 cores
-on this machine).
-
-Check in weed shell:
-
-    weed shell
-
-    > > fs.meta.cat buckets/sandcrawler/grobid/00/00/000068a76ab125389506e8834483c6ba4c73338a.tei.xml
-    [...]
-    "isGzipped": false
-    [...]
-    "mime": "application/xml",
-    [...]
-
-An open question is if we should have separate buckets per derive type. Eg, a
-GROBID XML bucket separate from thumbnails bucket. Or are prefix directories
-enough. Basically this comes down to whether we want things mixed together at
-the volume level. I think we should keep separate.
-
-Need to set the mimetype in the upload for gzip on XML?
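
For reference, the key layout described in the removed blobs/README.md (`<bucket>/<folder>/<byte0>/<byte1>/<sha1hex><suffix>`) can be sketched in Python roughly as below. This is a minimal illustration under stated assumptions, not code from the repository: the helper names `blob_key` and `public_url` are hypothetical, and `public_url` relies only on the `https://blobs.fatcat.wiki/<bucket>/<key>` pattern quoted in that README.

```python
import re

# Public proxy prefix, as quoted in the removed README.
PUBLIC_BASE = "https://blobs.fatcat.wiki"

# A SHA-1 digest in lower-case hex is exactly 40 characters.
_SHA1_HEX = re.compile(r"[0-9a-f]{40}")


def blob_key(folder: str, sha1hex: str, suffix: str) -> str:
    """Hypothetical helper: build '<folder>/<byte0>/<byte1>/<sha1hex><suffix>'.

    The two nested "directories" are the first two bytes of the hash,
    each written as two hex characters.
    """
    sha1hex = sha1hex.lower()
    if not _SHA1_HEX.fullmatch(sha1hex):
        raise ValueError(f"not a SHA-1 hex digest: {sha1hex!r}")
    return f"{folder}/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}{suffix}"


def public_url(bucket: str, key: str) -> str:
    """Hypothetical helper: join bucket and key under the public proxy prefix."""
    return f"{PUBLIC_BASE}/{bucket}/{key}"


if __name__ == "__main__":
    sha1 = "1a6462a925a9767b797fe6085093b6aa9f27f523"
    print(blob_key("grobid", sha1, ".tei.xml"))
    print(public_url("thumbnail", blob_key("pdf", sha1, ".180px.jpg")))
```

Validating the digest before building the key is a design choice of this sketch; the README itself only specifies that the SHA-1 is written in lower-case hex.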