author    Bryan Newbold <bnewbold@archive.org> 2022-12-23 15:52:02 -0800
committer Bryan Newbold <bnewbold@archive.org> 2022-12-23 15:52:02 -0800
commit    f3a721a9dce8801b78f7bc31e88dc912b0ec1dba (patch)
tree      fdae9373e78671d0031f83045e6c76de9ad616cf /blobs
parent    8c2c354a74064f2d66644af8f4e44d74bf322e1f (diff)
move a bunch of top-level files/directories to ./extra/
Diffstat (limited to 'blobs')
-rw-r--r--  blobs/README.md            86
-rw-r--r--  blobs/minio/README.md      74
-rw-r--r--  blobs/minio/minio.conf     14
-rw-r--r--  blobs/seaweedfs/README.md   9
-rw-r--r--  blobs/tasks.md             53
5 files changed, 0 insertions, 236 deletions
diff --git a/blobs/README.md b/blobs/README.md
deleted file mode 100644
index 555db92..0000000
--- a/blobs/README.md
+++ /dev/null
@@ -1,86 +0,0 @@
-
-This document describes sandcrawler/fatcat use of "blob store" infrastructure
-for storing hundreds of millions of small files: for example, GROBID XML
-documents and JPEG thumbnails of PDFs.
-
-The basic feature requirements for this system are:
-
-- don't need preservation-grade data resiliency: all this data is derived from
-  primary content, and is usually stored redundantly in Kafka topics (and thus
-  can be re-indexed onto any server, bounded only by the throughput of the
-  object store service; Kafka is usually faster)
-- don't require SSDs or large amounts of RAM. Ability to accelerate performance
- with additional RAM or moving indexes to SSD is nice, but we will be using
- spinning disks for primary data storage
-- hundreds of millions or billions of objects, fetchable by a key we define
-- optional transparent compression (for text and XML)
-- typical object (file) size of 5-200 KBytes uncompressed, want to support up
- to several MBytes
-- very simple internal API for GET/PUT (S3 API compatible is good)
-- ability to proxy to HTTP publicly for reads (eg, HTTP fall-back with no
-  authentication), controllable at least at bucket granularity
-
-## Infrastructure
-
-`minio` was used initially, but did not scale well to large numbers of files.
-We currently use seaweedfs. Any S3-compatible key/value store should work in
-theory. openlibrary.org has used WARCs in petabox items in the past. Commercial
-cloud object stores tend to be expensive for this kind of use case.
-
-The Facebook "Haystack" project (and whitepaper) is good background reading,
-describing one type of system that works well for this application.
-
-
-## Bucket / Folder Structure
-
-Currently we run everything off a single server, with no redundancy. There is
-no QA/prod distinction.
-
-Setting access control and doing bulk deletions is easiest at the bucket level,
-less easy at the folder level, and most difficult at the suffix (file extension)
-level.
-
-For files that are derived from PDFs, we use the SHA-1 (in lower-case hex) of
-the source PDF to construct keys. We generate nested "directories" from the hash
-to limit the number of keys per "directory" (even though in S3/seaweedfs there
-are no actual directories involved). The structure looks like:
-
- <bucket>/<folder>/<byte0>/<byte1>/<sha1hex><suffix>
-
-Eg:
-
- sandcrawler/grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml
-
-The nesting is sort of a hold-over from minio (where files were actually
-on-disk), but seems worth keeping in case we end up switching storage systems
-in the future.
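-
-As a concrete illustration of the key construction, a small Python sketch (the
-helper name `blob_key` is made up for this example; the actual sandcrawler code
-may build keys differently):
-
-    def blob_key(folder: str, sha1hex: str, suffix: str) -> str:
-        # nest by the first two byte-pairs of the lower-case hex SHA-1
-        return f"{folder}/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}{suffix}"
-
-    # => "grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml"
-    print(blob_key("grobid", "1a6462a925a9767b797fe6085093b6aa9f27f523", ".tei.xml"))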
-
-## Existing Content
-
-sandcrawler: internal/controlled access to PDF derivatives
- grobid: TEI-XML documents
- extension: .tei.xml
- text: raw pdftotext (or other text transform)
- extension: .txt
-
-thumbnail: public bucket for thumbnail images
- pdf: thumbnails from PDF files
- extension: .180px.jpg
-
-## Proxy and URLs
-
-Internal HTTP access via:
-
- http://wbgrp-svc169.us.archive.org:8333/<bucket>/<key>
-
-Public access via:
-
- https://blobs.fatcat.wiki/<bucket>/<key>
-
-Eg:
-
- http://wbgrp-svc169.us.archive.org:8333/testing/small.txt
- http://wbgrp-svc169.us.archive.org:8333/sandcrawler/grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml
- https://blobs.fatcat.wiki/testing/small.txt
- https://blobs.fatcat.wiki/thumbnail/pdf/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.180px.jpg
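-
-For example, fetching a public blob from Python (a minimal sketch using the
-`requests` library; not part of the actual tooling):
-
-    import requests
-
-    # anonymous read through the public proxy; no authentication required
-    resp = requests.get("https://blobs.fatcat.wiki/testing/small.txt")
-    resp.raise_for_status()
-    print(resp.text)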
-
diff --git a/blobs/minio/README.md b/blobs/minio/README.md
deleted file mode 100644
index d8f1c69..0000000
--- a/blobs/minio/README.md
+++ /dev/null
@@ -1,74 +0,0 @@
-
-minio is used as an S3-compatible blob store. Initial use case is GROBID XML
-documents, addressed by the sha1 of the PDF file the XML was extracted from.
-
-Note that on the backend minio is just storing objects as files on disk.
-
-## Deploying minio Server
-
-It seems to be important to use a minio release from December 2019 or later
-for on-disk compression to actually work.
-
-Currently we install minio (and mc, the minio client) in prod by simply
-downloading the binaries and running them from systemd.
-
-## Buckets and Directories
-
-Hosts and buckets:
-
- localhost:sandcrawler-dev
- create locally for development (see below)
-
- cluster:sandcrawler
- main sandcrawler storage bucket, for GROBID output and other derivatives.
- Note it isn't "sandcrawler-prod", for backwards compatibility reasons.
-
- cluster:sandcrawler-qa
- for, eg, testing on cluster servers
-
- cluster:unpaywall
- subset of sandcrawler content crawled due to unpaywall URLs;
- potentially made publicly accessible
-
-Directory structure within sandcrawler buckets:
-
- grobid/2c/0d/2c0daa9307887a27054d4d1f137514b0fa6c6b2d.tei.xml
- SHA1 (lower-case hex) of PDF that XML was extracted from
-
-Create new buckets like:
-
- mc mb cluster/sandcrawler-qa
-
-## Development
-
-Run minio server locally, with non-persisted data:
-
- docker run -p 9000:9000 minio/minio server /data
-
-Credentials are `minioadmin:minioadmin`. Install `mc` client utility, and
-configure:
-
- mc config host add localhost http://localhost:9000 minioadmin minioadmin
-
-Then create dev bucket:
-
- mc mb --ignore-existing localhost/sandcrawler-dev
-
-A common "gotcha" with the `mc` command is that it will first look for a local
-folder/directory with the same name as the configured remote host alias, so
-make sure there isn't a `./localhost` folder.
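-
-For quick testing against this local minio from Python, a minimal sketch using
-`boto3` (illustration only; the actual sandcrawler workers have their own
-S3/minio client setup):
-
-    import boto3
-
-    s3 = boto3.client(
-        "s3",
-        endpoint_url="http://localhost:9000",
-        aws_access_key_id="minioadmin",
-        aws_secret_access_key="minioadmin",
-    )
-    s3.put_object(Bucket="sandcrawler-dev", Key="testing/small.txt", Body=b"hello blob store")
-    obj = s3.get_object(Bucket="sandcrawler-dev", Key="testing/small.txt")
-    print(obj["Body"].read())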
-
-
-## Users
-
-Create a new readonly user like:
-
- mc admin user add sandcrawler unpaywall $RANDOM_SECRET_KEY readonly
-
-Make a prefix within a bucket world-readable like:
-
- mc policy set download cluster/unpaywall/grobid
-
-## Config
-
- mc admin config set aitio compression extensions=.txt,.log,.csv,.json,.tsv,.pdf,.xml mime_types=text/csv,text/plain,application/json,application/xml,application/octet-stream,application/tei+xml
diff --git a/blobs/minio/minio.conf b/blobs/minio/minio.conf
deleted file mode 100644
index 2e93f9a..0000000
--- a/blobs/minio/minio.conf
+++ /dev/null
@@ -1,14 +0,0 @@
-
-# Volume to be used for MinIO server.
-MINIO_VOLUMES="/sandcrawler-minio/data"
-# Use if you want to run MinIO on a custom port.
-MINIO_OPTS="--address :9000"
-# Access Key of the server.
-MINIO_ACCESS_KEY=REDACTED
-# Secret key of the server.
-MINIO_SECRET_KEY=REDACTED
-
-# may need to set these manually using `mc admin config get`, edit the JSON, then `set`
-MINIO_COMPRESS="on"
-MINIO_COMPRESS_EXTENSIONS=".txt,.log,.csv,.json,.tar,.xml,.bin,.pdf,.tsv"
-MINIO_COMPRESS_MIME_TYPES="text/*,application/json,application/xml,application/pdf,application/octet-stream"
diff --git a/blobs/seaweedfs/README.md b/blobs/seaweedfs/README.md
deleted file mode 100644
index d19e9e0..0000000
--- a/blobs/seaweedfs/README.md
+++ /dev/null
@@ -1,9 +0,0 @@
-
-## HOWTO: Create new bucket in SeaweedFS
-
-Log in to the seaweedfs VM.
-
-Run `weed shell` to start a shell, then:
-
- bucket.create -name <bucket>
-
diff --git a/blobs/tasks.md b/blobs/tasks.md
deleted file mode 100644
index beb765f..0000000
--- a/blobs/tasks.md
+++ /dev/null
@@ -1,53 +0,0 @@
-
-## Backfill GROBID XML to Blob Store
-
-Initially ran this when spinning up the new seaweedfs server to replace minio.
-At that time the grobid persist worker was in db-only mode, as minio was too
-slow to accept uploads. The rough plan was to:
-
-1. run grobid persist worker from Kafka with a new temporary consumer group,
- from the start of the GROBID output topic
-2. when it gets to the end, stop the *regular* consumer group while this one is
-   still running. With the temporary worker still running, at that point in
-   time the entire topic should be in S3
-3. then reconfigure the regular worker to db+s3 mode. Halt the temporary
-   worker, restart the regular one with the new config, and run it indefinitely
-
-The consumer group isn't a command-line arg, so just edit `persist_worker.py`
-and set it to `persist-grobid-seaweedfs`. Also needed to patch it a bit so
-`--s3-only` mode didn't try to connect to postgresql.
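-
-For illustration, the temporary-consumer-group trick boils down to something
-like the following (a `confluent_kafka` sketch, not the actual
-`persist_worker.py` code, which has its own Kafka client and persist logic):
-
-    from confluent_kafka import Consumer
-
-    consumer = Consumer({
-        "bootstrap.servers": "wbgrp-svc350.us.archive.org:9092",
-        # fresh group id means no committed offsets; start from the beginning
-        "group.id": "persist-grobid-seaweedfs",
-        "auto.offset.reset": "earliest",
-    })
-    consumer.subscribe(["sandcrawler-prod.grobid-output-pg"])
-    while True:
-        msg = consumer.poll(1.0)
-        if msg is None or msg.error():
-            continue
-        # here: persist msg.value() (GROBID TEI-XML) to the seaweedfs S3 bucket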
-
-Commands:
-
- ./sandcrawler_worker.py --kafka-hosts wbgrp-svc350.us.archive.org:9092 --env prod --s3-bucket sandcrawler --s3-url wbgrp-svc169.us.archive.org:8333 persist-grobid --s3-only
- => Consuming from kafka topic sandcrawler-prod.grobid-output-pg, group persist-grobid-seaweed
- => run briefly, then kill
-
-On kafka-broker worker:
-
- ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --reset-offsets --to-earliest --group persist-grobid-seaweed --topic sandcrawler-prod.grobid-output-pg --dry-run
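-
-(Once the dry-run output looks right, re-run the same command with `--execute`
-in place of `--dry-run` to actually apply the offset reset.)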
-
-Then run 2x instances of worker (same command as above):
-
- ./sandcrawler_worker.py --kafka-hosts wbgrp-svc350.us.archive.org:9092 --env prod --s3-bucket sandcrawler --s3-url wbgrp-svc169.us.archive.org:8333 persist-grobid --s3-only
-
-At this point we are CPU-limited on this worker by the python processes (only
-4 cores on this machine).
-
-Check in weed shell:
-
- weed shell
-
- > > fs.meta.cat buckets/sandcrawler/grobid/00/00/000068a76ab125389506e8834483c6ba4c73338a.tei.xml
- [...]
- "isGzipped": false
- [...]
- "mime": "application/xml",
- [...]
-
-An open question is whether we should have separate buckets per derivative
-type (eg, a GROBID XML bucket separate from the thumbnails bucket), or whether
-prefix directories are enough. Basically this comes down to whether we want
-things mixed together at the volume level. I think we should keep them
-separate.
-
-Need to set the mimetype in the upload for gzip on XML?
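-
-If so, it would presumably look something like this on the upload side (a
-boto3 sketch, purely illustrative; the persist workers use their own S3 client
-wrapper, and credentials would come from the environment):
-
-    import boto3
-
-    s3 = boto3.client("s3", endpoint_url="http://wbgrp-svc169.us.archive.org:8333")
-    with open("000068a76ab125389506e8834483c6ba4c73338a.tei.xml", "rb") as f:
-        s3.put_object(
-            Bucket="sandcrawler",
-            Key="grobid/00/00/000068a76ab125389506e8834483c6ba4c73338a.tei.xml",
-            Body=f.read(),
-            # hypothesis: an explicit XML content-type may be what triggers gzip
-            ContentType="application/xml",
-        )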