From f3a721a9dce8801b78f7bc31e88dc912b0ec1dba Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@archive.org>
Date: Fri, 23 Dec 2022 15:52:02 -0800
Subject: move a bunch of top-level files/directories to ./extra/

---
 blobs/README.md | 86 ---------------------------------------------------------
 1 file changed, 86 deletions(-)
 delete mode 100644 blobs/README.md

(limited to 'blobs/README.md')
diff --git a/blobs/README.md b/blobs/README.md
deleted file mode 100644
index 555db92..0000000
--- a/blobs/README.md
+++ /dev/null
@@ -1,86 +0,0 @@
-
-This document describes sandcrawler/fatcat use of "blob store" infrastructure
-for storing hundreds of millions of small files. For example, GROBID XML
-documents, jpeg thumbnails of PDFs.
-
-The basic feature requirements for this system are:
-
-- don't need preservation data resiliency: all this data is derived from
-  primary content, and is usually redundantly stored in Kafka topics (and thus
-  can be re-indexed to any server bounded only by throughput of the object
-  store service; Kafka is usually faster)
-- don't require SSDs or large amounts of RAM. Ability to accelerate performance
-  with additional RAM or moving indexes to SSD is nice, but we will be using
-  spinning disks for primary data storage
-- hundreds of millions or billions of objects, fetchable by a key we define
-- optional transparent compression (for text and XML)
-- typical object (file) size of 5-200 KBytes uncompressed, want to support up
-  to several MBytes
-- very simple internal API for GET/PUT (S3 API compatible is good)
-- ability to proxy to HTTP publicly for reads (eg, HTTP fall-back with no
-  authenticaiton), controllable by at least bucket granularity
-
-## Infrastructure
-
-`minio` was used initially, but did not scale well in number of files. We
-currently use seaweedfs. Any S3-compatible key/value store should work in
-theory. openlibrary.org has used WARCs in petabox items in the past. Actual
-cloud object stores tend to be expensive for this kind of use case.
-
-The facebook "haystack" project (and whitepaper) are good background reading
-describing one type of system that works well for this application.
-
-
-## Bucket / Folder Structure
-
-Currently we run everything off a single server, with no redundancy. There is
-no QA/prod distinction.
-
-Setting access control and doing bulk deletions is easiest at the bucket level,
-less easy at the folder level, most difficult at the suffix (file extention)
-level.
-
-For files that are derived from PDFs, we use the SHA-1 (in lower-case hex) of
-the source PDF to contruct keys. We generate nested "directories" from the hash
-to limit the number of keys per "directory" (even though in S3/seaweedfs there
-are no actual directories involved). The structure looks like:
-
-    <bucket>/<folder>/<byte0>/<byte1>/<sha1hex><suffix>
-
-Eg:
-
-    sandcrawler/grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml
-
-The nesting is sort of a hold-over from minio (where files were actually
-on-disk), but seems worth keeping in case we end up switching storage systems
-in the future.
-
-## Existing Content
-
-sandcrawler: internal/controlled access to PDF derivatives
-    grobid: TEI-XML documents
-        extension: .tei.xml
-    text: raw pdftotext (or other text transform)
-        extension: .txt
-
-thumbnail: public bucket for thumbnail images
-    pdf: thumbnails from PDF files
-        extension: .180px.jpg
-
-## Proxy and URLs
-
-Internal HTTP access via:
-
-    http://wbgrp-svc169.us.archive.org:8333/<bucket>/<key>
-
-Public access via:
-
-    https://blobs.fatcat.wiki/<bucket>/<key>
-
-Eg:
-
-    http://wbgrp-svc169.us.archive.org:8333/testing/small.txt
-    http://wbgrp-svc169.us.archive.org:8333/sandcrawler/grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml
-    https://blobs.fatcat.wiki/testing/small.txt
-    https://blobs.fatcat.wiki/thumbnail/pdf/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.180px.jpg
-
-- 
cgit v1.2.3