This document describes sandcrawler/fatcat use of "blob store" infrastructure
for storing hundreds of millions of small files: for example, GROBID XML
documents and JPEG thumbnails of PDFs.

The basic feature requirements for this system are:
- don't need preservation-grade data resiliency: all this data is derived from
primary content, and is usually redundantly stored in Kafka topics (and thus
can be re-indexed onto any server, bounded only by the throughput of the object
store service; Kafka is usually faster)
- don't require SSDs or large amounts of RAM. The ability to accelerate
performance with additional RAM or by moving indexes to SSD is nice, but we
will be using spinning disks for primary data storage
- hundreds of millions or billions of objects, fetchable by a key we define
- optional transparent compression (for text and XML)
- typical object (file) size of 5-200 KBytes uncompressed, want to support up
to several MBytes
- very simple internal API for GET/PUT (S3 API compatible is good; see the
sketch after this list)
- ability to proxy to HTTP publicly for reads (eg, HTTP fall-back with no
authentication), controllable at bucket granularity at minimum
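
To make the simple GET/PUT requirement concrete, here is a minimal sketch using
the boto3 S3 client against a generic S3-compatible endpoint. The endpoint URL,
credentials, and object bytes below are placeholders, not our actual
configuration:

    import boto3

    # placeholder endpoint and credentials; any S3-compatible store should accept these calls
    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:8333",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    key = "grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml"
    tei_xml = b"<TEI>...</TEI>"  # placeholder document body

    # PUT a small derived object, then GET it back
    s3.put_object(Bucket="sandcrawler", Key=key, Body=tei_xml)
    resp = s3.get_object(Bucket="sandcrawler", Key=key)
    assert resp["Body"].read() == tei_xml
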
## Infrastructure
`minio` was used initially, but did not scale well to large numbers of files. We
currently use seaweedfs. Any S3-compatible key/value store should work in
theory. openlibrary.org has used WARCs in petabox items in the past. Public
cloud object stores tend to be expensive for this kind of use case.

The Facebook "Haystack" project (and whitepaper) is good background reading,
describing one type of system that works well for this application.

## Bucket / Folder Structure
Currently we run everything off a single server, with no redundancy. There is
no QA/prod distinction.

Setting access control and doing bulk deletions is easiest at the bucket level,
less easy at the folder level, and most difficult at the suffix (file
extension) level.

For files that are derived from PDFs, we use the SHA-1 (in lower-case hex) of
the source PDF to construct keys. We generate nested "directories" from the
hash to limit the number of keys per "directory" (even though in S3/seaweedfs
there are no actual directories involved). The structure looks like:

    <bucket>/<folder>/<byte0>/<byte1>/<sha1hex><suffix>

Eg:

    sandcrawler/grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml

The nesting is sort of a hold-over from minio (where files were actually
on-disk), but seems worth keeping in case we end up switching storage systems
in the future.
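
As an illustration, a small Python sketch of how such keys can be generated;
the `blob_key()` helper and `paper.pdf` filename are hypothetical, not an
actual sandcrawler API:

    import hashlib

    def blob_key(folder: str, sha1hex: str, suffix: str) -> str:
        """Build a nested key (within a bucket) from a lower-case SHA-1 hex digest."""
        sha1hex = sha1hex.lower()
        return f"{folder}/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}{suffix}"

    # example: key for the GROBID TEI-XML derived from some local PDF
    with open("paper.pdf", "rb") as f:
        sha1hex = hashlib.sha1(f.read()).hexdigest()
    print(blob_key("grobid", sha1hex, ".tei.xml"))
    # eg: grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml
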
## Existing Content
- sandcrawler: internal/controlled access to PDF derivatives
    - grobid: TEI-XML documents
        - extension: .tei.xml
    - text: raw pdftotext (or other text transform)
        - extension: .txt
- thumbnail: public bucket for thumbnail images
    - pdf: thumbnails from PDF files
        - extension: .180px.jpg

## Proxy and URLs
Internal HTTP access via:

    http://wbgrp-svc169.us.archive.org:8333/<bucket>/<key>

Public access via:

    https://blobs.fatcat.wiki/<bucket>/<key>

Eg:

    http://wbgrp-svc169.us.archive.org:8333/testing/small.txt
    http://wbgrp-svc169.us.archive.org:8333/sandcrawler/grobid/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.tei.xml
    https://blobs.fatcat.wiki/testing/small.txt
    https://blobs.fatcat.wiki/thumbnail/pdf/1a/64/1a6462a925a9767b797fe6085093b6aa9f27f523.180px.jpg
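
As an example of read-only public access, fetching a thumbnail through the
proxy could look like the following sketch (using the `requests` library; the
SHA-1 is the example hash from above, and the output filename is arbitrary):

    import requests

    sha1hex = "1a6462a925a9767b797fe6085093b6aa9f27f523"
    url = (
        "https://blobs.fatcat.wiki/thumbnail/pdf/"
        f"{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}.180px.jpg"
    )

    resp = requests.get(url)
    resp.raise_for_status()
    with open("thumbnail.180px.jpg", "wb") as f:
        f.write(resp.content)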