1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
|
status: work-in-progress
New PDF derivatives: thumbnails, metadata, raw text
===================================================
To support scholar.archive.org (fulltext search) and other downstream uses of
fatcat, want to extract from many PDFs:
- pdf structured metadata
- thumbnail images
- raw extracted text
A single worker should extract all of these fields, and publish in to two kafka
streams. Separate persist workers consume from the streams and push in to SQL
and/or seaweedfs.
Additionally, this extraction should happen automatically for newly-crawled
PDFs as part of the ingest pipeline. When possible, checks should be run
against the existing SQL table to avoid duplication of processing.
## PDF Metadata and Text
Kafka topic (name: `sandcrawler-ENV.pdf-text`; 12x partitions; gzip
compression) JSON schema:
sha1hex (string; used as key)
status (string)
text (string)
page0_thumbnail (boolean)
meta_xml (string)
pdf_info (object)
pdf_extra (object)
word_count
file_meta (object)
source (object)
For the SQL table we should have columns for metadata fields that are *always*
saved, and put a subset of other interesting fields in a JSON blob. We don't
need all metadata fields in SQL. Full metadata/info will always be available in
Kafka, and we don't want SQL table size to explode. Schema:
CREATE TABLE IF NOT EXISTS pdf_meta (
sha1hex TEXT PRIMARY KEY CHECK (octet_length(sha1hex) = 40),
updated TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
status TEXT CHECK (octet_length(status) >= 1) NOT NULL,
has_page0_thumbnail BOOLEAN NOT NULL,
page_count INT CHECK (page_count >= 0),
word_count INT CHECK (word_count >= 0),
page0_height REAL CHECK (page0_height >= 0),
page0_width REAL CHECK (page0_width >= 0),
permanent_id TEXT CHECK (octet_length(permanent_id) >= 1),
pdf_created TIMESTAMP WITH TIME ZONE,
pdf_version TEXT CHECK (octet_length(pdf_version) >= 1),
metadata JSONB
-- maybe some analysis of available fields?
-- metadata JSON fields:
-- title
-- subject
-- author
-- creator
-- producer
-- CrossMarkDomains
-- doi
-- form
-- encrypted
);
## Thumbnail Images
Kafka Schema is raw image bytes as message body; sha1sum of PDF as the key. No
compression, 12x partitions.
Kafka topic name is `sandcrawler-ENV.pdf-thumbnail-SIZE-TYPE` (eg,
`sandcrawler-qa.pdf-thumbnail-180px-jpg`). Thus, topic name contains the
"metadata" of thumbail size/shape.
Have decided to use JPEG thumbnails, 180px wide (and max 300px high, though
width restriction is almost always the limiting factor). This size matches that
used on archive.org, and is slightly larger than the thumbnails currently used
on scholar.archive.org prototype. We intend to tweak the scholar.archive.org
CSS to use the full/raw thumbnail image at max desktop size. At this size it
would be difficult (though maybe not impossible?) to extract text (other than
large-font titles).
### Implementation
We use the `poppler` CPP library (wrapper for python) to extract and convert everything.
Some example usage of the `python-poppler` library:
import poppler
from PIL import Image
pdf = poppler.load_from_file("/home/bnewbold/10.1038@s41551-020-0534-9.pdf")
pdf.pdf_id
page = pdf.create_page(0)
page.page_rect().width
renderer = poppler.PageRenderer()
full_page = renderer.render_page(page)
img = Image.frombuffer("RGBA", (full_page.width, full_page.height), full_page.data, 'raw', "RGBA")
img.thumbnail((180,300), Image.BICUBIC)
img.save("something.jpg")
## Deployment and Infrastructure
Deployment will involve:
- sandcrawler DB SQL table
=> guesstimate size 100 GByte for hundreds of PDFs
- postgrest/SQL access to new table for internal HTTP API hits
- seaweedfs raw text folder
=> reuse existing bucket with GROBID XML; same access restrictions on content
- seaweedfs thumbnail bucket
=> new bucket for this world-public content
- public nginx access to seaweed thumbnail bucket
- extraction work queue kafka topic
=> same schema/semantics as ungrobided
- text/metadata kafka topic
- thumbnail kafka topic
- text/metadata persist worker(s)
=> from kafka; metadata to SQL database; text to seaweedfs blob store
- thumbnail persist worker
=> from kafka to seaweedfs blob store
- pdf extraction worker pool
=> very similar to GROBID worker pool
- ansible roles for all of the above
Plan for processing/catchup is:
- test with COVID-19 PDF corpus
- run extraction on all current fatcat files available via IA
- integrate with ingest pipeline for all new files
- run a batch catchup job over all GROBID-parsed files with no pdf meta
extracted, on basis of SQL table query
## Appendix: Thumbnail Size and Format Experimentation
Using 190 PDFs from `/data/pdfs/random_crawl/files` on my laptop to test.
TODO: actually, 4x images failed to convert with pdftocairo; this throws off
"mean" sizes by a small amount.
time ls | parallel -j1 pdftocairo -singlefile -scale-to 200 -png {} /tmp/test-png/{}.png
real 0m29.314s
user 0m26.794s
sys 0m2.484s
=> missing: 4
=> min: 0.8k
=> max: 57K
=> mean: 16.4K
=> total: 3120K
time ls | parallel -j1 pdftocairo -singlefile -scale-to 200 -jpeg {} /tmp/test-jpeg/{}.jpg
real 0m26.289s
user 0m24.022s
sys 0m2.490s
=> missing: 4
=> min: 1.2K
=> max: 13K
=> mean: 8.02k
=> total: 1524K
time ls | parallel -j1 pdftocairo -singlefile -scale-to 200 -jpeg -jpegopt optimize=y,quality=80 {} /tmp/test-jpeg2/{}.jpg
real 0m27.401s
user 0m24.941s
sys 0m2.519s
=> missing: 4
=> min: 577
=> max: 14K
=> mean:
=> total: 1540K
time ls | parallel -j1 convert -resize 200x200 {}[0] /tmp/magick-png/{}.png
=> missing: 4
real 1m19.399s
user 1m17.150s
sys 0m6.322s
=> min: 1.1K
=> max: 325K
=> mean:
=> total: 8476K
time ls | parallel -j1 convert -resize 200x200 {}[0] /tmp/magick-jpeg/{}.jpg
real 1m21.766s
user 1m17.040s
sys 0m7.155s
=> total: 3484K
NOTE: the following `pdf_thumbnail.py` images are somewhat smaller than the above
jpg and pngs (max 180px wide, not 200px wide)
time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-png/{}.png
real 0m48.198s
user 0m42.997s
sys 0m4.509s
=> missing: 2; 2x additional stub images
=> total: 5904K
time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg/{}.jpg
real 0m45.252s
user 0m41.232s
sys 0m4.273s
=> min: 1.4K
=> max: 16K
=> mean: ~9.3KByte
=> total: 1772K
time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg-360/{}.jpg
real 0m48.639s
user 0m44.121s
sys 0m4.568s
=> mean: ~28k
=> total: 5364K (3x of 180px batch)
quality=95
time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg2-360/{}.jpg
real 0m49.407s
user 0m44.607s
sys 0m4.869s
=> total: 9812K
quality=95
time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg2-180/{}.jpg
real 0m45.901s
user 0m41.486s
sys 0m4.591s
=> mean: 16.4K
=> total: 3116K
At the 180px size, the difference between default and quality=95 seems
indistinguishable visually to me, but is more than a doubling of file size.
Also tried at 300px and seems near-indistinguishable there as well.
At a mean of 10 Kbytes per file:
10 million -> 100 GBytes
100 million -> 1 Tbyte
Older COVID-19 thumbnails were about 400px wide:
pdftocairo -png -singlefile -scale-to-x 400 -scale-to-y -1
Display on scholar-qa.archive.org is about 135x181px
archive.org does 180px wide
Unclear if we should try to do double resolution for high DPI screens (eg,
apple "retina").
Using same size as archive.org probably makes the most sense: max 180px wide,
preserve aspect ratio. And jpeg improvement seems worth it.
#### Merlijn notes
From work on optimizing microfilm thumbnail images:
When possible, generate a thumbnail that fits well on the screen of the
user. Always creating a large thumbnail will result in the browsers
downscaling them, leading to fuzzy text. If it’s not possible, then create
the pick the resolution you’d want to support (1.5x or 2x scaling) and
create thumbnails of that size, but also apply the other recommendations
below - especially a sharpening filter.
Use bicubic or lanczos interpolation. Bilinear and nearest neighbour are
not OK.
For text, consider applying a sharpening filter. Not a strong one, but some
sharpening can definitely help.
## Appendix: PDF Info Fields
From `pdfinfo` manpage:
The ´Info' dictionary contains the following values:
title
subject
keywords
author
creator
producer
creation date
modification date
In addition, the following information is printed:
tagged (yes/no)
form (AcroForm / XFA / none)
javascript (yes/no)
page count
encrypted flag (yes/no)
print and copy permissions (if encrypted)
page size
file size
linearized (yes/no)
PDF version
metadata (only if requested)
For an example file, the output looks like:
Title: A mountable toilet system for personalized health monitoring via the analysis of excreta
Subject: Nature Biomedical Engineering, doi:10.1038/s41551-020-0534-9
Keywords:
Author: Seung-min Park
Creator: Springer
CreationDate: Thu Mar 26 01:26:57 2020 PDT
ModDate: Thu Mar 26 01:28:06 2020 PDT
Tagged: no
UserProperties: no
Suspects: no
Form: AcroForm
JavaScript: no
Pages: 14
Encrypted: no
Page size: 595.276 x 790.866 pts
Page rot: 0
File size: 6104749 bytes
Optimized: yes
PDF version: 1.4
For context on the `pdf_id` fields ("original" and "updated"), read:
<https://web.hypothes.is/blog/synchronizing-annotations-between-local-and-remote-pdfs/>
|