pdf thumbnail+text+meta proposal

author: Bryan Newbold <bnewbold@archive.org> 2020-06-17 13:30:24 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-06-17 13:30:24 -0700
commit: 28ded59c2f1e86a7f044d2e0e0fd7ecc9df09115 (patch)
tree: fcc4dead31b6140a92ab692927926e29ff6dc9a8 /proposals
parent: 2a42baa521b5a88863ac0575305a21195518da11 (diff)
download: sandcrawler-28ded59c2f1e86a7f044d2e0e0fd7ecc9df09115.tar.gz
sandcrawler-28ded59c2f1e86a7f044d2e0e0fd7ecc9df09115.zip
1 files changed, 327 insertions, 0 deletions
diff --git a/proposals/2020_pdf_meta_thumbnails.md b/proposals/2020_pdf_meta_thumbnails.md
new file mode 100644
index 0000000..d7578cb
--- /dev/null
+++ b/proposals/2020_pdf_meta_thumbnails.md
@@ -0,0 +1,327 @@
+
+status: work-in-progress
+
+New PDF derivatives: thumbnails, metadata, raw text
+===================================================
+
+To support scholar.archive.org (fulltext search) and other downstream uses of
+fatcat, we want to extract the following from many PDFs:
+
+- pdf structured metadata
+- thumbnail images
+- raw extracted text
+
+A single worker should extract all of these fields and publish them to two
+Kafka streams. Separate persist workers consume from the streams and push to
+SQL and/or seaweedfs.
+
+Additionally, this extraction should happen automatically for newly-crawled
+PDFs as part of the ingest pipeline. When possible, checks should be run
+against the existing SQL table to avoid duplication of processing.
+
+
+## PDF Metadata and Text
+
+Kafka topic (name: `sandcrawler-ENV.pdftext`; 12x partitions; gzip
+compression) JSON schema:
+
+    sha1hex (string; used as key)
+    status (string)
+    text (string)
+    page0_thumbnail (boolean)
+    meta_xml (string)
+    pdf_info (object)
+    pdf_extra (object)
+        word_count
+    file_meta (object)
+    source (object)
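A minimal sketch of how a worker might assemble one message for this topic, assuming the field names listed above. The helper name `build_pdftext_message`, the `"success"` status value, and the `size_bytes` sub-field are hypothetical and only illustrate the shape of the key and value.

```python
import hashlib
import json


def build_pdftext_message(pdf_bytes, text, status="success"):
    """Assemble a (key, value) pair for the pdftext topic (hypothetical helper)."""
    sha1hex = hashlib.sha1(pdf_bytes).hexdigest()
    message = {
        "sha1hex": sha1hex,
        "status": status,  # assumption: "success" is one possible status string
        "text": text,
        "page0_thumbnail": False,  # assumption: set True once a thumbnail exists
        "meta_xml": None,
        "pdf_info": {},
        "pdf_extra": {"word_count": len(text.split())},
        "file_meta": {"size_bytes": len(pdf_bytes)},  # assumed sub-field name
        "source": {},
    }
    # The SHA-1 hex digest doubles as the Kafka message key.
    key = sha1hex.encode("utf-8")
    value = json.dumps(message, sort_keys=True).encode("utf-8")
    return key, value


key, value = build_pdftext_message(b"%PDF-1.4 dummy", "hello pdf world")
```

Keying on `sha1hex` means that repeated extractions of the same file land in the same partition, which would also allow log compaction if that were ever enabled.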
+
+For the SQL table we should have columns for metadata fields that are *always*
+saved, and put a subset of other interesting fields in a JSON blob. We don't
+need all metadata fields in SQL. Full metadata/info will always be available in
+Kafka, and we don't want SQL table size to explode. Schema:
+
+    CREATE TABLE IF NOT EXISTS pdf_meta (
+        sha1hex             TEXT PRIMARY KEY CHECK (octet_length(sha1hex) = 40),
+        updated             TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
+        status              TEXT CHECK (octet_length(status) >= 1) NOT NULL,
+        page0_thumbnail     BOOLEAN NOT NULL,
+        page_count          INT CHECK (page_count >= 0),
+        word_count          INT CHECK (word_count >= 0),
+        page0_height        FLOAT CHECK (page0_height >= 0),
+        page0_width         FLOAT CHECK (page0_width >= 0),
+        permanent_id        TEXT CHECK (octet_length(permanent_id) >= 1),
+        creation_date       TIMESTAMP WITH TIME ZONE,
+        pdf_version         TEXT CHECK (octet_length(pdf_version) >= 1),
+        metadata            JSONB
+        -- maybe some analysis of available fields?
+        -- metadata JSON fields:
+        --    title
+        --    subject
+        --    author
+        --    creator
+        --    producer
+        --    CrossMarkDomains
+        --    doi
+        --    form
+        --    encrypted
+    );
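As an illustration of the "columns plus JSON blob" split described above, here is a rough sketch of how a persist worker might flatten one Kafka message into row values. The function name `pdf_meta_row`, the `pages` key inside `pdf_info`, and the exact set of keys kept in the blob are assumptions made for this example, not the actual sandcrawler implementation.

```python
import json


def pdf_meta_row(msg):
    """Flatten one pdftext message into pdf_meta column values (illustrative)."""
    info = msg.get("pdf_info") or {}
    extra = msg.get("pdf_extra") or {}
    # Assumption: only this subset of pdf_info is copied into the JSONB column.
    keep = ("title", "subject", "author", "creator", "producer", "doi")
    metadata = {k: info[k] for k in keep if info.get(k)}
    return {
        "sha1hex": msg["sha1hex"],
        "status": msg["status"],
        "page0_thumbnail": bool(msg.get("page0_thumbnail")),
        "page_count": info.get("pages"),  # assumed key name
        "word_count": extra.get("word_count"),
        # The full text is deliberately left out; it goes to the blob store.
        "metadata": json.dumps(metadata, sort_keys=True) if metadata else None,
    }


row = pdf_meta_row({
    "sha1hex": "a" * 40,
    "status": "success",
    "text": "long extracted text",
    "pdf_info": {"title": "T", "keywords": "", "pages": 14},
    "pdf_extra": {"word_count": 3},
})
```

Empty values are dropped from the blob so that the SQL table stays small, in line with the goal of not storing every metadata field.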
+
+
+## Thumbnail Images
+
+The Kafka schema is raw image bytes as the message body, with the sha1sum of
+the PDF as the key. No compression, 12x partitions.
+
+Topic name is `sandcrawler-ENV.thumbnail-SIZE-png`. Thus, the topic name
+contains the "metadata" of thumbnail size/shape.
+
+We have decided to use JPEG thumbnails, 180px wide (and max 300px high, though
+the width restriction is almost always the limiting factor). This size matches
+that used on archive.org, and is slightly larger than the thumbnails currently
+used on the scholar.archive.org prototype. We intend to tweak the
+scholar.archive.org CSS to use the full/raw thumbnail image at max desktop
+size. At this size it would be difficult (though maybe not impossible?) to
+extract text from the thumbnail image (other than large-font titles).
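To make the "width is almost always the limiting factor" point concrete, here is a small sketch of the bounding-box arithmetic. The function name `fit_within` is made up for this example, and the rounding here may differ by a pixel from what an imaging library such as Pillow produces.

```python
def fit_within(width, height, max_width=180, max_height=300):
    """Scale (width, height) to fit a bounding box, preserving aspect ratio.

    Never upscales: pages already smaller than the box are returned as-is.
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


# Page size of the example PDF in the appendix (595.276 x 790.866 pts):
portrait = fit_within(595.276, 790.866)
# A very tall, narrow page is the rare case where height is the limit:
tall = fit_within(200, 1000)
```

For the portrait page the width constraint applies, giving a 180x239 thumbnail; only pages taller than a 3:5 aspect ratio hit the 300px height limit.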
+
+
+### Implementation
+
+We use the `poppler` C++ library (via the `python-poppler` wrapper) to extract and convert everything.
+
+Some example usage of the `python-poppler` library:
+
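+    import poppler
+    from PIL import Image
+
+    # load the document and inspect basic metadata
+    pdf = poppler.load_from_file("/home/bnewbold/10.1038@s41551-020-0534-9.pdf")
+    pdf.pdf_id
+    page = pdf.create_page(0)
+    page.page_rect().width
+
+    # render the first page; poppler's default argb32 buffer is laid out as BGRA bytes
+    renderer = poppler.PageRenderer()
+    full_page = renderer.render_page(page)
+    img = Image.frombuffer("RGBA", (full_page.width, full_page.height), full_page.data, 'raw', "BGRA")
+    # downscale in place (aspect ratio is preserved)
+    img.thumbnail((180,300), Image.BICUBIC)
+    # JPEG has no alpha channel, so convert to RGB before saving
+    img.convert("RGB").save("something.jpg")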
+
+## Deployment and Infrastructure
+
+Deployment will involve:
+
+- sandcrawler DB SQL table
+    => guesstimate size 100 GByte for hundreds of millions of PDFs
+- postgrest/SQL access to new table for internal HTTP API hits
+- seaweedfs raw text folder
+    => reuse existing bucket with GROBID XML; same access restrictions on content
+- seaweedfs thumbnail bucket
+    => new bucket for this world-public content
+- public nginx access to seaweed thumbnail bucket
+- extraction work queue kafka topic
+    => same schema/semantics as ungrobided
+- text/metadata kafka topic
+- thumbnail kafka topic
+- text/metadata persist worker(s)
+    => from kafka; metadata to SQL database; text to seaweedfs blob store
+- thumbnail persist worker
+    => from kafka to seaweedfs blob store
+- pdf extraction worker pool
+    => very similar to GROBID worker pool
+- ansible roles for all of the above
+
+Plan for processing/catchup is:
+
+- test with COVID-19 PDF corpus
+- run extraction on all current fatcat files available via IA
+- integrate with ingest pipeline for all new files
+- run a batch catchup job over all GROBID-parsed files with no pdf meta
+  extracted, on basis of SQL table query
+
+## Appendix: Thumbnail Size and Format Experimentation
+
+Using 190 PDFs from `/data/pdfs/random_crawl/files` on my laptop to test.
+
+TODO: actually, 4x images failed to convert with pdftocairo; this throws off
+"mean" sizes by a small amount.
+
+    time ls | parallel -j1 pdftocairo -singlefile -scale-to 200 -png {} /tmp/test-png/{}.png
+    real    0m29.314s
+    user    0m26.794s
+    sys     0m2.484s
+    => missing: 4
+    => min: 0.8K
+    => max: 57K
+    => mean: 16.4K
+    => total: 3120K
+
+    time ls | parallel -j1 pdftocairo -singlefile -scale-to 200 -jpeg {} /tmp/test-jpeg/{}.jpg
+    real    0m26.289s
+    user    0m24.022s
+    sys     0m2.490s
+    => missing: 4
+    => min: 1.2K
+    => max: 13K
+    => mean: 8.02K
+    => total: 1524K
+
+    time ls | parallel -j1 pdftocairo -singlefile -scale-to 200 -jpeg -jpegopt optimize=y,quality=80 {} /tmp/test-jpeg2/{}.jpg
+    real    0m27.401s
+    user    0m24.941s
+    sys     0m2.519s
+    => missing: 4
+    => min: 577
+    => max: 14K
+    => mean:
+    => total: 1540K
+
+    time ls | parallel -j1 convert -resize 200x200 {}[0] /tmp/magick-png/{}.png
+    real    1m19.399s
+    user    1m17.150s
+    sys     0m6.322s
+    => missing: 4
+    => min: 1.1K
+    => max: 325K
+    => mean:
+    => total: 8476K
+
+    time ls | parallel -j1 convert -resize 200x200 {}[0] /tmp/magick-jpeg/{}.jpg
+    real    1m21.766s
+    user    1m17.040s
+    sys     0m7.155s
+    => total: 3484K
+
+NOTE: the following `pdf_thumbnail.py` images are somewhat smaller than the above
+jpg and pngs (max 180px wide, not 200px wide)
+
+    time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-png/{}.png
+    real    0m48.198s
+    user    0m42.997s
+    sys     0m4.509s
+    => missing: 2; 2x additional stub images
+    => total: 5904K
+
+    time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg/{}.jpg
+    real    0m45.252s
+    user    0m41.232s
+    sys     0m4.273s
+    => min: 1.4K
+    => max: 16K
+    => mean: ~9.3KByte
+    => total: 1772K
+
+    time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg-360/{}.jpg
+    real    0m48.639s
+    user    0m44.121s
+    sys     0m4.568s
+    => mean: ~28K
+    => total: 5364K (3x of 180px batch)
+
+    quality=95
+    time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg2-360/{}.jpg
+    real    0m49.407s
+    user    0m44.607s
+    sys     0m4.869s
+    => total: 9812K
+
+    quality=95
+    time ls | parallel -j1 ~/code/sandcrawler/python/scripts/pdf_thumbnail.py {} /tmp/python-jpg2-180/{}.jpg
+    real    0m45.901s
+    user    0m41.486s
+    sys     0m4.591s
+    => mean: 16.4K
+    => total: 3116K
+
+At the 180px size, the difference between default and quality=95 seems
+indistinguishable visually to me, but quality=95 is roughly 1.8x the file size
+(3116K vs. 1772K total). Also tried at 300px and it seems
+near-indistinguishable there as well.
+
+At a mean of 10 Kbytes per file:
+
+    10  million -> 100 GBytes
+    100 million -> 1 Tbyte
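The estimate above is simple multiplication; the sketch below spells it out, using decimal units (1 GByte = 1,000,000 KBytes). The function name `estimate_storage_gbytes` is made up for this example. It also shows that the reported means match total size divided by all 190 input files.

```python
def estimate_storage_gbytes(file_count, mean_kbytes=10):
    """Estimate total thumbnail storage in GBytes (decimal units)."""
    return file_count * mean_kbytes / 1_000_000


ten_million = estimate_storage_gbytes(10_000_000)       # 100.0 GBytes
hundred_million = estimate_storage_gbytes(100_000_000)  # 1000.0 GBytes = 1 TByte

# Mean of the 180px JPEG batch: 1772K total over 190 input files.
mean_180px_jpeg = round(1772 / 190, 1)
```

Rounding the measured ~9.3 KByte mean up to 10 KBytes leaves a little headroom in the estimate.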
+
+Older COVID-19 thumbnails were about 400px wide:
+
+    pdftocairo -png -singlefile -scale-to-x 400 -scale-to-y -1
+
+Display on scholar-qa.archive.org is about 135x181px
+
+archive.org does 180px wide
+
+Unclear if we should try to do double resolution for high DPI screens (eg,
+apple "retina").
+
+Using the same size as archive.org probably makes the most sense: max 180px
+wide, preserving aspect ratio. The file size savings of JPEG over PNG also
+seem worth it.
+
+#### Merlijn notes
+
+From work on optimizing microfilm thumbnail images:
+
+    When possible, generate a thumbnail that fits well on the screen of the
+    user.  Always creating a large thumbnail will result in the browsers
+    downscaling them, leading to fuzzy text. If it’s not possible, then pick
+    the resolution you’d want to support (1.5x or 2x scaling) and create
+    thumbnails of that size, but also apply the other recommendations below -
+    especially a sharpening filter.
+
+    Use bicubic or lanczos interpolation. Bilinear and nearest neighbour are
+    not OK.
+
+    For text, consider applying a sharpening filter. Not a strong one, but some
+    sharpening can definitely help.
+
+
+## Appendix: PDF Info Fields
+
+From `pdfinfo` manpage:
+
+    The 'Info' dictionary contains the following values:
+
+        title
+        subject
+        keywords
+        author
+        creator
+        producer
+        creation date
+        modification date
+
+    In addition, the following information is printed:
+
+        tagged (yes/no)
+        form (AcroForm / XFA / none)
+        javascript (yes/no)
+        page count
+        encrypted flag (yes/no)
+        print and copy permissions (if encrypted)
+        page size
+        file size
+        linearized (yes/no)
+        PDF version
+        metadata (only if requested)
+
+For an example file, the output looks like:
+
+    Title:          A mountable toilet system for personalized health monitoring via the analysis of excreta
+    Subject:        Nature Biomedical Engineering, doi:10.1038/s41551-020-0534-9
+    Keywords:       
+    Author:         Seung-min Park
+    Creator:        Springer
+    CreationDate:   Thu Mar 26 01:26:57 2020 PDT
+    ModDate:        Thu Mar 26 01:28:06 2020 PDT
+    Tagged:         no
+    UserProperties: no
+    Suspects:       no
+    Form:           AcroForm
+    JavaScript:     no
+    Pages:          14
+    Encrypted:      no
+    Page size:      595.276 x 790.866 pts
+    Page rot:       0
+    File size:      6104749 bytes
+    Optimized:      yes
+    PDF version:    1.4
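When experimenting with the `pdfinfo` command line tool (as opposed to the library bindings), output in the format above can be turned into a dictionary with a few lines of code. This is a rough sketch only; the function name `parse_pdfinfo` is made up, and all values are kept as plain strings.

```python
def parse_pdfinfo(output):
    """Parse "Key:   value" lines as printed by pdfinfo into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only; values such as dates contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        info[key.strip()] = value.strip()
    return info


sample = "\n".join([
    "Keywords:       ",
    "CreationDate:   Thu Mar 26 01:26:57 2020 PDT",
    "Pages:          14",
    "Page size:      595.276 x 790.866 pts",
    "PDF version:    1.4",
])
info = parse_pdfinfo(sample)
```

Splitting on the first colon keeps timestamps like `CreationDate` intact, and fields that are present but empty (such as `Keywords` here) come back as empty strings.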
+
+For context on the `pdf_id` fields ("original" and "updated"), read:
+<https://web.hypothes.is/blog/synchronizing-annotations-between-local-and-remote-pdfs/>