status: brainstorming/backburner

last updated: 2019-12-11

This document proposes changes to extract text and metadata from PDFs at ingest
time using pdftotext and pdfinfo, and to store this content in SQL and minio.

This isn't a priority at the moment. It could be useful for fulltext search when
GROBID fails, and the pdfinfo output might help with other quality checks.

## Overview / Motivation

`pdfinfo` and `pdftotext` can both be run quickly over raw PDFs. In
sandcrawler, fetching PDFs can be a bit slow, so the motivation for caching the
text is just to not have to fetch the PDFs over and over. The metadata is also
useful to store and index at scale.

## pdfinfo output

Example PDF info outputs:

    Creator:        PDF Suite 2010
    Producer:       PDF Suite 2010
    CreationDate:   Tue Sep 24 23:03:58 2013 PDT
    ModDate:        Tue Sep 24 23:03:58 2013 PDT
    Tagged:         no
    UserProperties: no
    Suspects:       no
    Form:           none
    JavaScript:     no
    Pages:          17
    Encrypted:      no
    Page size:      612 x 792 pts (letter)
    Page rot:       0
    File size:      105400 bytes
    Optimized:      no
    PDF version:    1.4

another:

    Title:          Miscellanea Zoologica Hungarica 8. 1993 (Budapest, 1993)
    Author:         L. Forró szerk.
    Producer:       ABBYY FineReader 9.0 Corporate Edition
    CreationDate:   Wed Apr 13 05:30:21 2011 PDT
    ModDate:        Wed Apr 13 09:53:27 2011 PDT
    Tagged:         yes
    UserProperties: no
    Suspects:       no
    Form:           AcroForm
    JavaScript:     no
    Pages:          13
    Encrypted:      no
    Page size:      473.76 x 678.42 pts
    Page rot:       0
    File size:      12047270 bytes
    Optimized:      no
    PDF version:    1.6

With the `-meta` flag, you get XML output, which also includes:

    uuid:cd1a8daa-61e1-48f4-b679-26eac52bb6a9
    uuid:dea54c78-8bc6-4f2f-a665-4cd7e62457e7

The document id is particularly interesting for fatcat/sandcrawler. Apparently
it is randomly generated from (or based on an md5 of?) the first version of the
file, and persists across edits. A quality check would be that all files with
the same `document_id` should be clustered under the same fatcat work.

All the info fields could probably be combined and used in categorization and
filtering (ML or heuristic).
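
As a rough illustration of how the plain-text `pdfinfo` fields above could be turned into something usable for heuristics, here is a minimal sketch. The function names (`run_pdfinfo`, `parse_pdfinfo`, `has_form`) are hypothetical and not part of sandcrawler; the sketch assumes the `pdfinfo` binary is on `PATH` and that its output is one `Key: value` pair per line, as in the examples above.

```python
import subprocess
from typing import Dict


def run_pdfinfo(pdf_path: str) -> str:
    """Run the `pdfinfo` CLI on a local file and return its text output.

    Hypothetical helper: assumes `pdfinfo` is installed and on PATH.
    """
    proc = subprocess.run(["pdfinfo", pdf_path], capture_output=True, check=True)
    # Metadata strings are not guaranteed to be valid UTF-8, so decode leniently
    return proc.stdout.decode("utf-8", errors="replace")


def parse_pdfinfo(output: str) -> Dict[str, str]:
    """Parse `Key: value` lines from pdfinfo output into a dict of strings."""
    info: Dict[str, str] = {}
    for line in output.splitlines():
        # Split on the first colon only: values such as dates contain colons
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a key/value separator
        info[key.strip()] = value.strip()
    return info


def has_form(info: Dict[str, str]) -> bool:
    """Heuristic input: True if pdfinfo reported any form type other than 'none'."""
    return info.get("Form", "none").lower() != "none"


# Trimmed copy of the first example output above
SAMPLE_OUTPUT = """\
Creator:        PDF Suite 2010
Producer:       PDF Suite 2010
CreationDate:   Tue Sep 24 23:03:58 2013 PDT
Form:           none
Pages:          17
Page size:      612 x 792 pts (letter)
PDF version:    1.4
"""

info = parse_pdfinfo(SAMPLE_OUTPUT)
pages = int(info["Pages"])
looks_like_form = has_form(info)
```

All values are kept as strings here; converting `Pages` to an integer or the dates to timestamps would be a separate step, and how to weigh a field like `Form` or `Producer` is left open.
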
For example, a PDF with forms is probably not research output; published PDFs
with specific "Producer" software probably are.

## Fatcat Changes

Could include the following in entity fields, either in a new `pdfinfo` JSONB
field or in the existing `extra`:

- pages
- words
- document id
- page size
- created
- other meta (eg, PDF title, author, etc)

All of these fields are, I assume, deterministic, and thus appropriate for
inclusion in fatcat.

## New SQL Tables

    CREATE TABLE IF NOT EXISTS pdftotext (
        sha1hex             TEXT PRIMARY KEY CHECK (octet_length(sha1hex) = 40),
        updated             TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        tool_version        TEXT CHECK (octet_length(tool_version) >= 1),
        text_success        BOOLEAN NOT NULL,
        text_words          INT,
        info_success        BOOLEAN NOT NULL,
        pages               INT,
        pdf_created         TIMESTAMP WITH TIME ZONE,
        document_id         TEXT CHECK (octet_length(document_id) >= 1), -- XXX: always UUID?
        metadata            JSONB
        -- metadata contains any other stuff from pdfinfo:
        --   title
        --   author
        --   pdf version
        --   page size (?)
        --   instance_id
    );
    -- index name must differ from the table name (they share a namespace in postgres)
    -- CREATE INDEX pdftotext_document_id_idx ON pdftotext(document_id);

## New Kafka Topics

    sandcrawler-ENV.pdftotext-output

Key would be sha1hex of PDF.

Schema would match the SQL table, plus the full raw PDF text output.

## New Minio Stuff

    /pdftotext///.txt

## Open Questions