author: Bryan Newbold <bnewbold@archive.org> 2019-12-11 18:20:27 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2019-12-11 18:20:27 -0800
commit: 85cfba5325a4c587f524c16499b4ab9f48de07c5
tree: 3c8888833b21975e6b2cb241af9eb229ed5b2723
parent: 91f5f53c90742c80890e3bd44fdc9044555b8209
pdftotext proposal
-rw-r--r-- proposals/2019_pdftotext_pdfinfo.md | 123
1 files changed, 123 insertions, 0 deletions
diff --git a/proposals/2019_pdftotext_pdfinfo.md b/proposals/2019_pdftotext_pdfinfo.md
new file mode 100644
index 0000000..ed731a4
--- /dev/null
+++ b/proposals/2019_pdftotext_pdfinfo.md
@@ -0,0 +1,123 @@
+
+status: brainstorming/backburner
+
+last updated: 2019-12-11
+
+This document proposes extracting text and metadata from PDFs at ingest time
+using pdftotext and pdfinfo, and storing the results in SQL and minio.
+
+This isn't a priority at the moment. It could be useful for fulltext search
+when GROBID fails, and the pdfinfo output might help with other quality checks.
+
+## Overview / Motivation
+
+`pdfinfo` and `pdftotext` can both be run quickly over raw PDFs. In
+sandcrawler, fetching PDFs can be a bit slow, so the motivation for caching the
+text is simply to avoid fetching the same PDFs over and over. The metadata is
+useful to store and index at scale.
+
+## pdfinfo output
+
+Example PDF info outputs:
+
+    Creator: PDF Suite 2010
+    Producer: PDF Suite 2010
+    CreationDate: Tue Sep 24 23:03:58 2013 PDT
+    ModDate: Tue Sep 24 23:03:58 2013 PDT
+    Tagged: no
+    UserProperties: no
+    Suspects: no
+    Form: none
+    JavaScript: no
+    Pages: 17
+    Encrypted: no
+    Page size: 612 x 792 pts (letter)
+    Page rot: 0
+    File size: 105400 bytes
+    Optimized: no
+    PDF version: 1.4
+
+another:
+
+    Title: Miscellanea Zoologica Hungarica 8. 1993 (Budapest, 1993)
+    Author: L. Forró szerk.
+    Producer: ABBYY FineReader 9.0 Corporate Edition
+    CreationDate: Wed Apr 13 05:30:21 2011 PDT
+    ModDate: Wed Apr 13 09:53:27 2011 PDT
+    Tagged: yes
+    UserProperties: no
+    Suspects: no
+    Form: AcroForm
+    JavaScript: no
+    Pages: 13
+    Encrypted: no
+    Page size: 473.76 x 678.42 pts
+    Page rot: 0
+    File size: 12047270 bytes
+    Optimized: no
+    PDF version: 1.6
+
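The plain-text output above is a simple `Key: value` listing, so it can be turned into a dictionary with very little code. The following is a minimal sketch, assuming each field fits on one line and that only the first colon separates key from value (dates contain colons of their own). `parse_pdfinfo` is a hypothetical helper name, not existing sandcrawler code.

```python
def parse_pdfinfo(output: str) -> dict:
    """Parse the plain-text output of `pdfinfo` into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only: values such as dates contain colons.
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue  # skip blank or malformed lines
        info[key.strip()] = value.strip()
    return info


# A few fields copied from the first example above.
SAMPLE = """\
Creator:        PDF Suite 2010
CreationDate:   Tue Sep 24 23:03:58 2013 PDT
Form:           none
Pages:          17
Page size:      612 x 792 pts (letter)
PDF version:    1.4
"""

info = parse_pdfinfo(SAMPLE)
pages = int(info["Pages"])         # numeric fields need explicit conversion
has_form = info["Form"] != "none"  # "none" means the PDF has no form
```

All values stay strings until converted, which keeps the parser tolerant of fields it has never seen before.
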
+With the `-meta` flag, you get XML output, which also includes:
+
+    <xmpMM:DocumentID>uuid:cd1a8daa-61e1-48f4-b679-26eac52bb6a9</xmpMM:DocumentID>
+    <xmpMM:InstanceID>uuid:dea54c78-8bc6-4f2f-a665-4cd7e62457e7</xmpMM:InstanceID>
+
+The document id is particularly interesting for fatcat/sandcrawler. Apparently
+it is generated randomly (or possibly derived from an md5?) when the first
+version of the file is created, and persists across edits. A quality check
+would be that all files with the same `document_id` are clustered under the
+same fatcat work.
+
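The clustering check described above could be sketched roughly as follows. The input layout, a list of `(sha1hex, document_id, work_id)` tuples, and the function name `find_split_documents` are assumptions made for illustration. How those tuples would be obtained from fatcat is not covered here.

```python
from collections import defaultdict


def find_split_documents(files):
    """Return {document_id: set of work ids} for ids spread over >1 work.

    `files` is an iterable of (sha1hex, document_id, work_id) tuples; this
    tuple layout is an assumption made for the sketch.
    """
    works_by_doc = defaultdict(set)
    for _sha1hex, document_id, work_id in files:
        if document_id:  # files without an XMP DocumentID are skipped
            works_by_doc[document_id].add(work_id)
    return {doc: works for doc, works in works_by_doc.items() if len(works) > 1}


# Placeholder rows: the sha1hex and work ids are made up.
files = [
    ("aaa", "uuid:cd1a8daa-61e1-48f4-b679-26eac52bb6a9", "work_1"),
    ("bbb", "uuid:cd1a8daa-61e1-48f4-b679-26eac52bb6a9", "work_2"),
    ("ccc", "uuid:00000000-0000-0000-0000-000000000000", "work_3"),
    ("ddd", None, "work_4"),
]
suspicious = find_split_documents(files)
```

Any `document_id` that shows up in the result maps to more than one work and would be a candidate for manual review or merging.
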
+All the info fields could probably be combined and used in categorization and
+filtering (ML or heuristic). Eg, a PDF with forms is probably not research
+output; published PDFs with specific "Producer" software probably are.
+
+## Fatcat Changes
+
+These could go in entity fields, in a `pdfinfo` JSONB field, or in the
+existing `extra` field:
+
+- pages
+- words
+- document id
+- page size
+- created
+- other meta (eg, PDF title, author, etc)
+
+All of these fields are, I assume, deterministic, and thus appropriate for
+inclusion in fatcat.
+
+## New SQL Tables
+
+    CREATE TABLE IF NOT EXISTS pdftotext (
+        sha1hex             TEXT PRIMARY KEY CHECK (octet_length(sha1hex) = 40),
+        updated             TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
+        tool_version        TEXT CHECK (octet_length(tool_version) >= 1),
+        text_success        BOOLEAN NOT NULL,
+        text_words          INT,
+        info_success        BOOLEAN NOT NULL,
+        pages               INT,
+        pdf_created         TIMESTAMP WITH TIME ZONE,
+        document_id         TEXT CHECK (octet_length(document_id) >= 1), -- XXX: always UUID?
+        metadata            JSONB
+        -- metadata contains any other stuff from pdfinfo:
+        --   title
+        --   author
+        --   pdf version
+        --   page size (?)
+        --   instance_id
+    );
+    -- index name must differ from the table name (shared namespace in postgres)
+    -- CREATE INDEX pdftotext_document_id_idx ON pdftotext(document_id);
+
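To make the column semantics concrete, here is a rough sketch of how one row for the table above might be assembled from the tool outputs. The helper name `build_pdftotext_row`, the `METADATA_FIELDS` mapping, and the default `tool_version` value are all assumptions. `updated` is left to the SQL default, and `pdf_created` is left NULL because date parsing is out of scope for this sketch.

```python
import json

# pdfinfo field name -> key inside the `metadata` JSONB column (assumed naming)
METADATA_FIELDS = {
    "Title": "title",
    "Author": "author",
    "PDF version": "pdf_version",
    "Page size": "page_size",
}


def build_pdftotext_row(sha1hex, text, info, document_id=None,
                        tool_version="unknown"):
    """Build a dict with one entry per column of the proposed table.

    `text` is the pdftotext output (None on failure) and `info` is a dict of
    parsed pdfinfo fields (None on failure).
    """
    info_ok = info is not None
    fields = info or {}
    metadata = {out: fields[src] for src, out in METADATA_FIELDS.items()
                if src in fields}
    return {
        "sha1hex": sha1hex,
        "tool_version": tool_version,
        "text_success": text is not None,
        "text_words": len(text.split()) if text is not None else None,
        "info_success": info_ok,
        "pages": int(fields["Pages"]) if "Pages" in fields else None,
        "pdf_created": None,  # not parsed in this sketch
        "document_id": document_id,
        "metadata": json.dumps(metadata, sort_keys=True) if info_ok else None,
    }


row = build_pdftotext_row(
    "da39a3ee5e6b4b0d3255bfef95601890afd80709",  # placeholder sha1hex
    "some extracted   text\nacross two lines",
    {"Pages": "17", "PDF version": "1.4", "Producer": "PDF Suite 2010"},
)
```

Counting words with `str.split()` is only one possible definition of `text_words`; the proposal does not say how words should be counted.
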
+## New Kafka Topics
+
+    sandcrawler-ENV.pdftotext-output
+
+Key would be sha1hex of PDF.
+
+Schema would match the SQL table, plus the full raw PDF text output.
+
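A message for this topic might be assembled as in the sketch below. JSON encoding, the `text` field name for the raw output, and the example environment name `qa` are assumptions. The proposal only fixes the topic pattern, the key, and the rule that the schema matches the SQL table plus the raw text. No Kafka client is involved here; the function only prepares the topic, key, and value.

```python
import json


def make_kafka_message(env, row, text):
    """Return (topic, key, value) for one pdftotext result.

    `row` is a dict with the SQL table's columns; the raw text is added
    under a `text` key (an assumed field name).
    """
    topic = f"sandcrawler-{env}.pdftotext-output"
    key = row["sha1hex"].encode("utf-8")
    payload = dict(row, text=text)
    value = json.dumps(payload, sort_keys=True).encode("utf-8")
    return topic, key, value


topic, key, value = make_kafka_message(
    "qa",  # hypothetical environment name
    {
        "sha1hex": "da39a3ee5e6b4b0d3255bfef95601890afd80709",  # placeholder
        "text_success": True,
        "pages": 17,
    },
    "raw text from pdftotext",
)
```
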
+## New Minio Stuff
+
+    /pdftotext/<hexbyte0>/<hexbyte1>/<sha1hex>.txt
+
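The object path above can be derived from the hash alone. This sketch assumes `<hexbyte0>` and `<hexbyte1>` mean the first and second bytes of the SHA-1 as two hex characters each, which is an interpretation of the pattern rather than something the proposal spells out. `minio_text_path` is a hypothetical helper name.

```python
def minio_text_path(sha1hex: str) -> str:
    """Build the proposed minio object path for a PDF's extracted text."""
    sha1hex = sha1hex.lower()
    if len(sha1hex) != 40 or any(c not in "0123456789abcdef" for c in sha1hex):
        raise ValueError("expected a 40-character hex SHA-1")
    # Two levels of sharding: first byte, then second byte, as hex pairs.
    return f"/pdftotext/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}.txt"


# Placeholder hash (the SHA-1 of an empty input), used only for illustration.
path = minio_text_path("da39a3ee5e6b4b0d3255bfef95601890afd80709")
```
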
+## Open Questions
+