status: brainstorming/backburner
last updated: 2019-12-11

This document proposes changes to extract text and metadata from PDFs at ingest
time using `pdftotext` and `pdfinfo`, and to store this content in SQL and minio.

This isn't a priority at the moment. It could be useful for fulltext search when
GROBID fails, and the `pdfinfo` output might help with other quality checks.

## Overview / Motivation

`pdfinfo` and `pdftotext` can both be run quickly over raw PDFs. In
sandcrawler, fetching PDFs can be slow, so the main motivation for caching the
extracted text is to avoid fetching the same PDFs over and over. The metadata
is also useful to store and index at scale.
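
As a minimal sketch of the ingest-time step, both tools could be shelled out
to with something like the following (the helper name, timeout, and return
layout are assumptions, not existing sandcrawler code):

    import subprocess

    def extract_pdf(pdf_path: str) -> dict:
        """Run pdftotext and pdfinfo (both from poppler-utils) over one PDF.

        Returns the raw extracted text plus pdfinfo key/value pairs; this is
        a sketch, not the actual sandcrawler worker.
        """
        # 'pdftotext <file> -' writes the extracted text to stdout
        text_proc = subprocess.run(
            ["pdftotext", pdf_path, "-"],
            capture_output=True, timeout=60,
        )
        # plain 'pdfinfo <file>' prints "Key: value" lines like the examples below
        info_proc = subprocess.run(
            ["pdfinfo", pdf_path],
            capture_output=True, timeout=60,
        )
        info = {}
        for line in info_proc.stdout.decode("utf-8", "replace").splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                info[key.strip()] = value.strip()
        return {
            "text_success": text_proc.returncode == 0,
            "info_success": info_proc.returncode == 0,
            "text": text_proc.stdout.decode("utf-8", "replace"),
            "info": info,
        }
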
## pdfinfo output

Example `pdfinfo` outputs:

    Creator: PDF Suite 2010
    Producer: PDF Suite 2010
    CreationDate: Tue Sep 24 23:03:58 2013 PDT
    ModDate: Tue Sep 24 23:03:58 2013 PDT
    Tagged: no
    UserProperties: no
    Suspects: no
    Form: none
    JavaScript: no
    Pages: 17
    Encrypted: no
    Page size: 612 x 792 pts (letter)
    Page rot: 0
    File size: 105400 bytes
    Optimized: no
    PDF version: 1.4

another:

    Title: Miscellanea Zoologica Hungarica 8. 1993 (Budapest, 1993)
    Author: L. Forró szerk.
    Producer: ABBYY FineReader 9.0 Corporate Edition
    CreationDate: Wed Apr 13 05:30:21 2011 PDT
    ModDate: Wed Apr 13 09:53:27 2011 PDT
    Tagged: yes
    UserProperties: no
    Suspects: no
    Form: AcroForm
    JavaScript: no
    Pages: 13
    Encrypted: no
    Page size: 473.76 x 678.42 pts
    Page rot: 0
    File size: 12047270 bytes
    Optimized: no
    PDF version: 1.6

With the `-meta` flag, you get XML output, which also includes:

    <xmpMM:DocumentID>uuid:cd1a8daa-61e1-48f4-b679-26eac52bb6a9</xmpMM:DocumentID>
    <xmpMM:InstanceID>uuid:dea54c78-8bc6-4f2f-a665-4cd7e62457e7</xmpMM:InstanceID>

The document id is particularly interesting for fatcat/sandcrawler. It is
apparently generated randomly (or possibly derived from an MD5?) for the first
version of the file, and persists across edits. A quality check would be that
all files with the same `document_id` should be clustered under the same
fatcat work.
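
A sketch of pulling the document id out of the `-meta` XML (hypothetical
helper, not existing sandcrawler code; the element name is as shown above):

    import re
    import subprocess
    from typing import Optional

    def pdf_document_id(pdf_path: str) -> Optional[str]:
        """Extract xmpMM:DocumentID from 'pdfinfo -meta' XMP output, if present."""
        proc = subprocess.run(["pdfinfo", "-meta", pdf_path], capture_output=True)
        xml = proc.stdout.decode("utf-8", "replace")
        match = re.search(r"<xmpMM:DocumentID>(?:uuid:)?([^<]+)</xmpMM:DocumentID>", xml)
        return match.group(1) if match else None
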
All the info fields could probably be combined and used in categorization and
filtering (ML or heuristic). Eg, a PDF with forms is probably not research
output; published PDFs with specific "Producer" software probably are.
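
For illustration only, a crude heuristic filter along those lines might look
like the sketch below; the fields checked and the thresholds are invented,
not measured, and each is at best a weak signal:

    def probably_research_output(info: dict) -> bool:
        """Toy heuristic over pdfinfo fields; illustrative only."""
        # fillable forms (AcroForm/XFA) are a weak hint of administrative paperwork
        if info.get("Form", "none") not in ("none", ""):
            return False
        # very short PDFs are often covers, invoices, or supplements
        if int(info.get("Pages", "0") or 0) < 2:
            return False
        return True
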
## Fatcat Changes

These could be included as new entity fields, in a `pdfinfo` JSONB field, or in
the existing `extra` metadata:

- pages
- words
- document id
- page size
- created
- other meta (eg, PDF title, author, etc)

All of these fields are, I assume, deterministic, and thus appropriate for
inclusion in fatcat.
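
As an illustration, a file entity's `extra` (or a dedicated `pdfinfo` field)
might carry something like the following; the field names and word count are
purely illustrative, and the other values are loosely taken from the `pdfinfo`
examples above:

    # illustrative only; not a finalized fatcat schema
    pdfinfo_extra = {
        "pages": 13,
        "words": 4200,  # hypothetical pdftotext word count
        "document_id": "cd1a8daa-61e1-48f4-b679-26eac52bb6a9",
        "page_size": "473.76 x 678.42 pts",
        "created": "2011-04-13T05:30:21-07:00",
        "title": "Miscellanea Zoologica Hungarica 8. 1993 (Budapest, 1993)",
        "author": "L. Forró szerk.",
    }
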
## New SQL Tables

    CREATE TABLE IF NOT EXISTS pdftotext (
        sha1hex         TEXT PRIMARY KEY CHECK (octet_length(sha1hex) = 40),
        updated         TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        tool_version    TEXT CHECK (octet_length(tool_version) >= 1),
        text_success    BOOLEAN NOT NULL,
        text_words      INT,
        info_success    BOOLEAN NOT NULL,
        pages           INT,
        pdf_created     TIMESTAMP WITH TIME ZONE,
        document_id     TEXT CHECK (octet_length(document_id) >= 1), -- XXX: always UUID?
        metadata        JSONB
        -- metadata contains any other stuff from pdfinfo:
        --   title
        --   author
        --   pdf version
        --   page size (?)
        --   instance_id
    );
    -- CREATE INDEX pdftotext_document_id_idx ON pdftotext(document_id);

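A minimal sketch of a worker writing a row into this table, assuming psycopg2
(the helper function and row dict layout are hypothetical):

    import psycopg2.extras

    def insert_pdftotext_row(conn, row: dict) -> None:
        """Insert one extraction result into the pdftotext table (sketch)."""
        with conn.cursor() as cur:
            cur.execute("""
                INSERT INTO pdftotext
                    (sha1hex, tool_version, text_success, text_words,
                     info_success, pages, pdf_created, document_id, metadata)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
                ON CONFLICT (sha1hex) DO NOTHING
            """, (
                row["sha1hex"],
                row.get("tool_version"),
                row["text_success"],
                row.get("text_words"),
                row["info_success"],
                row.get("pages"),
                row.get("pdf_created"),
                row.get("document_id"),
                psycopg2.extras.Json(row.get("metadata") or {}),
            ))
        conn.commit()
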
## New Kafka Topics

    sandcrawler-ENV.pdftotext-output

The key would be the sha1hex of the PDF. The message schema would match the
SQL table, plus the full raw text output from `pdftotext`.
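
A rough sketch of producing such a message, assuming the `confluent_kafka`
client (the helper and field layout are hypothetical; topic name per above):

    import json
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "localhost:9092"})

    def publish_pdftotext(result: dict, env: str = "dev") -> None:
        """Publish one extraction result, keyed by PDF sha1hex (sketch)."""
        # message body: same fields as the SQL table, plus the full raw text
        msg = dict(result)
        producer.produce(
            "sandcrawler-{}.pdftotext-output".format(env),
            key=result["sha1hex"],
            value=json.dumps(msg),
        )
        producer.flush()
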
## New Minio Stuff

    /pdftotext/<hexbyte0>/<hexbyte1>/<sha1hex>.txt

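The object path follows directly from the sha1hex; a small sketch of the path
construction (the helper name is hypothetical):

    def pdftotext_minio_path(sha1hex: str) -> str:
        """Shard extracted text objects by the first two hex bytes of the SHA-1."""
        sha1hex = sha1hex.lower()
        assert len(sha1hex) == 40
        return "/pdftotext/{}/{}/{}.txt".format(sha1hex[0:2], sha1hex[2:4], sha1hex)
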
## Open Questions