proposals/2019_pdftotext_pdfinfo.md


status: brainstorming/backburner

last updated: 2019-12-11

This document proposes extracting text and metadata from PDFs at ingest time
using `pdftotext` and `pdfinfo`, and storing the results in SQL and minio.

This isn't a priority at the moment. The extracted text could be useful for
fulltext search when GROBID fails, and the `pdfinfo` output might help with
other quality checks.

## Overview / Motivation

`pdfinfo` and `pdftotext` can both be run quickly over raw PDFs. In
sandcrawler, fetching PDFs can be a bit slow, so the motivation for caching the
text is to avoid fetching the same PDFs over and over. The metadata is also
useful to store and index at scale.
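
A minimal sketch of how both tools might be invoked from Python, assuming the
poppler `pdfinfo` and `pdftotext` binaries are on the `PATH`. The helper names
and the 30 second timeout are hypothetical, not existing sandcrawler code.

```python
import subprocess
from typing import List, Optional


def pdfinfo_argv(pdf_path: str) -> List[str]:
    """Command line for the key/value summary that pdfinfo prints."""
    return ["pdfinfo", pdf_path]


def pdftotext_argv(pdf_path: str) -> List[str]:
    """Command line for text extraction; "-" sends the text to stdout."""
    return ["pdftotext", pdf_path, "-"]


def run_tool(argv: List[str], timeout: float = 30.0) -> Optional[str]:
    """Run a tool and return its stdout, or None on any failure
    (missing binary, non-zero exit status, or timeout)."""
    try:
        proc = subprocess.run(argv, capture_output=True, timeout=timeout, check=True)
    except (OSError, subprocess.SubprocessError):
        return None
    return proc.stdout.decode("utf-8", errors="replace")
```

Returning `None` on failure is one way to feed the `text_success` and
`info_success` style of bookkeeping without raising in the ingest path.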

## pdfinfo output

Example PDF info outputs:

    Creator:        PDF Suite 2010
    Producer:       PDF Suite 2010
    CreationDate:   Tue Sep 24 23:03:58 2013 PDT
    ModDate:        Tue Sep 24 23:03:58 2013 PDT
    Tagged:         no
    UserProperties: no
    Suspects:       no
    Form:           none
    JavaScript:     no
    Pages:          17
    Encrypted:      no
    Page size:      612 x 792 pts (letter)
    Page rot:       0
    File size:      105400 bytes
    Optimized:      no
    PDF version:    1.4

Another:

    Title:          Miscellanea Zoologica Hungarica 8. 1993 (Budapest, 1993)
    Author:         L. Forró szerk.
    Producer:       ABBYY FineReader 9.0 Corporate Edition
    CreationDate:   Wed Apr 13 05:30:21 2011 PDT
    ModDate:        Wed Apr 13 09:53:27 2011 PDT
    Tagged:         yes
    UserProperties: no
    Suspects:       no
    Form:           AcroForm
    JavaScript:     no
    Pages:          13
    Encrypted:      no
    Page size:      473.76 x 678.42 pts
    Page rot:       0
    File size:      12047270 bytes
    Optimized:      no
    PDF version:    1.6
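
The key/value output above is easy to turn into a dict. The following is a
rough sketch, assuming every line has the `Key: value` shape shown in the
examples; `parse_pdfinfo` is a hypothetical helper name.

```python
from typing import Dict


def parse_pdfinfo(output: str) -> Dict[str, str]:
    """Split pdfinfo's "Key:   value" lines into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only: values such as dates contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


sample = """\
Creator:        PDF Suite 2010
CreationDate:   Tue Sep 24 23:03:58 2013 PDT
Form:           none
Pages:          17
Page size:      612 x 792 pts (letter)
"""
info = parse_pdfinfo(sample)
pages = int(info["Pages"])
```

All values stay strings here; converting `Pages` or `CreationDate` to typed
values would be a separate step.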

With the `-meta` flag, you get XML output, which also includes:

    <xmpMM:DocumentID>uuid:cd1a8daa-61e1-48f4-b679-26eac52bb6a9</xmpMM:DocumentID>
    <xmpMM:InstanceID>uuid:dea54c78-8bc6-4f2f-a665-4cd7e62457e7</xmpMM:InstanceID>

The document id is particularly interesting for fatcat/sandcrawler. Apparently
it is generated randomly (or derived from an md5?) for the first version of the
file, and persists across edits. A quality check would be that all files with
the same `document_id` should be clustered under the same fatcat work.
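
A sketch of that quality check, under some assumptions: the XMP packet carries
the id as an element exactly as in the snippet above (some files may serialize
it differently), and the caller can supply `(sha1hex, document_id, work_id)`
tuples. Both function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

DOCUMENT_ID_RE = re.compile(
    r"<xmpMM:DocumentID>\s*([^<]+?)\s*</xmpMM:DocumentID>"
)


def extract_document_id(xmp: str) -> Optional[str]:
    """Return the xmpMM:DocumentID element text, or None if it is absent."""
    match = DOCUMENT_ID_RE.search(xmp)
    return match.group(1) if match else None


def split_document_ids(
    rows: Iterable[Tuple[str, str, str]]
) -> Dict[str, Set[str]]:
    """Given (sha1hex, document_id, work_id) rows, return the document ids
    whose files are spread over more than one work."""
    works = defaultdict(set)
    for _sha1hex, document_id, work_id in rows:
        works[document_id].add(work_id)
    return {doc: ids for doc, ids in works.items() if len(ids) > 1}
```

Any document id returned by `split_document_ids` would be a candidate for
manual review or a work merge.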

All the info fields could probably be combined and used in categorization and
filtering (ML or heuristic). E.g., a PDF with forms is probably not research
output, while published PDFs with specific "Producer" software probably are.
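
A heuristic of this kind might look like the sketch below. It assumes the
pdfinfo fields have already been parsed into a dict of strings; the function
name and the `trusted_producers` parameter are hypothetical, and no real list
of "good" producers is implied.

```python
from typing import Dict, Iterable, Optional


def guess_research_output(
    info: Dict[str, str], trusted_producers: Iterable[str] = ()
) -> Optional[bool]:
    """Return False for PDFs with forms, True for a trusted Producer,
    and None when neither signal applies."""
    if info.get("Form", "none").lower() != "none":
        return False
    producer = info.get("Producer", "").lower()
    if any(name.lower() in producer for name in trusted_producers):
        return True
    return None
```

Note that the second example above (`Form: AcroForm`) would be rejected by the
forms rule, so these signals are probably better treated as features to be
combined than as hard filters.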

## Fatcat Changes

The following could be included as entity fields, in a `pdfinfo` JSONB field,
or in the existing `extra` field:

- pages
- words
- document id
- page size
- created
- other meta (eg, PDF title, author, etc)

All of these fields are, I assume, deterministic, thus appropriate for
inclusion in fatcat.

## New SQL Tables

    CREATE TABLE IF NOT EXISTS pdftotext (
        sha1hex             TEXT PRIMARY KEY CHECK (octet_length(sha1hex) = 40),
        updated             TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        tool_version        TEXT CHECK (octet_length(tool_version) >= 1),
        text_success        BOOLEAN NOT NULL,
        text_words          INT,
        info_success        BOOLEAN NOT NULL,
        pages               INT,
        pdf_created         TIMESTAMP WITH TIME ZONE,
        document_id         TEXT CHECK (octet_length(document_id) >= 1), -- XXX: always UUID?
        metadata            JSONB
        -- metadata contains any other stuff from pdfinfo:
        --  title
        --  author
        --  pdf version
        --  page size (?)
        --  instance_id
    );
    -- index name must differ from the table name
    -- CREATE INDEX pdftotext_document_id_idx ON pdftotext(document_id);

## New Kafka Topics

    sandcrawler-ENV.pdftotext-output

Key would be sha1hex of PDF.

Schema would match the SQL table, plus the full raw PDF text output.
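
As a sketch of what such a message could look like, the helper below builds a
key/value pair as bytes. The column names come from the SQL table above; the
`text` field name, the whitespace-based word count, and the function name are
assumptions.

```python
import json
from typing import Any, Dict, Optional, Tuple


def pdftotext_message(
    sha1hex: str, text: Optional[str], row: Dict[str, Any]
) -> Tuple[bytes, bytes]:
    """Build a (key, value) pair: key is the sha1hex, value is JSON with the
    SQL columns plus the raw text (None if extraction failed)."""
    if len(sha1hex) != 40:
        raise ValueError("sha1hex must be 40 hex characters")
    record = dict(row)  # copy, so the caller's dict is not modified
    record["sha1hex"] = sha1hex
    record["text_success"] = text is not None
    # Assumption: "words" means whitespace-separated tokens.
    record["text_words"] = len(text.split()) if text is not None else None
    record["text"] = text
    key = sha1hex.encode("utf-8")
    value = json.dumps(record, sort_keys=True).encode("utf-8")
    return key, value
```

A consumer persisting to SQL would drop the `text` field and write the rest
as one row.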

## New Minio Stuff

    /pdftotext/<hexbyte0>/<hexbyte1>/<sha1hex>.txt
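
A small sketch of deriving that object path, assuming `<hexbyte0>` and
`<hexbyte1>` are the first and second two-character hex pairs of the
lowercase `sha1hex`. The function name is hypothetical.

```python
def pdftotext_path(sha1hex: str) -> str:
    """Map a SHA-1 hex digest to its sharded text object path."""
    sha1hex = sha1hex.lower()
    if len(sha1hex) != 40 or any(c not in "0123456789abcdef" for c in sha1hex):
        raise ValueError("expected a 40-character hex SHA-1")
    return f"/pdftotext/{sha1hex[0:2]}/{sha1hex[2:4]}/{sha1hex}.txt"
```

For example, `da39a3ee5e6b4b0d3255bfef95601890afd80709` would map to
`/pdftotext/da/39/da39a3ee5e6b4b0d3255bfef95601890afd80709.txt`.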

## Open Questions