status: in progress

PDF Trio (ML Classification)
==============================

This document describes how we intend to integrate the first generation of PDF
classification work into the sandcrawler processing system. It covers:

- abstractions (APIs)
- schemas
- how models and dependencies are deployed
- what code is released where, and under what license


## Code Structure

Major components:

**Training code, documentation, datasets:** Not used at run-time (does not need
to be deployed). Should be public. The datasets (PDFs) are copyrighted, so we
should only release URL lists that point to wayback.

**Models:** All are static files uploaded to archive.org items, so deployment is
a simple download. Models should be versioned and have unique versioned file
names or directory paths, so that multiple versions can be deployed in parallel.
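
A minimal sketch of how unique versioned directory paths could be built. The
base directory, the model names, and the `<base>/<model>/<date>/` layout are
all assumptions made for illustration; none of them come from the actual
deployment.

```python
from datetime import date
from pathlib import PurePosixPath


def versioned_model_dir(base: str, model_name: str, models_date: str) -> PurePosixPath:
    """Return a directory path that is unique per model and per version.

    Hypothetical layout: <base>/<model_name>/<models_date>/
    """
    if not model_name or "/" in model_name:
        raise ValueError(f"bad model name: {model_name!r}")
    # Raises ValueError unless the version is an ISO date (YYYY-MM-DD),
    # which also keeps versions sortable as plain strings.
    date.fromisoformat(models_date)
    return PurePosixPath(base) / model_name / models_date
```

Because each version lands in its own directory, an old and a new model can sit
side by side and a serving config can point at either one.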

**Image classifier backend:** vanilla TensorFlow Serving Docker image, with a
bunch of invocation configs, plus static models.

**BERT backend:** vanilla TensorFlow Serving Docker image, plus config, plus
models. Basically same as image classifier.

**API service:** currently Flask. Depends on tools like ImageMagick, fastText,
and pdftotext. Seems like apt+pipenv should work?


## API Refactors

Changes:

- probably re-write README?
- refactor python code into directories
- add python tests
- tweak schema
- proper parallelization: uwsgi? async?

New features:

- option to send images, raw text in batches in addition to PDFs.

## Client Code

Basically just like the GROBID client for now: HTTP via `requests`, JSON
responses.
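
A rough sketch of what such a client might look like. The endpoint path
(`/classify`), the form field name (`pdf_content`), the `error-connect` status
string, and the `-1` status code are all hypothetical placeholders, and the
sketch assumes the API returns score fields under the same names used in the
JSON schema. The session is injectable so the class can be exercised without a
running service.

```python
from typing import Any, Dict, Optional

# Fields copied from the API response, assuming it uses the schema's names.
RESULT_KEYS = ("ensemble_score", "bert_score", "image_score", "linear_score", "versions")


class PdfTrioClient:
    """Minimal HTTP client sketch, loosely in the style of a GROBID client."""

    def __init__(self, host_url: str, session: Optional[Any] = None, timeout: float = 30.0):
        self.host_url = host_url.rstrip("/")
        self.timeout = timeout
        if session is None:
            # Imported lazily so a stub session can be used in tests.
            import requests
            session = requests.Session()
        self.session = session

    def classify_pdf(self, blob: bytes) -> Dict[str, Any]:
        """POST a PDF and return a dict shaped like the `pdf_trio` object."""
        try:
            resp = self.session.post(
                self.host_url + "/classify",   # hypothetical endpoint path
                files={"pdf_content": blob},   # hypothetical form field name
                timeout=self.timeout,
            )
        except OSError:
            # requests' exceptions derive from OSError; report a status
            # instead of crashing the worker. Both values are placeholders.
            return {"status": "error-connect", "status_code": -1}
        info: Dict[str, Any] = {"status_code": resp.status_code}
        if resp.status_code == 200:
            info["status"] = "success"
            body = resp.json()
            for key in RESULT_KEYS:
                if key in body:
                    info[key] = body[key]
        else:
            info["status"] = "error"
        return info
```

Returning a status dict on connection failure, rather than raising, keeps every
processed file representable as one output record.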

## JSON Schema

Output that goes in the Kafka topic:

    key (sha1hex)
    pdf_trio
        status
        status_code
        ensemble_score
        bert_score
        image_score
        linear_score
        versions
            pdftrio_version (string)
            models_date (string, ISO date)
            git_rev (string)
            bert_model (string)
            image_model (string)
            linear_model (string)
        timing (optional/future: as reported by API)
            ...
    file_meta
        sha1hex
        ...
    timing
        ...
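
As an illustration of the record layout above, a small sketch that wraps a
`pdf_trio` result into an output record. The function name is hypothetical, and
the fields shown as `...` in the schema (the rest of `file_meta`, and `timing`)
are left out rather than guessed.

```python
import hashlib
import json
from typing import Any, Dict


def build_output_record(blob: bytes, pdf_trio: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a classification result in the top-level record structure."""
    sha1hex = hashlib.sha1(blob).hexdigest()
    return {
        "key": sha1hex,
        "pdf_trio": pdf_trio,
        # Only sha1hex is filled in; other file_meta fields are elided
        # in the schema and so are not invented here.
        "file_meta": {"sha1hex": sha1hex},
    }


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    example = build_output_record(
        b"%PDF-1.4 example bytes",
        {"status": "success", "status_code": 200, "ensemble_score": 0.5},
    )
    print(json.dumps(example, indent=2, sort_keys=True))
```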


## SQL Schema

Ensemble model versions are summarized as a single date (`models_date`).

    CREATE TABLE IF NOT EXISTS pdftrio (
        sha1hex             TEXT PRIMARY KEY CHECK (octet_length(sha1hex) = 40),
        updated             TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        status_code         INT NOT NULL,
        status              TEXT CHECK (octet_length(status) >= 1) NOT NULL,
        pdftrio_version     TEXT CHECK (octet_length(pdftrio_version) >= 1),
        models_date         DATE,
        ensemble_score      REAL,
        bert_score          REAL,
        linear_score        REAL,
        image_score         REAL
    );
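
A sketch of how a JSON output record could be flattened into a row for this
table. The helper name is hypothetical, the `%s` placeholders assume a
psycopg2-style database driver, and conflict handling for repeated `sha1hex`
values is deliberately left out because the intended behavior is not specified
here.

```python
from typing import Any, Dict, Tuple

# Columns written explicitly; `updated` is left to its DEFAULT now().
COLUMNS = (
    "sha1hex", "status_code", "status", "pdftrio_version", "models_date",
    "ensemble_score", "bert_score", "linear_score", "image_score",
)

# Placeholder style assumes a psycopg2-like driver.
INSERT_SQL = "INSERT INTO pdftrio ({}) VALUES ({})".format(
    ", ".join(COLUMNS), ", ".join(["%s"] * len(COLUMNS))
)


def record_to_row(record: Dict[str, Any]) -> Tuple[Any, ...]:
    """Flatten one output record into a tuple ordered like COLUMNS."""
    trio = record["pdf_trio"]
    versions = trio.get("versions") or {}
    return (
        record["key"],
        trio["status_code"],          # NOT NULL in the table
        trio["status"],               # NOT NULL in the table
        versions.get("pdftrio_version"),
        versions.get("models_date"),
        trio.get("ensemble_score"),   # scores may be absent on errors
        trio.get("bert_score"),
        trio.get("linear_score"),
        trio.get("image_score"),
    )
```

Missing scores and version fields become `None`, which maps to SQL `NULL` for
the nullable columns.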

## Kafka Topic

sandcrawler-qa.pdftrio-output
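
A small sketch of how a record could be encoded for this topic. Using the
record's `key` (the sha1hex) as the Kafka message key is an assumption, as is
the helper name; the actual producer call is omitted since it depends on the
Kafka library in use.

```python
import json
from typing import Any, Dict, Tuple

TOPIC = "sandcrawler-qa.pdftrio-output"


def encode_message(record: Dict[str, Any]) -> Tuple[bytes, bytes]:
    """Return (key, value) as UTF-8 bytes, ready to hand to a producer."""
    # Assumption: messages are keyed by sha1hex.
    key = record["key"].encode("utf-8")
    value = json.dumps(record, sort_keys=True).encode("utf-8")
    return key, value
```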