status: deployed

NOTE: while this has been used in production, as of December 2022 the results
are not used much in practice, and we don't score every PDF that comes along

PDF Trio (ML Classification)
==============================

This document describes how we intend to integrate the first generation of PDF
classification work into the sandcrawler processing system, covering:

- abstractions (APIs)
- schemas
- how models and dependencies are deployed
- what code is released where, and under what license


## Code Structure

Major components:

**Training code, documentation, datasets:** Not used at run-time (does not need
to be deployed). Should be public. The datasets (PDFs) are copyrighted, so we
should only release URL lists that point to wayback.

**Models:** all are static, uploaded to archive.org items, simple download to
deploy. Should be versioned, and have unique versioned file names or directory
paths (aka, deploy in parallel).

**Image classifier backend:** vanilla tensorflow serving docker image, with a
bunch of invocation configs, plus static models.

**BERT backend:** vanilla tensorflow serving docker image, plus config, plus
models. Basically same as image classifier.

**API service:** currently Flask. Depends on tools like ImageMagick, fasttext,
and pdftotext. Seems like apt+pipenv should work?
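A sketch of what the apt+pipenv deployment could look like as a Dockerfile. This is only an illustration of the idea: the package names (`poppler-utils` provides `pdftotext`), the Python version, and the `app.py` entrypoint are all assumptions, not tested choices.

```dockerfile
# Sketch only: apt packages for native tools, pipenv for Python deps.
FROM python:3.7-slim

# imagemagick for image conversion, poppler-utils for pdftotext
RUN apt-get update && apt-get install -y --no-install-recommends \
        imagemagick poppler-utils \
    && rm -rf /var/lib/apt/lists/*

RUN pip install pipenv

WORKDIR /app
COPY Pipfile Pipfile.lock /app/
# --system installs into the image's python, so no `pipenv run` needed later
RUN pipenv install --deploy --system

COPY . /app
# entrypoint name is hypothetical
CMD ["python", "app.py"]
```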


## API Refactors

Changes:

- probably re-write README?
- refactor python code into directories
- add python tests
- tweak schema
- proper parallelization: uwsgi? async?

New features:

- option to send images and raw text in batches, in addition to PDFs

## Client Code

Basically just like the GROBID client for now: requests + JSON.
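A minimal sketch of such a client. The host/port default and the endpoint path are hypothetical placeholders (the real API may differ); the `response_to_row()` helper shows how a response could be flattened into the schema described below.

```python
PDFTRIO_API = "http://localhost:3939"  # hypothetical default host:port


def classify_pdf(blob, host_url=PDFTRIO_API):
    """POST a PDF blob to the pdf_trio API and return the parsed JSON scores.

    The endpoint path is a guess in the style of the GROBID client; adjust
    to whatever the actual API exposes.
    """
    import requests  # third-party; imported locally so helpers stay stdlib-only

    resp = requests.post(host_url + "/classify/pdf", data=blob, timeout=30.0)
    resp.raise_for_status()
    return resp.json()


def response_to_row(sha1hex, api_response):
    """Flatten a successful API response into the Kafka message shape."""
    return {
        "key": sha1hex,
        "pdf_trio": {
            "status": "success",
            "status_code": 200,
            "ensemble_score": api_response.get("ensemble_score"),
            "bert_score": api_response.get("bert_score"),
            "image_score": api_response.get("image_score"),
            "linear_score": api_response.get("linear_score"),
        },
    }
```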

## JSON Schema

Output that goes in Kafka topic:

    key (sha1hex)
    pdf_trio
        status
        status_code
        ensemble_score
        bert_score
        image_score
        linear_score
        versions
            pdftrio_version (string)
            models_date (string, ISO date)
            git_rev (string)
            bert_model (string)
            image_model (string)
            linear_model (string)
        timing (optional/future: as reported by API)
            ...
    file_meta
        sha1hex
        ...
    timing
        ...
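The outline above might serialize to something like the following. All values here are purely illustrative (made-up hash, scores, and version strings), and the elided fields are left out:

```json
{
  "key": "0123456789abcdef0123456789abcdef01234567",
  "pdf_trio": {
    "status": "success",
    "status_code": 200,
    "ensemble_score": 0.94,
    "bert_score": 0.92,
    "image_score": 0.87,
    "linear_score": 0.91,
    "versions": {
      "pdftrio_version": "0.1.0",
      "models_date": "2020-02-07",
      "git_rev": "abcd123",
      "bert_model": "bert-20200122",
      "image_model": "image-20200115",
      "linear_model": "linear-20200110"
    }
  },
  "file_meta": {
    "sha1hex": "0123456789abcdef0123456789abcdef01234567"
  }
}
```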


## SQL Schema

Ensemble model versions are summarized as a date.

    CREATE TABLE IF NOT EXISTS pdftrio (
        sha1hex             TEXT PRIMARY KEY CHECK (octet_length(sha1hex) = 40),
        updated             TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        status_code         INT NOT NULL,
        status              TEXT CHECK (octet_length(status) >= 1) NOT NULL,
        pdftrio_version     TEXT CHECK (octet_length(pdftrio_version) >= 1),
        models_date         DATE,
        ensemble_score      REAL,
        bert_score          REAL,
        linear_score        REAL,
        image_score         REAL
    );
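Since `sha1hex` is the primary key, re-processing a PDF would presumably upsert. A sketch of what that could look like against the schema above (values are illustrative):

```sql
-- Illustrative upsert for a single classification result.
INSERT INTO pdftrio (sha1hex, status_code, status, pdftrio_version,
                     models_date, ensemble_score, bert_score,
                     linear_score, image_score)
VALUES ('0123456789abcdef0123456789abcdef01234567', 200, 'success',
        '0.1.0', '2020-02-07', 0.94, 0.92, 0.91, 0.87)
ON CONFLICT (sha1hex) DO UPDATE SET
    updated         = now(),
    status_code     = EXCLUDED.status_code,
    status          = EXCLUDED.status,
    pdftrio_version = EXCLUDED.pdftrio_version,
    models_date     = EXCLUDED.models_date,
    ensemble_score  = EXCLUDED.ensemble_score,
    bert_score      = EXCLUDED.bert_score,
    linear_score    = EXCLUDED.linear_score,
    image_score     = EXCLUDED.image_score;
```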

## Kafka Topic

sandcrawler-qa.pdftrio-output