status: in progress
PDF Trio (ML Classification)
==============================
This document describes how we intend to integrate the first generation of PDF
classification work into the sandcrawler processing system. It covers:

- abstractions (APIs)
- schemas
- how models and dependencies are deployed
- what code is released where, and under what license
## Code Structure
Major components:
**Training code, documentation, datasets:** Not used at run-time (does not need
to be deployed). Should be public. The datasets (PDFs) are copyrighted, so we
should only release URL lists that point to wayback.
**Models:** all are static and uploaded to archive.org items, so deploying is a
simple download. Should be versioned, with unique versioned file names or
directory paths (so that multiple versions can be deployed in parallel).
**Image classifier backend:** vanilla TensorFlow Serving Docker image, with a
bunch of invocation configs, plus static models.
**BERT backend:** vanilla TensorFlow Serving Docker image, plus config, plus
models. Basically the same as the image classifier.
**API service:** currently Flask. Depends on tools like ImageMagick, fastText,
and pdftotext. Seems like apt+pipenv should work?
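
Since both model backends are stock TensorFlow Serving containers, the API service could reach them over TensorFlow Serving's REST predict endpoint. The sketch below only builds the URL and request body. The host, the model names (`image_model`, `bert_model`), the pinned version number, and the shape of each instance are assumptions for illustration, and the real service may use gRPC instead.

```python
import json


def predict_url(host, model_name, version=None, port=8501):
    """Build a TensorFlow Serving REST predict URL.

    8501 is TensorFlow Serving's default REST port. Pinning `version`
    lets several model versions be served side by side.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}"
    if version is not None:
        url += f"/versions/{version}"
    return url + ":predict"


def predict_body(instances):
    """Serialize inputs in the row-oriented {"instances": [...]} format."""
    return json.dumps({"instances": instances})


# Hypothetical model names and inputs, for illustration only.
image_url = predict_url("localhost", "image_model")
bert_url = predict_url("localhost", "bert_model", version=2)
body = predict_body([[0.0, 0.5, 1.0]])
```

A successful predict call returns a JSON object with a `predictions` list, one entry per instance sent.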
## API Refactors
Changes:
- probably re-write README?
- refactor python code into directories
- add python tests
- tweak schema
- proper parallelization: uwsgi? async?
New features:
- option to send images and raw text in batches, in addition to PDFs.
## Client Code
Basically just like the GROBID client for now: Requests and JSON.
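
A minimal sketch of what such a client might look like. The endpoint URL, the multipart field name `pdf_content`, and the `error-connect` status string are assumptions, not the actual API. The HTTP session is passed in so that a `requests.Session()` can be used in production and a stub in tests.

```python
def classify_pdf(session, api_url, pdf_bytes, timeout=60.0):
    """POST one PDF to the classification API and return a status dict.

    `session` is any object with a requests-style `post()` method,
    for example `requests.Session()`.
    """
    try:
        resp = session.post(
            api_url,
            files={"pdf_content": pdf_bytes},  # hypothetical field name
            timeout=timeout,
        )
    except OSError as exc:
        # requests' exceptions derive from IOError/OSError, so connection
        # failures and timeouts are caught here.
        return {"status": "error-connect", "status_code": 500, "error_msg": str(exc)}

    info = {"status_code": resp.status_code}
    if resp.status_code == 200:
        info["status"] = "success"
        info.update(resp.json())  # scores and versions as reported by the API
    else:
        info["status"] = "error"
    return info
```

Returning a dict with `status` and `status_code` in every case, including failures, keeps the result easy to drop into the output schema.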
## JSON Schema
Output that goes in Kafka topic:

    key (sha1hex)
    pdf_trio
        status
        status_code
        ensemble_score
        bert_score
        image_score
        linear_score
        versions
            pdftrio_version (string)
            models_date (string, ISO date)
            git_rev (string)
            bert_model (string)
            image_model (string)
            linear_model (string)
        timing (optional/future: as reported by API)
            ...
    file_meta
        sha1hex
        ...
    timing
        ...
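
To make the layout concrete, here is a sketch that assembles one record and encodes it as a Kafka key/value pair. All field values (scores, version strings, model names, the hash) are placeholders, and the helper names `build_record` and `to_kafka_message` are hypothetical.

```python
import json


def build_record(sha1hex, pdf_trio, file_meta=None, timing=None):
    """Assemble one output record following the field layout above."""
    if len(sha1hex) != 40:
        raise ValueError("sha1hex must be 40 hex characters")
    record = {"key": sha1hex, "pdf_trio": pdf_trio}
    if file_meta is not None:
        record["file_meta"] = file_meta
    if timing is not None:
        record["timing"] = timing
    return record


def to_kafka_message(record):
    """Return (key, value) as bytes: sha1hex key, JSON-encoded value."""
    key = record["key"].encode("utf-8")
    value = json.dumps(record, sort_keys=True).encode("utf-8")
    return key, value


# Placeholder values only; the hash is the SHA-1 of empty input.
sha1 = "da39a3ee5e6b4b0d3255bfef95601890afd80709"
example = build_record(
    sha1,
    {
        "status": "success",
        "status_code": 200,
        "ensemble_score": 0.5,
        "bert_score": 0.5,
        "image_score": 0.5,
        "linear_score": 0.5,
        "versions": {
            "pdftrio_version": "0.0.0",
            "models_date": "2020-01-01",
            "git_rev": "0000000",
            "bert_model": "bert_placeholder",
            "image_model": "image_placeholder",
            "linear_model": "linear_placeholder",
        },
    },
    file_meta={"sha1hex": sha1},
)
key, value = to_kafka_message(example)
```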
## SQL Schema
Ensemble model versions are summarized as a date.

    CREATE TABLE IF NOT EXISTS pdftrio (
        sha1hex             TEXT PRIMARY KEY CHECK (octet_length(sha1hex) = 40),
        updated             TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        status_code         INT NOT NULL,
        status              TEXT CHECK (octet_length(status) >= 1) NOT NULL,
        pdftrio_version     TEXT CHECK (octet_length(pdftrio_version) >= 1),
        models_date         DATE,
        ensemble_score      REAL,
        bert_score          REAL,
        linear_score        REAL,
        image_score         REAL
    );
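
One way the nested JSON output could be flattened into a row for this table is sketched below. The function name `pdftrio_row`, the upsert policy (overwrite on conflict and refresh `updated`), and the `%s` placeholder style (as used by psycopg2) are assumptions, not a description of the actual persist code.

```python
import datetime

# Column order for the insert; `updated` is left to its DEFAULT now().
COLUMNS = (
    "sha1hex", "status_code", "status", "pdftrio_version", "models_date",
    "ensemble_score", "bert_score", "linear_score", "image_score",
)

UPSERT_SQL = (
    "INSERT INTO pdftrio (" + ", ".join(COLUMNS) + ") "
    "VALUES (" + ", ".join(["%s"] * len(COLUMNS)) + ") "
    "ON CONFLICT (sha1hex) DO UPDATE SET "
    + ", ".join(f"{c} = EXCLUDED.{c}" for c in COLUMNS[1:])
    + ", updated = now()"
)


def pdftrio_row(record):
    """Flatten one JSON output record into a tuple matching COLUMNS."""
    trio = record["pdf_trio"]
    versions = trio.get("versions") or {}
    models_date = versions.get("models_date")
    if models_date is not None:
        # The JSON schema carries an ISO date string; the column is DATE.
        models_date = datetime.date.fromisoformat(models_date)
    return (
        record["key"],
        trio["status_code"],
        trio["status"],
        versions.get("pdftrio_version"),
        models_date,
        trio.get("ensemble_score"),
        trio.get("bert_score"),
        trio.get("linear_score"),
        trio.get("image_score"),
    )
```

With a DB-API cursor this would be used as `cursor.execute(UPSERT_SQL, pdftrio_row(record))`. Scores are read with `.get()` so that failed classifications, which may carry no scores, still produce a valid row with NULLs.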
## Kafka Topic
sandcrawler-qa.pdftrio-output