status: deployed

NOTE: while this has been used in production, as of December 2022 the results
are not used much in practice, and we don't score every PDF that comes along.

PDF Trio (ML Classification)
==============================
This document describes how we intend to integrate the first generation of PDF
classification work into the sandcrawler processing system.

- abstractions (APIs)
- schemas
- how models and dependencies are deployed
- what code is released where, and under what license
## Code Structure

Major components:

**Training code, documentation, datasets:** Not used at run-time (does not need
to be deployed). Should be public. The datasets (PDFs) are copyrighted, so we
should only release URL lists that point to wayback.

**Models:** all are static, uploaded to archive.org items, simple download to
deploy. Should be versioned, and have unique versioned file names or directory
paths (so that multiple versions can be deployed in parallel).

**Image classifier backend:** vanilla tensorflow serving docker image, with a
bunch of invocation configs, plus static models.

**BERT backend:** vanilla tensorflow serving docker image, plus config, plus
models. Basically same as image classifier.

**API service:** currently Flask. Depends on tools like ImageMagick, fastText,
pdftotext. Seems like apt+pipenv should work?
## API Refactors

Changes:

- probably re-write README?
- refactor python code into directories
- add python tests
- tweak schema
- proper parallelization: uwsgi? async?

New features:

- option to send images, raw text in batches in addition to PDFs.
## Client Code

Basically just like the GROBID client for now: plain HTTP requests, JSON
responses.
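
A minimal sketch of what such a client could look like, not the actual
sandcrawler implementation. The `/classify` path, the `pdf_content` form field,
the `classify_pdf` name, and the use of the `requests` library are all
assumptions made for illustration. The `post` argument is injectable so the
function can be exercised without a running API service.

```python
import hashlib
from typing import Any, Callable, Dict, Optional


def classify_pdf(
    blob: bytes,
    host_url: str,
    post: Optional[Callable[..., Any]] = None,
    timeout: float = 30.0,
) -> Dict[str, Any]:
    """POST one PDF to the classification API and wrap the JSON response."""
    if post is None:
        # Assumption: the third-party `requests` library is the HTTP client.
        import requests

        post = requests.post

    # Hypothetical endpoint path and form field name.
    resp = post(host_url + "/classify", files={"pdf_content": blob}, timeout=timeout)

    info: Dict[str, Any] = {}
    if resp.status_code == 200:
        # Scores and versions are taken from the API response as-is.
        info.update(resp.json())
        info["status"] = "success"
    else:
        info["status"] = "error"
    info["status_code"] = resp.status_code

    # Keyed by the SHA-1 of the PDF bytes, matching the `key (sha1hex)` field.
    return {"key": hashlib.sha1(blob).hexdigest(), "pdf_trio": info}
```

Timeouts and connection errors are not handled here; a real client would
probably want to map those to their own `status` values.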
## JSON Schema

Output that goes in Kafka topic:

    key (sha1hex)
    pdf_trio
        status
        status_code
        ensemble_score
        bert_score
        image_score
        linear_score
        versions
            pdftrio_version (string)
            models_date (string, ISO date)
            git_rev (string)
            bert_model (string)
            image_model (string)
            linear_model (string)
        timing (optional/future: as reported by API)
            ...
    file_meta
        sha1hex
        ...
    timing
        ...
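
To make the nesting concrete, here is a hedged sketch that assembles one such
message as a Python dict and serializes it to JSON. The helper name
`build_pdftrio_message` and every value (hash, scores, version strings) are
made-up placeholders. The elided `timing` fields and the rest of `file_meta`
are left out because their contents are not specified here.

```python
import json
from typing import Any, Dict, Optional

SCORE_FIELDS = ("ensemble_score", "bert_score", "image_score", "linear_score")


def build_pdftrio_message(
    sha1hex: str,
    status: str,
    status_code: int,
    scores: Optional[Dict[str, float]] = None,
    versions: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Assemble a message with the nesting described in the schema above."""
    pdf_trio: Dict[str, Any] = {"status": status, "status_code": status_code}
    for name in SCORE_FIELDS:
        # Missing scores are stored as None (JSON null).
        pdf_trio[name] = (scores or {}).get(name)
    pdf_trio["versions"] = dict(versions or {})
    return {
        "key": sha1hex,
        "pdf_trio": pdf_trio,
        "file_meta": {"sha1hex": sha1hex},
    }


# Placeholder values only; none of these come from a real classification run.
message = build_pdftrio_message(
    sha1hex="0123456789abcdef0123456789abcdef01234567",
    status="success",
    status_code=200,
    scores={"ensemble_score": 0.95, "bert_score": 0.97},
    versions={"pdftrio_version": "0.1.0", "models_date": "2020-01-01"},
)
encoded = json.dumps(message, sort_keys=True).encode("utf-8")
```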
## SQL Schema

Ensemble model versions are summarized as a date.

    CREATE TABLE IF NOT EXISTS pdftrio (
        sha1hex             TEXT PRIMARY KEY CHECK (octet_length(sha1hex) = 40),
        updated             TIMESTAMP WITH TIME ZONE DEFAULT now() NOT NULL,
        status_code         INT NOT NULL,
        status              TEXT CHECK (octet_length(status) >= 1) NOT NULL,
        pdftrio_version     TEXT CHECK (octet_length(pdftrio_version) >= 1),
        models_date         DATE,
        ensemble_score      REAL,
        bert_score          REAL,
        linear_score        REAL,
        image_score         REAL
    );
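
One possible way to get from the JSON message to a row in this table is to
flatten the nested fields and upsert on `sha1hex`. The sketch below is an
illustration under assumptions, not the persistence code actually used: the
`pdftrio_row` helper is hypothetical, and the `%(name)s` placeholders assume a
driver with pyformat-style parameters such as psycopg2. Only the SQL string is
built; nothing is executed against a database.

```python
from typing import Any, Dict

# Columns written explicitly; `updated` is refreshed by the statement itself.
PDFTRIO_COLUMNS = (
    "sha1hex",
    "status_code",
    "status",
    "pdftrio_version",
    "models_date",
    "ensemble_score",
    "bert_score",
    "linear_score",
    "image_score",
)

_cols = ", ".join(PDFTRIO_COLUMNS)
_params = ", ".join("%({})s".format(c) for c in PDFTRIO_COLUMNS)
_updates = ", ".join(
    "{0} = EXCLUDED.{0}".format(c) for c in PDFTRIO_COLUMNS if c != "sha1hex"
)

# PostgreSQL upsert: re-scoring a PDF overwrites the previous row.
UPSERT_SQL = (
    f"INSERT INTO pdftrio ({_cols}) VALUES ({_params}) "
    f"ON CONFLICT (sha1hex) DO UPDATE SET {_updates}, updated = now()"
)


def pdftrio_row(message: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one Kafka message into a dict keyed by table column name."""
    info = message["pdf_trio"]
    versions = info.get("versions") or {}
    row: Dict[str, Any] = {
        "sha1hex": message["key"],
        "status_code": info["status_code"],
        "status": info["status"],
        "pdftrio_version": versions.get("pdftrio_version"),
        # The set of model versions is summarized by this single date.
        "models_date": versions.get("models_date"),
    }
    for name in ("ensemble_score", "bert_score", "linear_score", "image_score"):
        row[name] = info.get(name)
    return row
```

With a psycopg2-style cursor, the row could then be written with something
like `cursor.execute(UPSERT_SQL, pdftrio_row(message))`.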
## Kafka Topic

    sandcrawler-qa.pdftrio-output