## High-Level
- operate on an entire item
- check against issue DB and/or fatcat search
=> if there is fatcat work-level metadata for this issue, skip
- fetch collection-level (journal) metadata
- iterate through djvu text file:
=> convert to simple text
=> filter out non-research pages using quick heuristics
=> try to look up the "real" page number from the OCR work (in item metadata)
- generate "heavy" intermediate schema (per valid page):
=> fatcat container metadata
=> ia collection (journal) metadata
=> item metadata
=> page fulltext and any metadata
- transform "heavy" intermediates to ES schema
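
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `HeavyPage` record, the function names, the heuristic thresholds, the `page_numbers` mapping, and the ES field names are all hypothetical, and `release_count` stands in for whatever the issue DB or fatcat search lookup returns.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Tuple


@dataclass
class HeavyPage:
    """Hypothetical "heavy" intermediate record for one valid page."""
    container: dict              # fatcat container metadata
    collection: dict             # IA collection (journal) metadata
    item: dict                   # IA item metadata
    page_index: int              # position of the page in the djvu text
    page_number: Optional[str]   # "real" page number, if one was found
    fulltext: str


def looks_like_research_page(text: str, min_words: int = 200,
                             min_alpha_ratio: float = 0.7) -> bool:
    """Quick heuristic (made-up thresholds): enough words, mostly alphabetic."""
    words = text.split()
    if len(words) < min_words:
        return False
    alpha = sum(1 for w in words if w.strip(".,;:()\"'").isalpha())
    return alpha / len(words) >= min_alpha_ratio


def build_heavy_pages(pages: Iterable[Tuple[int, str]], container: dict,
                      collection: dict, item: dict,
                      page_numbers: Dict[int, str], release_count: int = 0,
                      min_words: int = 200) -> Iterator[HeavyPage]:
    """Yield one HeavyPage per page that passes the filter."""
    # Skip the whole issue if fatcat already has work-level metadata for it.
    if release_count > 0:
        return
    for page_index, text in pages:
        if not looks_like_research_page(text, min_words=min_words):
            continue
        yield HeavyPage(
            container=container,
            collection=collection,
            item=item,
            page_index=page_index,
            page_number=page_numbers.get(page_index),  # may be None
            fulltext=text,
        )


def to_es_doc(page: HeavyPage) -> dict:
    """Flatten a heavy intermediate into a (hypothetical) ES document."""
    return {
        "key": f"{page.item['identifier']}_p{page.page_index}",
        "container_name": page.container.get("name"),
        "collection": page.collection.get("identifier"),
        "item": page.item["identifier"],
        "volume": page.item.get("volume"),
        "issue": page.item.get("issue"),
        "page_number": page.page_number,
        "body": page.fulltext,
    }
```

Keeping the "heavy" record separate from the ES document means the search schema can change without re-fetching or re-parsing items.
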
## Implementation
Existing tools and libraries:
- internetarchive python tool to fetch files and item metadata
- fatcat API client for container metadata lookup
New tools or libraries needed:
- issue DB, or the fatcat search index, to count releases by volume/issue
- djvu XML parser
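
A djvu XML parser could be as small as the sketch below, which streams the file and converts each page to simple text using only the Python standard library. It assumes the usual DjVu XML layout, with one `OBJECT` element per page containing `LINE` and `WORD` elements; that layout and the `iter_djvu_pages` name are assumptions to check against real item files.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def iter_djvu_pages(xml_file: BinaryIO) -> Iterator[Tuple[int, str]]:
    """Yield (page_index, text) for each OBJECT (page) element.

    Uses iterparse so a large issue is not loaded into memory at once.
    """
    page_index = 0
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag != "OBJECT":
            continue
        lines = []
        for line in elem.iter("LINE"):
            words = [w.text.strip() for w in line.iter("WORD")
                     if w.text and w.text.strip()]
            if words:
                lines.append(" ".join(words))
        yield page_index, "\n".join(lines)
        page_index += 1
        elem.clear()  # free the finished page's subtree


# Tiny hand-written sample in the assumed layout (not real OCR output).
SAMPLE = b"""<DjVuXML><BODY>
<OBJECT type="image/x.djvu">
<HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="1,2,3,4">Hello</WORD> <WORD coords="5,6,7,8">world</WORD></LINE>
<LINE><WORD coords="1,9,3,12">second</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
</OBJECT>
<OBJECT type="image/x.djvu"><HIDDENTEXT/></OBJECT>
</BODY></DjVuXML>"""

for index, text in iter_djvu_pages(io.BytesIO(SAMPLE)):
    print(index, repr(text))
```

Pages with no recognized text come back as empty strings, so a downstream filter can drop them.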