path: root/proposals/2020-05-11_microfilm_indexing_pipeline.md

## High-Level

- operate on an entire item
- check against issue DB and/or fatcat search
    => if there is fatcat work-level metadata for this issue, skip
- fetch collection-level (journal) metadata
- iterate through djvu text file:
    => convert to simple text
    => filter out non-research pages using quick heuristics
    => try looking up "real" page number from OCR work (in item metadata)
- generate "heavy" intermediate schema (per valid page):
    => fatcat container metadata
    => ia collection (journal) metadata
    => item metadata
    => page fulltext and any metadata
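
A minimal sketch of the page filter and the "heavy" per-page record described above. The field names, the `HeavyPage` class, and the thresholds in `looks_like_research_page` are all hypothetical placeholders chosen for illustration, not a settled schema or tuned heuristic.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class HeavyPage:
    """Hypothetical "heavy" intermediate record, one per valid page."""
    page_num: int                       # leaf index within the item
    page_label: Optional[str]           # "real" printed page number, if known
    fulltext: str                       # simple text extracted from the djvu file
    item_meta: Dict[str, Any] = field(default_factory=dict)
    collection_meta: Dict[str, Any] = field(default_factory=dict)
    container_meta: Dict[str, Any] = field(default_factory=dict)


def looks_like_research_page(text: str, min_words: int = 200,
                             min_alpha_ratio: float = 0.6) -> bool:
    """Quick heuristic: keep pages with enough words that are mostly letters.

    Both thresholds are guesses and would need tuning on real pages.
    """
    if len(text.split()) < min_words:
        return False
    non_space = sum(1 for c in text if not c.isspace())
    alpha = sum(1 for c in text if c.isalpha())
    return non_space > 0 and alpha / non_space >= min_alpha_ratio
```

Short pages (covers, blank leaves) and pages dominated by digits or punctuation (tables, indexes) are dropped by this check; anything subtler would need additional rules.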

- transform "heavy" intermediates to ES schema
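
The transform step might look roughly like the following, assuming the heavy intermediate is a plain dict with `item_meta`, `collection_meta`, `container_meta`, `page_num`, `page_label`, and `fulltext` keys. Those keys and every output field name here are assumptions for illustration; the actual ES schema is not specified in this proposal.

```python
from typing import Any, Dict


def heavy_to_es(heavy: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one heavy per-page record into a small search document."""
    item = heavy.get("item_meta", {})
    collection = heavy.get("collection_meta", {})
    container = heavy.get("container_meta", {})
    return {
        # Hypothetical document id: item identifier plus leaf number
        "doc_id": f"{item.get('identifier')}_{heavy['page_num']}",
        "item": item.get("identifier"),
        "page_num": heavy["page_num"],
        "page_label": heavy.get("page_label"),
        "container_id": container.get("ident"),
        # Prefer the container name, fall back to the collection title
        "container_name": container.get("name") or collection.get("title"),
        "volume": item.get("volume"),
        "issue": item.get("issue"),
        "year": item.get("year"),
        "body": heavy.get("fulltext", ""),
    }
```

Keeping the heavy record complete and doing the flattening in a separate pass means the ES schema can change without re-fetching or re-parsing items.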

## Implementation

Existing tools and libraries:

- `internetarchive` Python tool to fetch files and item metadata
- fatcat API client for container metadata lookup

New tools or libraries needed:

- issue DB, or the fatcat search index, to count releases by volume/issue
- djvu XML parser
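
As a rough sketch of what the djvu XML parser could do, the following streams pages out of a `_djvu.xml` file using only the standard library. It assumes the usual layout of one `OBJECT` element per page containing `LINE` and `WORD` elements; the function name `iter_djvu_pages` is hypothetical, and the leaf index is simply the position of the page in the file.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def iter_djvu_pages(source) -> Iterator[Tuple[int, str]]:
    """Yield (leaf_index, simple_text) for each page in a djvu XML file.

    `source` may be a filename or a binary file object. Pages are parsed
    incrementally so large items do not need to fit in memory at once.
    """
    leaf = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "OBJECT":
            continue
        lines = []
        for line in elem.iter("LINE"):
            words = [(w.text or "").strip() for w in line.iter("WORD")]
            words = [w for w in words if w]
            if words:
                lines.append(" ".join(words))
        yield leaf, "\n".join(lines)
        leaf += 1
        # Free the page subtree once its text has been extracted
        elem.clear()
```

Pages with no recognized text come out as empty strings rather than being skipped, so the leaf index stays aligned with the scanned pages.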