From f5a883642dd114ac2c29c72348bed05616189aa2 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Mon, 11 May 2020 19:12:13 -0700
Subject: start sketching proposals

---
 proposals/microfilm_indexing_pipeline.md | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 proposals/microfilm_indexing_pipeline.md

diff --git a/proposals/microfilm_indexing_pipeline.md b/proposals/microfilm_indexing_pipeline.md
new file mode 100644
index 0000000..657aae2
--- /dev/null
+++ b/proposals/microfilm_indexing_pipeline.md
@@ -0,0 +1,30 @@
+
+## High-Level
+
+- operate on an entire item
+- check against issue DB and/or fatcat search
+    => if there is fatcat work-level metadata for this issue, skip
+- fetch collection-level (journal) metadata
+- iterate through djvu text file:
+    => convert to simple text
+    => filter out non-research pages using quick heuristics
+    => try looking up "real" page number from OCR work (in item metadata)
+- generate "heavy" intermediate schema (per valid page):
+    => fatcat container metadata
+    => ia collection (journal) metadata
+    => item metadata
+    => page fulltext and any metadata
+
+- transform "heavy" intermediates to ES schema
+
+## Implementation
+
+Existing tools and libraries:
+
+- internetarchive python tool to fetch files and item metadata
+- fatcat API client for container metadata lookup
+
+New tools or libraries needed:
+
+- issue DB or use fatcat search index to count releases by volume/issue
+- djvu XML parser
-- 
cgit v1.2.3
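
The per-page steps listed under "High-Level" could be sketched roughly as below. This is a minimal illustration, not the proposal's implementation: it assumes the djvu XML nests `WORD` elements inside `LINE` elements inside one `OBJECT` element per page, and every function name, the `MIN_WORDS` threshold, the `page_numbers` mapping, and the ES field names (`item_id`, `container_name`, `leaf_index`, `page_number`, `body`) are hypothetical. The issue DB / fatcat lookup and all network fetching are left out, and the sample XML is a made-up placeholder.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, Iterator, List

# Hypothetical threshold standing in for the "quick heuristics" page filter.
MIN_WORDS = 5

# Made-up placeholder input: two pages, one nearly empty, one with body text.
SAMPLE_DJVU_XML = """<DjVuXML><BODY>
<OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD>Contents</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT>
<OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD>We</WORD><WORD>measured</WORD><WORD>the</WORD></LINE>
<LINE><WORD>growth</WORD><WORD>of</WORD><WORD>samples</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT>
</BODY></DjVuXML>"""


def iter_djvu_pages(djvu_xml: str) -> Iterator[Dict[str, Any]]:
    """Convert djvu XML to simple text, yielding one dict per page."""
    root = ET.fromstring(djvu_xml)
    for leaf_index, obj in enumerate(root.iter("OBJECT")):
        lines = []
        for line in obj.iter("LINE"):
            words = [w.text.strip() for w in line.iter("WORD")
                     if w.text and w.text.strip()]
            if words:
                lines.append(" ".join(words))
        yield {"leaf_index": leaf_index, "text": "\n".join(lines)}


def looks_like_research_page(text: str) -> bool:
    """Crude stand-in heuristic: keep pages with at least MIN_WORDS words."""
    return len(text.split()) >= MIN_WORDS


def build_heavy_page(page: Dict[str, Any], item_meta: Dict[str, Any],
                     collection_meta: Dict[str, Any],
                     container_meta: Dict[str, Any],
                     page_numbers: Dict[int, str]) -> Dict[str, Any]:
    """Bundle all metadata levels with one page ("heavy" intermediate)."""
    return {
        "fatcat_container": container_meta,
        "ia_collection": collection_meta,
        "item": item_meta,
        "page": {
            "leaf_index": page["leaf_index"],
            # "real" page number if known for this leaf, else None
            "page_number": page_numbers.get(page["leaf_index"]),
            "fulltext": page["text"],
        },
    }


def heavy_to_es_doc(heavy: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a heavy intermediate into a (hypothetical) ES document."""
    return {
        "item_id": heavy["item"].get("identifier"),
        "container_name": heavy["fatcat_container"].get("name"),
        "leaf_index": heavy["page"]["leaf_index"],
        "page_number": heavy["page"]["page_number"],
        "body": heavy["page"]["fulltext"],
    }


def index_item(djvu_xml: str, item_meta: Dict[str, Any],
               collection_meta: Dict[str, Any],
               container_meta: Dict[str, Any],
               page_numbers: Dict[int, str]) -> List[Dict[str, Any]]:
    """Run the per-page steps for one whole item."""
    docs = []
    for page in iter_djvu_pages(djvu_xml):
        if not looks_like_research_page(page["text"]):
            continue  # skip non-research pages
        heavy = build_heavy_page(page, item_meta, collection_meta,
                                 container_meta, page_numbers)
        docs.append(heavy_to_es_doc(heavy))
    return docs
```

Keeping the "heavy" intermediate as a separate step from the ES transform means the search schema can change without re-parsing the djvu files, which appears to be the motivation for the two-stage design in the proposal.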