From 5defd444135bc4adb0748b0d2b8c9b88708bdc1a Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Tue, 23 Mar 2021 21:42:32 -0700
Subject: proposals: add 2021 UI updates, and rename all to have a date in filename

---
 .../2020-05-11_microfilm_indexing_pipeline.md | 30 ++++++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 proposals/2020-05-11_microfilm_indexing_pipeline.md

(limited to 'proposals/2020-05-11_microfilm_indexing_pipeline.md')

diff --git a/proposals/2020-05-11_microfilm_indexing_pipeline.md b/proposals/2020-05-11_microfilm_indexing_pipeline.md
new file mode 100644
index 0000000..657aae2
--- /dev/null
+++ b/proposals/2020-05-11_microfilm_indexing_pipeline.md
@@ -0,0 +1,30 @@
+
+## High-Level
+
+- operate on an entire item
+- check against issue DB and/or fatcat search
+    => if there is fatcat work-level metadata for this issue, skip
+- fetch collection-level (journal) metadata
+- iterate through djvu text file:
+    => convert to simple text
+    => filter out non-research pages using quick heuristics
+    => try looking up "real" page number from OCR work (in item metadata)
+- generate "heavy" intermediate schema (per valid page):
+    => fatcat container metadata
+    => ia collection (journal) metadata
+    => item metadata
+    => page fulltext and any metadata
+
+- transform "heavy" intermediates to ES schema
+
+## Implementation
+
+Existing tools and libraries:
+
+- internetarchive python tool to fetch files and item metadata
+- fatcat API client for container metadata lookup
+
+New tools or libraries needed:
+
+- issue DB or use fatcat search index to count releases by volume/issue
+- djvu XML parser
-- 
cgit v1.2.3
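
Reviewer sketch (not part of the patch): the per-page steps in the proposal (iterate through the djvu file, convert to simple text, filter non-research pages, emit a "heavy" intermediate) might look roughly like the following. It assumes the djvu XML uses `OBJECT` elements for pages containing `LINE` and `WORD` elements, as Internet Archive `_djvu.xml` files typically do. The function names, the dict keys of the intermediate, and the filter thresholds are all hypothetical placeholders, not a schema the proposal defines.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, Optional


def iter_djvu_pages(djvu_xml) -> Iterator[Dict]:
    """Yield {"page_index", "text"} per page from a djvu XML path or file object.

    Assumption: each page is an OBJECT element whose text lives in
    LINE/WORD descendants.
    """
    page_index = 0
    for _event, elem in ET.iterparse(djvu_xml, events=("end",)):
        if elem.tag != "OBJECT":
            continue
        lines = []
        for line in elem.iter("LINE"):
            words = [w.text.strip() for w in line.iter("WORD")
                     if w.text and w.text.strip()]
            if words:
                lines.append(" ".join(words))
        yield {"page_index": page_index, "text": "\n".join(lines)}
        page_index += 1
        elem.clear()  # free the page subtree; whole items can be large


def looks_like_research_page(text: str, min_words: int = 50,
                             min_alpha_ratio: float = 0.5) -> bool:
    """Quick heuristic (illustrative thresholds): enough words, mostly letters."""
    words = text.split()
    if len(words) < min_words:
        return False
    with_letters = sum(1 for w in words if any(c.isalpha() for c in w))
    return with_letters / len(words) >= min_alpha_ratio


def build_heavy_docs(djvu_xml, item_meta: Dict,
                     collection_meta: Optional[Dict] = None,
                     container_meta: Optional[Dict] = None,
                     page_numbers: Optional[Dict[int, str]] = None
                     ) -> Iterator[Dict]:
    """Yield one hypothetical "heavy" intermediate dict per page that passes the filter.

    page_numbers maps the page index to the "real" page number, if known.
    """
    page_numbers = page_numbers or {}
    for page in iter_djvu_pages(djvu_xml):
        if not looks_like_research_page(page["text"]):
            continue
        idx = page["page_index"]
        yield {
            "fatcat_container": container_meta,
            "ia_collection": collection_meta,
            "item": item_meta,
            "page": {
                "index": idx,
                "number": page_numbers.get(idx),
                "fulltext": page["text"],
            },
        }
```

Fetching the djvu XML and item metadata (via the internetarchive tool) and the final transform to the ES schema are left out; the metadata dicts are passed in as-is so the sketch stays independent of those APIs.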