From f5a883642dd114ac2c29c72348bed05616189aa2 Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@archive.org>
Date: Mon, 11 May 2020 19:12:13 -0700
Subject: start sketching proposals

---
 proposals/microfilm_indexing_pipeline.md | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 proposals/microfilm_indexing_pipeline.md

(limited to 'proposals/microfilm_indexing_pipeline.md')

diff --git a/proposals/microfilm_indexing_pipeline.md b/proposals/microfilm_indexing_pipeline.md
new file mode 100644
index 0000000..657aae2
--- /dev/null
+++ b/proposals/microfilm_indexing_pipeline.md
@@ -0,0 +1,30 @@
+
+## High-Level
+
+- operate on an entire item
+- check against issue DB and/or fatcat search
+    => if there is fatcat work-level metadata for this issue, skip
+- fetch collection-level (journal) metadata
+- iterate through djvu text file:
+    => convert to simple text
+    => filter out non-research pages using quick heuristics
+    => try looking up "real" page number from OCR work (in item metadata)
+- generate "heavy" intermediate schema (per valid page):
+    => fatcat container metadata
+    => ia collection (journal) metadata
+    => item metadata
+    => page fulltext and any metadata
+
+- transform "heavy" intermediates to ES schema
+
+## Implementation
+
+Existing tools and libraries:
+
+- internetarchive python tool to fetch files and item metadata
+- fatcat API client for container metadata lookup
+
+New tools or libraries needed:
+
+- issue DB or use fatcat search index to count releases by volume/issue
+- djvu XML parser
-- 
cgit v1.2.3