author Bryan Newbold <bnewbold@archive.org> 2021-03-23 21:42:32 -0700
committer Bryan Newbold <bnewbold@archive.org> 2021-03-23 21:42:32 -0700
commit 5defd444135bc4adb0748b0d2b8c9b88708bdc1a
tree 599498f0a9ae5a3177d9702c3a7e8b70e39b2b4a
parent e70e7cff4b5c910405694fb297330507b49937b1
proposals: add 2021 UI updates, and rename all to have a date in filename
Diffstat (limited to 'proposals/2020-05-11_microfilm_indexing_pipeline.md')
-rw-r--r-- proposals/2020-05-11_microfilm_indexing_pipeline.md | 30
1 files changed, 30 insertions, 0 deletions
diff --git a/proposals/2020-05-11_microfilm_indexing_pipeline.md b/proposals/2020-05-11_microfilm_indexing_pipeline.md
new file mode 100644
index 0000000..657aae2
--- /dev/null
+++ b/proposals/2020-05-11_microfilm_indexing_pipeline.md
@@ -0,0 +1,30 @@
+
+## High-Level
+
+- operate on an entire item
+- check against issue DB and/or fatcat search
+ => if there is fatcat work-level metadata for this issue, skip
+- fetch collection-level (journal) metadata
+- iterate through djvu text file:
+ => convert to simple text
+ => filter out non-research pages using quick heuristics
+ => try looking up "real" page number from OCR work (in item metadata)
+- generate "heavy" intermediate schema (per valid page):
+ => fatcat container metadata
+ => ia collection (journal) metadata
+ => item metadata
+ => page fulltext and any metadata
+
+- transform "heavy" intermediates to ES schema
+
+## Implementation
+
+Existing tools and libraries:
+
+- internetarchive python tool to fetch files and item metadata
+- fatcat API client for container metadata lookup
+
+New tools or libraries needed:
+
+- issue DB (or use the fatcat search index) to count releases by volume/issue
+- djvu XML parser
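
Not part of the patch: a minimal sketch of the per-page steps from the "High-Level" list (djvu XML to simple text, a quick page filter, and the "heavy" intermediate record). It assumes the djvu XML nests `OBJECT` (one per page) > `LINE` > `WORD`, which should be checked against real item files. The function names, the word-count heuristic, the `min_words` threshold, and the output field names are all hypothetical, not taken from the proposal.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, Optional


def iter_djvu_pages(xml_text: str) -> Iterator[Dict]:
    """Yield one dict per page with its index and plain text.

    Assumes each <OBJECT> element is one page, containing <LINE>
    elements made of <WORD> elements.
    """
    root = ET.fromstring(xml_text)
    for page_index, obj in enumerate(root.iter("OBJECT")):
        lines = []
        for line in obj.iter("LINE"):
            words = [w.text.strip() for w in line.iter("WORD")
                     if w.text and w.text.strip()]
            if words:
                lines.append(" ".join(words))
        yield {"page_index": page_index, "text": "\n".join(lines)}


def looks_like_research_page(text: str, min_words: int = 200) -> bool:
    """Hypothetical quick heuristic: keep pages with enough alphabetic words."""
    return len(re.findall(r"[A-Za-z]{2,}", text)) >= min_words


def build_heavy_docs(
    xml_text: str,
    item_meta: Dict,
    collection_meta: Dict,
    container_meta: Dict,
    page_numbers: Optional[Dict[int, str]] = None,
    min_words: int = 200,
) -> Iterator[Dict]:
    """Yield one "heavy" intermediate record per page that passes the filter.

    page_numbers maps a page index to the "real" printed page number,
    when the item metadata provides one.
    """
    for page in iter_djvu_pages(xml_text):
        if not looks_like_research_page(page["text"], min_words):
            continue
        idx = page["page_index"]
        yield {
            "container": container_meta,    # fatcat container metadata
            "collection": collection_meta,  # ia collection (journal) metadata
            "item": item_meta,              # item metadata
            "page": {
                "index": idx,
                "page_number": (page_numbers or {}).get(idx),
                "text": page["text"],
            },
        }
```

This version parses the whole file in memory for brevity; for large items, a streaming parser such as `ET.iterparse` would likely be a better fit.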