author | Bryan Newbold <bnewbold@archive.org> | 2020-05-11 19:12:13 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-05-11 19:12:13 -0700 |
commit | f5a883642dd114ac2c29c72348bed05616189aa2 (patch) | |
tree | a6952af6c83529f563c34197fb269f55615e01f7 /proposals/microfilm_indexing_pipeline.md | |
parent | b5a8d71d6ca1f54c4ba0e558d021e347ec634319 (diff) | |
start sketching proposals
Diffstat (limited to 'proposals/microfilm_indexing_pipeline.md')
-rw-r--r-- | proposals/microfilm_indexing_pipeline.md | 30 |
1 files changed, 30 insertions, 0 deletions
diff --git a/proposals/microfilm_indexing_pipeline.md b/proposals/microfilm_indexing_pipeline.md
new file mode 100644
index 0000000..657aae2
--- /dev/null
+++ b/proposals/microfilm_indexing_pipeline.md
@@ -0,0 +1,30 @@
+
+## High-Level
+
+- operate on an entire item
+- check against issue DB and/or fatcat search
+    => if there is fatcat work-level metadata for this issue, skip
+- fetch collection-level (journal) metadata
+- iterate through djvu text file:
+    => convert to simple text
+    => filter out non-research pages using quick heuristics
+    => try looking up "real" page number from OCR work (in item metadata)
+- generate "heavy" intermediate schema (per valid page):
+    => fatcat container metadata
+    => ia collection (journal) metadata
+    => item metadata
+    => page fulltext and any metadata
+
+- transform "heavy" intermediates to ES schema
+
+## Implementation
+
+Existing tools and libraries:
+
+- internetarchive python tool to fetch files and item metadata
+- fatcat API client for container metadata lookup
+
+New tools or libraries needed:
+
+- issue DB or use fatcat search index to count releases by volume/issue
+- djvu XML parser
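The "iterate through djvu text file" and "generate heavy intermediate schema" steps in the proposal could be sketched roughly as below. This is not code from the commit. It assumes the DjVu XML layout of one `OBJECT` element per page containing `LINE` and `WORD` elements, and every function name, the `min_words` threshold, and the output field names are hypothetical placeholders. The fatcat container and IA collection metadata, and the "real" page number lookup, are left out.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, List


def iter_page_texts(djvu_xml) -> Iterator[str]:
    """Yield simple text for each page of a DjVu XML file or file object.

    Assumes one OBJECT element per page, with WORD elements nested in LINEs.
    """
    for _event, elem in ET.iterparse(djvu_xml, events=("end",)):
        if elem.tag != "OBJECT":
            continue
        lines = []
        for line in elem.iter("LINE"):
            words = [w.text for w in line.iter("WORD") if w.text]
            if words:
                lines.append(" ".join(words))
        yield "\n".join(lines)
        elem.clear()  # drop the page subtree to keep memory use flat


def looks_like_research_page(text: str, min_words: int = 50) -> bool:
    """Hypothetical quick heuristic: very short pages are likely covers or ads."""
    return len(text.split()) >= min_words


def build_intermediates(djvu_xml, item_meta: Dict, min_words: int = 50) -> List[Dict]:
    """Build one (partial) "heavy" intermediate document per valid page."""
    docs = []
    for leaf_index, text in enumerate(iter_page_texts(djvu_xml)):
        if not looks_like_research_page(text, min_words):
            continue
        docs.append({
            "item_metadata": item_meta,  # hypothetical field names
            "leaf_index": leaf_index,    # zero-based scan position, not the printed page number
            "fulltext": text,
        })
    return docs


# Tiny hand-written sample, not real Internet Archive data.
SAMPLE = b"""<DjVuXML><BODY>
<OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD>Vol.</WORD><WORD>12</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT>
<OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD>We</WORD><WORD>report</WORD><WORD>new</WORD></LINE>
<LINE><WORD>measurements</WORD><WORD>here.</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT>
</BODY></DjVuXML>"""

docs = build_intermediates(io.BytesIO(SAMPLE), {"identifier": "example_item"}, min_words=3)
```

Streaming with `iterparse` rather than loading the whole tree is a design guess based on "operate on an entire item": OCR files for a full microfilm reel can be large, so pages are processed and released one at a time.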