From 5defd444135bc4adb0748b0d2b8c9b88708bdc1a Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Tue, 23 Mar 2021 21:42:32 -0700
Subject: proposals: add 2021 UI updates, and rename all to have a date in filename

---
 proposals/fatcat_indexing_pipeline.md | 54 -----------------------------------
 1 file changed, 54 deletions(-)
 delete mode 100644 proposals/fatcat_indexing_pipeline.md

(limited to 'proposals/fatcat_indexing_pipeline.md')

diff --git a/proposals/fatcat_indexing_pipeline.md b/proposals/fatcat_indexing_pipeline.md
deleted file mode 100644
index deafb65..0000000
--- a/proposals/fatcat_indexing_pipeline.md
+++ /dev/null
@@ -1,54 +0,0 @@
-
-## High-Level
-
-Work-oriented: base input is arrays of expanded releases, all from the same
-work.
-
-Re-index pipeline would look at fatcat changelog or existing release feed, and
-use the `work_id` to fetch all other releases.
-
-Batch indexing pipeline would use a new variant of `fatcat-export` which is
-expanded releases (one-per-line), grouped (or sorted) by work id.
-
-Then, pipeline looks like:
-
-- choose canonical release
-- choose best access
-- choose best fulltext file
-    => iterate releases and files
-    => soft prefer canonical release, file access, release_date, etc
-    => check via postgrest query that fulltext is available
-    => fetch raw fulltext
-- check if we expect a SIM copy to exist
-    => eg, using an issue db?
-    => if so, fetch petabox metadata and try to confirm, so we can create a URL
-    => if we don't have another fulltext source (?):
-        => fetch djvu file and extract the pages in question (or just 1 if unsure?)
-- output "heavy" object
-
-Next step is:
-
-- summarize biblio metadata
-- select one abstract per language
-- sanitize abstracts and fulltext content for indexing
-- compute counts, epistimological quality, etc
-
-The output of that goes to Kafka for indexing into ES.
-
-This indexing process is probably going to be both CPU and network intensive.
-In python will want multiprocessing and maybe also async?
-
-## Implementation
-
-Existing tools/libraries:
-
-- fatcat-openapi-client
-- postgrest client
-- S3/minio/seaweed client
-- ftfy
-- language detection
-
-New needed (eventually):
-
-- strip latex
-- strip JATS or HTML
-- 
cgit v1.2.3
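
For context on the proposal removed by this patch: its first two pipeline stages (work-grouped input, then "choose canonical release") could be sketched roughly as below. This is an illustration only, not code from the fatcat repository. It assumes the export is one JSON release per line, already grouped or sorted by `work_id`. The `ident` and `release_stage` fields and the ranking rule are assumptions made for the example; the proposal says only "soft prefer" and does not define an ordering.

```python
import json
from itertools import groupby
from typing import Dict, Iterable, Iterator, List


def iter_work_batches(lines: Iterable[str]) -> Iterator[List[Dict]]:
    """Yield lists of releases that share a work_id.

    Assumes the input is already grouped (or sorted) by work_id, so
    consecutive lines with the same work_id belong to one work.
    """
    releases = (json.loads(line) for line in lines if line.strip())
    for _work_id, group in groupby(releases, key=lambda r: r["work_id"]):
        yield list(group)


def choose_canonical_release(releases: List[Dict]) -> Dict:
    """Pick one release per work using a hypothetical soft preference.

    Prefers releases whose (assumed) release_stage is "published", then
    the earliest release_date; releases without a date sort last.
    """
    def rank(release: Dict):
        is_published = release.get("release_stage") == "published"
        date = release.get("release_date") or "9999-99-99"
        # min() picks the smallest tuple, so False (published) wins first.
        return (not is_published, date)

    return min(releases, key=rank)


if __name__ == "__main__":
    sample = [
        '{"ident": "a", "work_id": "w1", "release_stage": "submitted", "release_date": "2019-01-01"}',
        '{"ident": "b", "work_id": "w1", "release_stage": "published", "release_date": "2020-05-01"}',
        '{"ident": "c", "work_id": "w2", "release_date": null}',
    ]
    for batch in iter_work_batches(sample):
        print(batch[0]["work_id"], choose_canonical_release(batch)["ident"])
```

Grouping with `itertools.groupby` keeps memory bounded to one work at a time, which is why the proposal's requirement that the export be grouped or sorted by work id matters: unsorted input would split a work into several batches.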