From 5defd444135bc4adb0748b0d2b8c9b88708bdc1a Mon Sep 17 00:00:00 2001
From: Bryan Newbold <bnewbold@archive.org>
Date: Tue, 23 Mar 2021 21:42:32 -0700
Subject: proposals: add 2021 UI updates, and rename all to have a date in
 filename

---
 proposals/fatcat_indexing_pipeline.md | 54 -----------------------------------
 1 file changed, 54 deletions(-)
 delete mode 100644 proposals/fatcat_indexing_pipeline.md

diff --git a/proposals/fatcat_indexing_pipeline.md b/proposals/fatcat_indexing_pipeline.md
deleted file mode 100644
index deafb65..0000000
--- a/proposals/fatcat_indexing_pipeline.md
+++ /dev/null
@@ -1,54 +0,0 @@
-
-## High-Level
-
-Work-oriented: base input is arrays of expanded releases, all from the same
-work.
-
-Re-index pipeline would look at fatcat changelog or existing release feed, and
-use the `work_id` to fetch all other releases.
-
-Batch indexing pipeline would use a new variant of `fatcat-export` which is
-expanded releases (one-per-line), grouped (or sorted) by work id.
-
-Then, pipeline looks like:
-
-- choose canonical release
-- choose best access
-- choose best fulltext file
-    => iterate releases and files
-    => soft prefer canonical release, file access, release_date, etc
-    => check via postgrest query that fulltext is available
-    => fetch raw fulltext
-- check if we expect a SIM copy to exist
-    => eg, using an issue db?
-    => if so, fetch petabox metadata and try to confirm, so we can create a URL
-    => if we don't have another fulltext source (?):
-        => fetch djvu file and extract the pages in question (or just 1 if unsure?)
-- output "heavy" object
-
-Next step is:
-
-- summarize biblio metadata
-- select one abstract per language
-- sanitize abstracts and fulltext content for indexing
-- compute counts, epistemological quality, etc
-
-The output of that goes to Kafka for indexing into ES.
-
-This indexing process is probably going to be both CPU and network intensive.
-In Python, this will likely need multiprocessing and maybe also async?
-
-## Implementation
-
-Existing tools/libraries:
-
-- fatcat-openapi-client
-- postgrest client
-- S3/minio/seaweed client
-- ftfy
-- language detection
-
-New needed (eventually):
-
-- strip latex
-- strip JATS or HTML