proposals: add 2021 UI updates, and rename all to have a date in filename

author: Bryan Newbold <bnewbold@archive.org> 2021-03-23 21:42:32 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2021-03-23 21:42:32 -0700
commit: 5defd444135bc4adb0748b0d2b8c9b88708bdc1a (patch)
tree: 599498f0a9ae5a3177d9702c3a7e8b70e39b2b4a /proposals/fatcat_indexing_pipeline.md
parent: e70e7cff4b5c910405694fb297330507b49937b1 (diff)
download: fatcat-scholar-5defd444135bc4adb0748b0d2b8c9b88708bdc1a.tar.gz fatcat-scholar-5defd444135bc4adb0748b0d2b8c9b88708bdc1a.zip
1 files changed, 0 insertions, 54 deletions
diff --git a/proposals/fatcat_indexing_pipeline.md b/proposals/fatcat_indexing_pipeline.md
deleted file mode 100644
index deafb65..0000000
--- a/proposals/fatcat_indexing_pipeline.md
+++ /dev/null
@@ -1,54 +0,0 @@
-
-## High-Level
-
-Work-oriented: base input is arrays of expanded releases, all from the same
-work.
-
-Re-index pipeline would look at fatcat changelog or existing release feed, and
-use the `work_id` to fetch all other releases.
-
-Batch indexing pipeline would use a new variant of `fatcat-export` which is
-expanded releases (one-per-line), grouped (or sorted) by work id.
-
-Then, pipeline looks like:
-
-- choose canonical release
-- choose best access
-- choose best fulltext file
-    => iterate releases and files
-    => soft prefer canonical release, file access, release_date, etc
-    => check via postgrest query that fulltext is available
-    => fetch raw fulltext
-- check if we expect a SIM copy to exist
-    => eg, using an issue db?
-    => if so, fetch petabox metadata and try to confirm, so we can create a URL
-    => if we don't have another fulltext source (?):
-        => fetch djvu file and extract the pages in question (or just 1 if unsure?)
-- output "heavy" object
-
-Next step is:
-
-- summarize biblio metadata
-- select one abstract per language
-- sanitize abstracts and fulltext content for indexing
-- compute counts, epistemological quality, etc
-
-The output of that goes to Kafka for indexing into ES.
-
-This indexing process is probably going to be both CPU- and network-intensive.
-In Python, we will want multiprocessing and maybe also async?
-
-## Implementation
-
-Existing tools/libraries:
-
-- fatcat-openapi-client
-- postgrest client
-- S3/minio/seaweed client
-- ftfy
-- language detection
-
-New needed (eventually):
-
-- strip latex
-- strip JATS or HTML
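
For reference, the work-oriented batch input and the "choose canonical release" step described in the removed proposal could be sketched roughly as below. This is a minimal illustration, not the fatcat-scholar implementation: it assumes one JSON release per line, already sorted or grouped by `work_id`, and the `STAGE_RANK` ordering and the earliest-date tie-break are hypothetical choices standing in for the proposal's "soft prefer" rules. The helper names `group_by_work` and `choose_canonical` are likewise invented for this example.

```python
import itertools
import json
from typing import Iterable, Iterator, List

# Hypothetical preference order; the proposal only says "soft prefer".
STAGE_RANK = {"published": 0, "accepted": 1, "submitted": 2}


def group_by_work(lines: Iterable[str]) -> Iterator[List[dict]]:
    """Yield lists of releases that share a work_id.

    Assumes the input is already sorted (or grouped) by work_id,
    since itertools.groupby only merges adjacent items.
    """
    releases = (json.loads(line) for line in lines if line.strip())
    for _work_id, batch in itertools.groupby(releases, key=lambda r: r["work_id"]):
        yield list(batch)


def choose_canonical(releases: List[dict]) -> dict:
    """Pick one release per work: best stage first, then earliest date."""

    def sort_key(release: dict):
        # Unknown or missing stages rank after all known stages.
        stage = STAGE_RANK.get(release.get("release_stage"), len(STAGE_RANK))
        # ISO dates compare correctly as strings; missing dates sort last.
        date = release.get("release_date") or "9999-99-99"
        return (stage, date)

    return min(releases, key=sort_key)
```

Because grouping is lazy, a large export can be streamed from a file handle one work at a time, which should also make it straightforward to hand each batch to a multiprocessing worker.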