author: Bryan Newbold <bnewbold@archive.org> 2021-03-23 21:42:32 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2021-03-23 21:42:32 -0700
commit: 5defd444135bc4adb0748b0d2b8c9b88708bdc1a
tree: 599498f0a9ae5a3177d9702c3a7e8b70e39b2b4a /proposals/2020-05-16_fatcat_indexing_pipeline.md
parent: e70e7cff4b5c910405694fb297330507b49937b1
proposals: add 2021 UI updates, and rename all to have a date in filename
Diffstat (limited to 'proposals/2020-05-16_fatcat_indexing_pipeline.md')
 proposals/2020-05-16_fatcat_indexing_pipeline.md | 54 +
 1 file changed, 54 insertions(+), 0 deletions(-)
diff --git a/proposals/2020-05-16_fatcat_indexing_pipeline.md b/proposals/2020-05-16_fatcat_indexing_pipeline.md
new file mode 100644
index 0000000..deafb65
--- /dev/null
+++ b/proposals/2020-05-16_fatcat_indexing_pipeline.md
@@ -0,0 +1,54 @@

## High-Level

Work-oriented: the base input is an array of expanded releases, all from the
same work.

The re-index pipeline would watch the fatcat changelog or an existing release
feed, and use the `work_id` to fetch all other releases of the same work.

The batch indexing pipeline would use a new variant of `fatcat-export` which
emits expanded releases (one per line), grouped (or sorted) by work id.

The pipeline then looks like:

- choose canonical release
- choose best access
- choose best fulltext file
  => iterate over releases and files
  => soft-prefer canonical release, file access, release_date, etc.
  => check via postgrest query that fulltext is available
  => fetch raw fulltext
- check whether we expect a SIM copy to exist
  => eg, using an issue db?
  => if so, fetch petabox metadata and try to confirm, so we can create a URL
  => if we don't have another fulltext source (?):
  => fetch the djvu file and extract the pages in question (or just 1 if unsure?)
- output "heavy" object

The next steps are:

- summarize biblio metadata
- select one abstract per language
- sanitize abstracts and fulltext content for indexing
- compute counts, epistemological quality, etc.

The output of that goes to Kafka for indexing into ES.

This indexing process will probably be both CPU- and network-intensive. In
Python we will want multiprocessing, and maybe also async I/O.

## Implementation

Existing tools/libraries:

- fatcat-openapi-client
- postgrest client
- S3/minio/seaweed client
- ftfy
- language detection

New needed (eventually):

- strip LaTeX
- strip JATS or HTML
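
The "choose canonical release" step above could be sketched roughly as below. This is an illustrative assumption, not the actual fatcat-scholar implementation: releases are plain dicts in the shape of an expanded release export, and the soft-preference weights (published `release_stage`, attached files, earliest `release_date`) are guesses at reasonable defaults.

```python
def choose_canonical_release(releases):
    """Pick one canonical release from all releases of a single work.

    Soft preferences (assumed, not the actual fatcat logic):
    1. release_stage "published" over anything else
    2. has attached files
    3. earliest release_date (ISO date strings sort correctly)
    """
    def sort_key(r):
        stage_rank = {"published": 0, "updated": 1}.get(r.get("release_stage"), 2)
        missing_files = 0 if r.get("files") else 1
        date = r.get("release_date") or "9999"  # missing dates sort last
        return (stage_rank, missing_files, date)
    return min(releases, key=sort_key)
```

The tuple sort key makes the preference ordering explicit and easy to extend with further criteria (e.g. file access quality) without restructuring the function.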
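
The "select one abstract per language" step might look like the following sketch. The dict shape (`lang`, `content` keys) mirrors the fatcat abstract entity; preferring the longest abstract as a tie-breaker is an assumption for illustration.

```python
def select_abstracts(abstracts):
    """Keep at most one abstract per language.

    Tie-break (assumed): prefer the longest content for each language;
    abstracts with no language code are bucketed under "unknown".
    """
    by_lang = {}
    for a in abstracts:
        lang = a.get("lang") or "unknown"
        current = by_lang.get(lang)
        if current is None or len(a.get("content") or "") > len(current.get("content") or ""):
            by_lang[lang] = a
    return list(by_lang.values())
```

In the real pipeline this would run after sanitization (ftfy, language detection), so that detected languages can fill in missing `lang` codes before deduplication.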
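
For the CPU- and network-intensive concern, one minimal sketch of the work-batch fan-out is below. The names `build_heavy_object` and `run_pipeline` are hypothetical; threads are used here because the fetch-heavy parts are network-bound, while the CPU-bound sanitization steps would want `multiprocessing` (or an asyncio layer for fetches) instead.

```python
from concurrent.futures import ThreadPoolExecutor

def build_heavy_object(work_releases):
    # Stand-in for the per-work transform described in the proposal:
    # the real version would select the canonical release, query
    # postgrest for fulltext availability, and fetch raw fulltext.
    return {
        "work_id": work_releases[0].get("work_id"),
        "release_count": len(work_releases),
    }

def run_pipeline(work_batches, max_workers=8):
    # Each batch is the full list of releases for one work; results
    # would be serialized and pushed to Kafka for ES indexing.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(build_heavy_object, work_batches))
```
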