proposals: add 2021 UI updates, and rename all to have a date in filename

author: Bryan Newbold <bnewbold@archive.org> 2021-03-23 21:42:32 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2021-03-23 21:42:32 -0700
commit: 5defd444135bc4adb0748b0d2b8c9b88708bdc1a (patch)
tree: 599498f0a9ae5a3177d9702c3a7e8b70e39b2b4a /proposals/2020-10-20_kafka_update_pipeline.md
parent: e70e7cff4b5c910405694fb297330507b49937b1 (diff)
1 files changed, 63 insertions, 0 deletions
diff --git a/proposals/2020-10-20_kafka_update_pipeline.md b/proposals/2020-10-20_kafka_update_pipeline.md
new file mode 100644
index 0000000..597a1b0
--- /dev/null
+++ b/proposals/2020-10-20_kafka_update_pipeline.md
@@ -0,0 +1,63 @@
+
+Goal: receive a continual stream of updates from both fatcat and SIM
+scanning, index the updated content, and push the results into elasticsearch.
+
+
+## Filtering and Affordances
+
+The `updated` and `fetched` timestamps are neither immediately necessary nor
+implemented yet, but they can be used to filter updates. For example, after
+re-loading from a bulk entity dump, we could "roll back" the update pipeline
+and process only fatcat (work) updates after the changelog index that the
+bulk dump is stamped with.
+
+At least in theory, the `fetched` timestamp could be used to prevent re-updates
+of existing documents in the ES index.
+
+The `doc_index_ts` timestamp in the ES index could be used in a future
+fetch-and-reindex worker to select documents for re-indexing, or to delete
+old/stale documents (eg, after SIM issue re-indexing if there were spurious
+"page" type documents remaining).
+
+## Message Types
+
+Scholar Update Request JSON
+- `key`: str
+- `type`: str
+    - `fatcat_work`
+    - `sim_issue`
+- `updated`: datetime, UTC, of event resulting in this request
+- `work_ident`: str (works)
+- `fatcat_changelog`: int (works)
+- `sim_item`: str (items)
+
+"Heavy Intermediate" JSON (existing schema)
+- `key`
+- `fetched`: Optional[datetime], UTC, when this doc was collected
+
+Scholar Fulltext ES JSON (existing schema)
+
+
+## Kafka Topics
+
+fatcat-ENV.work-ident-updates
+    6x, long retention, key compaction
+    key: doc ident
+scholar-ENV.sim-updates
+    6x, long retention, key compaction
+    key: doc ident
+scholar-ENV.update-docs
+    12x, short retention (2 months?)
+    key: doc ident
+
+## Workers
+
+scholar-fetch-docs-worker
+    consumes fatcat and/or sim update requests, individually
+    constructs heavy intermediate
+    publishes to update-docs topic
+
+scholar-index-docs-worker
+    consumes updated "heavy intermediate" documents, in batches
+    transforms to elasticsearch schema
+    updates elasticsearch
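
The Scholar Update Request message and the changelog-based "roll back" filter described in the proposal could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `ScholarUpdateRequest` class, the `work_update` and `after_changelog` helpers, and the `work_<ident>` key format are all assumptions made for the example. Only the field names come from the proposal.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


@dataclass
class ScholarUpdateRequest:
    """Hypothetical container mirroring the "Scholar Update Request JSON" fields."""

    key: str
    type: str  # "fatcat_work" or "sim_issue"
    updated: datetime  # UTC time of the event that triggered this request
    work_ident: Optional[str] = None  # only set for works
    fatcat_changelog: Optional[int] = None  # only set for works
    sim_item: Optional[str] = None  # only set for SIM updates

    def to_json(self) -> str:
        doc = asdict(self)
        doc["updated"] = self.updated.isoformat()
        # Omit fields that do not apply to this request type.
        return json.dumps({k: v for k, v in doc.items() if v is not None}, sort_keys=True)


def work_update(work_ident: str, changelog: int, updated: datetime) -> ScholarUpdateRequest:
    # Assumption: the message key is derived from the work ident.
    return ScholarUpdateRequest(
        key=f"work_{work_ident}",
        type="fatcat_work",
        updated=updated,
        work_ident=work_ident,
        fatcat_changelog=changelog,
    )


def after_changelog(
    requests: Iterable[ScholarUpdateRequest], dump_changelog: int
) -> Iterator[ScholarUpdateRequest]:
    """Skip work updates already covered by a bulk dump stamped at dump_changelog.

    Assumption: SIM updates carry no changelog index and always pass through.
    """
    for req in requests:
        if (
            req.type == "fatcat_work"
            and req.fatcat_changelog is not None
            and req.fatcat_changelog <= dump_changelog
        ):
            continue
        yield req
```

For example, `after_changelog(requests, dump_changelog=150)` would drop a work update stamped with changelog index 100 and keep one stamped with 200, which matches the "only updates after the changelog index of the bulk dump" behavior described under Filtering and Affordances.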