author Bryan Newbold <bnewbold@archive.org> 2021-03-23 21:42:32 -0700
committer Bryan Newbold <bnewbold@archive.org> 2021-03-23 21:42:32 -0700
commit 5defd444135bc4adb0748b0d2b8c9b88708bdc1a
tree 599498f0a9ae5a3177d9702c3a7e8b70e39b2b4a /proposals/kafka_update_pipeline.md
parent e70e7cff4b5c910405694fb297330507b49937b1
proposals: add 2021 UI updates, and rename all to have a date in filename
Diffstat (limited to 'proposals/kafka_update_pipeline.md')
-rw-r--r-- proposals/kafka_update_pipeline.md | 63
1 files changed, 0 insertions, 63 deletions
diff --git a/proposals/kafka_update_pipeline.md b/proposals/kafka_update_pipeline.md
deleted file mode 100644
index 597a1b0..0000000
--- a/proposals/kafka_update_pipeline.md
+++ /dev/null
@@ -1,63 +0,0 @@
-
-Goal: receive a continual stream of updates from both fatcat and SIM
-scanning, index the updated content, and push it into elasticsearch.
-
-
-## Filtering and Affordances
-
-The `updated` and `fetched` timestamps are not immediately necessary or
-implemented, but they can be used to filter updates. For example, after
-re-loading from a bulk entity dump, we could "roll back" the update pipeline
-and replay only the fatcat (work) updates that come after the changelog index
-the bulk dump is stamped with.
-
-At least in theory, the `fetched` timestamp could be used to prevent re-updates
-of existing documents in the ES index.
-
-The `doc_index_ts` timestamp in the ES index could be used in a future
-fetch-and-reindex worker to select documents for re-indexing, or to delete
-old/stale documents (eg, after SIM issue re-indexing if there were spurious
-"page" type documents remaining).
-
-## Message Types
-
-Scholar Update Request JSON
-- `key`: str
-- `type`: str
- - `fatcat_work`
- - `sim_issue`
-- `updated`: datetime, UTC, of event resulting in this request
-- `work_ident`: str (works)
-- `fatcat_changelog`: int (works)
-- `sim_item`: str (items)
-
-"Heavy Intermediate" JSON (existing schema)
-- `key`
-- `fetched`: Optional[datetime], UTC, when this doc was collected
-
-Scholar Fulltext ES JSON (existing schema)
-
-
-## Kafka Topics
-
-fatcat-ENV.work-ident-updates
- 6x, long retention, key compaction
- key: doc ident
-scholar-ENV.sim-updates
- 6x, long retention, key compaction
- key: doc ident
-scholar-ENV.update-docs
- 12x, short retention (2 months?)
- key: doc ident
-
-## Workers
-
-scholar-fetch-docs-worker
- consumes fatcat and/or sim update requests, individually
- constructs heavy intermediate
- publishes to update-docs topic
-
-scholar-index-docs-worker
- consumes updated "heavy intermediate" documents, in batches
- transforms to elasticsearch schema
- updates elasticsearch
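
The update request message and the changelog-based filtering described in the removed proposal can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the repository: the class name `ScholarUpdateRequest`, the helper `should_process`, the ISO 8601 encoding of `updated`, and the example key values are all hypothetical. Only the field names and the two `type` values come from the proposal.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ScholarUpdateRequest:
    """Hypothetical model of the "Scholar Update Request JSON" message."""

    key: str
    type: str  # "fatcat_work" or "sim_issue"
    updated: datetime  # UTC time of the event that triggered this request
    work_ident: Optional[str] = None  # set for works
    fatcat_changelog: Optional[int] = None  # set for works
    sim_item: Optional[str] = None  # set for SIM items

    def to_json(self) -> str:
        # Assumption: timestamps are serialized as ISO 8601 strings.
        doc = asdict(self)
        doc["updated"] = self.updated.isoformat()
        return json.dumps(doc, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "ScholarUpdateRequest":
        doc = json.loads(raw)
        doc["updated"] = datetime.fromisoformat(doc["updated"])
        return cls(**doc)


def should_process(req: ScholarUpdateRequest, dump_changelog_index: int) -> bool:
    """Skip work updates that a bulk dump at `dump_changelog_index` already covers.

    SIM updates carry no changelog index, so they are always processed.
    """
    if req.type == "fatcat_work":
        return (
            req.fatcat_changelog is not None
            and req.fatcat_changelog > dump_changelog_index
        )
    return True


if __name__ == "__main__":
    req = ScholarUpdateRequest(
        key="work_abc123",  # hypothetical key format
        type="fatcat_work",
        updated=datetime(2021, 3, 23, tzinfo=timezone.utc),
        work_ident="abc123",
        fatcat_changelog=5000,
    )
    print(req.to_json())
    print(should_process(req, dump_changelog_index=4000))
```

A fetch worker could call `should_process` on each consumed request before building the heavy intermediate document, so that replaying a topic after a bulk reload does not redo work the dump already contains.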