author | Bryan Newbold <bnewbold@archive.org> | 2021-03-23 21:42:32 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2021-03-23 21:42:32 -0700 |
commit | 5defd444135bc4adb0748b0d2b8c9b88708bdc1a (patch) | |
tree | 599498f0a9ae5a3177d9702c3a7e8b70e39b2b4a /proposals/kafka_update_pipeline.md | |
parent | e70e7cff4b5c910405694fb297330507b49937b1 (diff) | |
proposals: add 2021 UI updates, and rename all to have a date in filename
Diffstat (limited to 'proposals/kafka_update_pipeline.md')
-rw-r--r-- | proposals/kafka_update_pipeline.md | 63 |
1 files changed, 0 insertions, 63 deletions
diff --git a/proposals/kafka_update_pipeline.md b/proposals/kafka_update_pipeline.md
deleted file mode 100644
index 597a1b0..0000000
--- a/proposals/kafka_update_pipeline.md
+++ /dev/null
@@ -1,63 +0,0 @@
-
-Want to receive a continual stream of updates from both fatcat and SIM
-scanning; index the updated content; and push into elasticsearch.
-
-
-## Filtering and Affordances
-
-The `updated` and `fetched` timestamps are not immediately necessary or
-implemented, but they can be used to filter updates. For example, after
-re-loading from a build entity dump, could "roll back" update pipeline to only
-fatcat (work) updates after the changelog index that the bulk dump is stamped
-with.
-
-At least in theory, the `fetched` timestamp could be used to prevent re-updates
-of existing documents in the ES index.
-
-The `doc_index_ts` timestamp in the ES index could be used in a future
-fetch-and-reindex worker to select documents for re-indexing, or to delete
-old/stale documents (eg, after SIM issue re-indexing if there were spurious
-"page" type documents remaining).
-
-## Message Types
-
-Scholar Update Request JSON
-- `key`: str
-- `type`: str
-    - `fatcat_work`
-    - `sim_issue`
-- `updated`: datetime, UTC, of event resulting in this request
-- `work_ident`: str (works)
-- `fatcat_changelog`: int (works)
-- `sim_item`: str (items)
-
-"Heavy Intermediate" JSON (existing schema)
-- key
-- `fetched`: Optional[datetime], UTC, when this doc was collected
-
-Scholar Fulltext ES JSON (existing schema)
-
-
-## Kafka Topics
-
-fatcat-ENV.work-ident-updates
-    6x, long retention, key compaction
-    key: doc ident
-scholar-ENV.sim-updates
-    6x, long retention, key compaction
-    key: doc ident
-scholar-ENV.update-docs
-    12x, short retention (2 months?)
-    key: doc ident
-
-## Workers
-
-scholar-fetch-docs-worker
-    consumes fatcat and/or sim update requests, individually
-    constructs heavy intermediate
-    publishes to update-docs topic
-
-scholar-index-docs-worker
-    consumes updated "heavy intermediate" documents, in batches
-    transforms to elasticsearch schema
-    updates elasticsearch
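
For readers who want to see how the removed proposal's "Scholar Update Request JSON" message and its changelog-based filtering might fit together, here is a minimal Python sketch. It is not code from the fatcat-scholar repository. The class name `UpdateRequest`, the function `should_process`, the `dump_changelog` parameter, and the example key values are all hypothetical, and making every field other than `key` and `type` optional is an assumption based on the "(works)" and "(items)" annotations in the proposal.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class UpdateRequest:
    """Hypothetical model of the proposal's "Scholar Update Request JSON"."""

    key: str
    type: str  # "fatcat_work" or "sim_issue", per the proposal
    updated: Optional[datetime] = None  # UTC time of the triggering event
    work_ident: Optional[str] = None  # set for works only
    fatcat_changelog: Optional[int] = None  # set for works only
    sim_item: Optional[str] = None  # set for SIM items only

    def to_json(self) -> str:
        """Serialize to JSON, with `updated` as an ISO 8601 UTC string."""
        doc = asdict(self)
        if self.updated is not None:
            # Assumes `updated` is timezone-aware; a naive datetime would be
            # interpreted as local time by astimezone().
            doc["updated"] = self.updated.astimezone(timezone.utc).isoformat()
        return json.dumps(doc, sort_keys=True)


def should_process(req: UpdateRequest, dump_changelog: int) -> bool:
    """Sketch of the "roll back" filter: after a bulk reload, keep only
    fatcat work updates newer than the changelog index the dump is stamped
    with. Other request types, and requests without a changelog index,
    pass through unfiltered (an assumption, not stated in the proposal).
    """
    if req.type != "fatcat_work" or req.fatcat_changelog is None:
        return True
    return req.fatcat_changelog > dump_changelog


if __name__ == "__main__":
    req = UpdateRequest(
        key="work_abc123",  # hypothetical key format
        type="fatcat_work",
        updated=datetime(2021, 3, 1, tzinfo=timezone.utc),
        work_ident="abc123",
        fatcat_changelog=100,
    )
    print(req.to_json())
    print(should_process(req, dump_changelog=99))
```

A fetch worker along the lines described in the proposal could apply a check like `should_process` to each consumed message before building the heavy intermediate document. Where the dump's changelog index would actually come from is left open here.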