From d9a8c44fdddacd09a2a14139ae673ad386232f3b Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Fri, 16 Oct 2020 18:15:41 -0700 Subject: proposal: kafka update pipeline(s) --- proposals/kafka_update_pipeline.md | 47 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 47 insertions(+) create mode 100644 proposals/kafka_update_pipeline.md (limited to 'proposals/kafka_update_pipeline.md') diff --git a/proposals/kafka_update_pipeline.md b/proposals/kafka_update_pipeline.md new file mode 100644 index 0000000..86ee167 --- /dev/null +++ b/proposals/kafka_update_pipeline.md @@ -0,0 +1,47 @@ + +Want to receive a continual stream of updates from both fatcat and SIM +scanning; index the updated content; and push into elasticsearch. + + +## Message Types + +Scholar Update Request JSON +- `key`: str +- `type`: str + - `fatcat_work` + - `sim_issue` +- `updated`: datetime, UTC, of event resulting in this request +- `work_ident`: str (works) +- `fatcat_changelog`: int (works) +- `sim_item`: str (items) + +"Heavy Intermediate" JSON (existing schema) +- key +- `fetched`: Optional[datetime], UTC, when this doc was collected + +Scholar Fulltext ES JSON (existing schema) + + +## Kafka Topics + +fatcat-ENV.work-ident-updates + 6x, long retention, key compaction + key: doc ident +scholar-ENV.sim-updates + 6x, long retention, key compaction + key: doc ident +scholar-ENV.update-docs + 12x, short retention (2 months?) + key: doc ident + +## Workers + +scholar-fetch-docs-worker + consumes fatcat and/or sim update requests, individually + constructs heavy intermediate + publishes to update-docs topic + +scholar-index-docs-worker + consumes updated "heavy intermediate" documents, in batches + transforms to elasticsearch schema + updates elasticsearch -- cgit v1.2.3 From da121979c8481a5e1f6cf103e2d77363b31018c9 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Fri, 16 Oct 2020 18:53:41 -0700 Subject: SQUASH: proposal --- proposals/kafka_update_pipeline.md | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) (limited to 'proposals/kafka_update_pipeline.md') diff --git a/proposals/kafka_update_pipeline.md b/proposals/kafka_update_pipeline.md index 86ee167..a953d9c 100644 --- a/proposals/kafka_update_pipeline.md +++ b/proposals/kafka_update_pipeline.md @@ -3,11 +3,26 @@ Want to receive a continual stream of updates from both fatcat and SIM scanning; index the updated content; and push into elasticsearch. +## Filtering and Affordances + +The `updated` and `fetched` timestamps are not immediately necessary or +implemented, but they can be used to filter updates. For example, after +re-loading from a build entity dump, could "roll back" update pipeline to only +fatcat (work) updates after the changelog index that the bulk dump is stamped +with. + +At least in theory, the `fetched` timestamp could be used to prevent re-updates +of existing documents in the ES index. + +The `doc_index_ts` timestamp in the ES index could be used in a future +fetch-and-reindex worker to select documents for re-indexing, or to delete +old/stale documents (eg, after SIM issue re-indexing if there were spurious +"page" type documents remaining). + ## Message Types Scholar Update Request JSON -- `key`: str -- `type`: str +- `key`: str - `type`: str - `fatcat_work` - `sim_issue` - `updated`: datetime, UTC, of event resulting in this request -- cgit v1.2.3