author: Bryan Newbold <bnewbold@archive.org> 2020-10-16 18:53:41 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-10-16 18:53:41 -0700
commit: da121979c8481a5e1f6cf103e2d77363b31018c9
tree: b4b4b81be7015653b92f66f96c4fef623719410e
parent: 49a68238c9c7ee1ef0e142b91b0881fda058d39b
SQUASH: proposal
-rw-r--r-- proposals/kafka_update_pipeline.md | 19
1 files changed, 17 insertions, 2 deletions
diff --git a/proposals/kafka_update_pipeline.md b/proposals/kafka_update_pipeline.md
index 86ee167..a953d9c 100644
--- a/proposals/kafka_update_pipeline.md
+++ b/proposals/kafka_update_pipeline.md
@@ -3,11 +3,26 @@ Want to receive a continual stream of updates from both fatcat and SIM
scanning; index the updated content; and push into elasticsearch.
+## Filtering and Affordances
+
+The `updated` and `fetched` timestamps are not immediately necessary or
+implemented, but they can be used to filter updates. For example, after
+re-loading from a bulk entity dump, we could "roll back" the update pipeline
+to only fatcat (work) updates after the changelog index that the bulk dump is
+stamped with.
+
+At least in theory, the `fetched` timestamp could be used to prevent re-updates
+of existing documents in the ES index.
+
+The `doc_index_ts` timestamp in the ES index could be used in a future
+fetch-and-reindex worker to select documents for re-indexing, or to delete
+old/stale documents (eg, after SIM issue re-indexing if there were spurious
+"page" type documents remaining).
+
## Message Types
Scholar Update Request JSON
-- `key`: str
-- `type`: str
+- `key`: str - `type`: str
- `fatcat_work`
- `sim_issue`
- `updated`: datetime, UTC, of event resulting in this request