author | Bryan Newbold <bnewbold@archive.org> | 2020-10-16 18:53:41 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-10-16 18:53:41 -0700 |
commit | da121979c8481a5e1f6cf103e2d77363b31018c9 (patch) | |
tree | b4b4b81be7015653b92f66f96c4fef623719410e /proposals | |
parent | 49a68238c9c7ee1ef0e142b91b0881fda058d39b (diff) | |
SQUASH: proposal
Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/kafka_update_pipeline.md | 19 |
1 files changed, 17 insertions, 2 deletions
diff --git a/proposals/kafka_update_pipeline.md b/proposals/kafka_update_pipeline.md
index 86ee167..a953d9c 100644
--- a/proposals/kafka_update_pipeline.md
+++ b/proposals/kafka_update_pipeline.md
@@ -3,11 +3,26 @@
 Want to receive a continual stream of updates from both fatcat and SIM
 scanning; index the updated content; and push into elasticsearch.
 
+## Filtering and Affordances
+
+The `updated` and `fetched` timestamps are not immediately necessary or
+implemented, but they can be used to filter updates. For example, after
+re-loading from a build entity dump, could "roll back" update pipeline to only
+fatcat (work) updates after the changelog index that the bulk dump is stamped
+with.
+
+At least in theory, the `fetched` timestamp could be used to prevent re-updates
+of existing documents in the ES index.
+
+The `doc_index_ts` timestamp in the ES index could be used in a future
+fetch-and-reindex worker to select documents for re-indexing, or to delete
+old/stale documents (eg, after SIM issue re-indexing if there were spurious
+"page" type documents remaining).
+
 ## Message Types
 
 Scholar Update Request JSON
-- `key`: str
-- `type`: str
+- `key`: str - `type`: str
   - `fatcat_work`
   - `sim_issue`
 - `updated`: datetime, UTC, of event resulting in this request
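
The filtering idea in the added "Filtering and Affordances" section can be illustrated with a minimal Python sketch. It is not code from this commit or from the fatcat-scholar repository. It assumes that an update request arrives as a JSON object carrying the `key`, `type`, and `updated` fields listed under "Message Types", and that `updated` is serialized as an ISO 8601 string. The names `UpdateRequest`, `parse_request`, and `updates_after` are hypothetical, as is the choice to filter on the `updated` timestamp rather than on a changelog index.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator


@dataclass
class UpdateRequest:
    """Hypothetical in-memory form of a Scholar Update Request message."""

    key: str
    type: str  # "fatcat_work" or "sim_issue", per the proposal
    updated: datetime  # UTC time of the event resulting in this request


def parse_request(raw: str) -> UpdateRequest:
    """Decode one JSON message.

    Assumption: `updated` is an ISO 8601 string. A value without an
    offset is treated as UTC, since the proposal specifies UTC.
    """
    obj = json.loads(raw)
    updated = datetime.fromisoformat(obj["updated"])
    if updated.tzinfo is None:
        updated = updated.replace(tzinfo=timezone.utc)
    return UpdateRequest(key=obj["key"], type=obj["type"], updated=updated)


def updates_after(
    requests: Iterable[UpdateRequest],
    cutoff: datetime,
    type_: str = "fatcat_work",
) -> Iterator[UpdateRequest]:
    """Yield only requests of the given type updated strictly after `cutoff`."""
    for req in requests:
        if req.type == type_ and req.updated > cutoff:
            yield req
```

After a bulk re-load, a consumer along these lines could pass the time the dump was stamped as `cutoff` and skip every older work update. Filtering on the changelog index itself, as the proposal text describes, would need a field that is not shown in this hunk.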