author | Bryan Newbold <bnewbold@archive.org> | 2020-10-16 18:15:41 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-10-16 18:15:41 -0700 |
commit | d9a8c44fdddacd09a2a14139ae673ad386232f3b (patch) | |
tree | 256d4dafdd935ed1f6e57cb0086cd8704b011f1b /proposals | |
parent | 7497d1baf0c3a9c24f5b9ce05c9567e555e4e6c9 (diff) | |
proposal: kafka update pipeline(s)
Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/kafka_update_pipeline.md | 47 |
1 files changed, 47 insertions, 0 deletions
diff --git a/proposals/kafka_update_pipeline.md b/proposals/kafka_update_pipeline.md
new file mode 100644
index 0000000..86ee167
--- /dev/null
+++ b/proposals/kafka_update_pipeline.md
@@ -0,0 +1,47 @@
+
+Want to receive a continual stream of updates from both fatcat and SIM
+scanning; index the updated content; and push into elasticsearch.
+
+
+## Message Types
+
+Scholar Update Request JSON
+- `key`: str
+- `type`: str
+    - `fatcat_work`
+    - `sim_issue`
+- `updated`: datetime, UTC, of event resulting in this request
+- `work_ident`: str (works)
+- `fatcat_changelog`: int (works)
+- `sim_item`: str (items)
+
+"Heavy Intermediate" JSON (existing schema)
+- key
+- `fetched`: Optional[datetime], UTC, when this doc was collected
+
+Scholar Fulltext ES JSON (existing schema)
+
+
+## Kafka Topics
+
+fatcat-ENV.work-ident-updates
+    6x, long retention, key compaction
+    key: doc ident
+scholar-ENV.sim-updates
+    6x, long retention, key compaction
+    key: doc ident
+scholar-ENV.update-docs
+    12x, short retention (2 months?)
+    key: doc ident
+
+## Workers
+
+scholar-fetch-docs-worker
+    consumes fatcat and/or sim update requests, individually
+    constructs heavy intermediate
+    publishes to update-docs topic
+
+scholar-index-docs-worker
+    consumes updated "heavy intermediate" documents, in batches
+    transforms to elasticsearch schema
+    updates elasticsearch
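
The "Scholar Update Request JSON" message and the batch consumption described for scholar-index-docs-worker could be sketched in Python roughly as follows. This is a minimal illustration, not code from the fatcat-scholar repository. The class name `ScholarUpdateRequest`, the `batched` helper, and the validation rules are assumptions: the proposal only marks `work_ident` and `fatcat_changelog` as "(works)" and `sim_item` as "(items)", so which fields are mandatory per type is a guess. No Kafka client is used; messages are plain JSON strings.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, List, Optional


@dataclass
class ScholarUpdateRequest:
    """Hypothetical model of the "Scholar Update Request JSON" fields."""

    key: str
    type: str  # "fatcat_work" or "sim_issue"
    updated: datetime  # UTC time of the event that triggered this request
    work_ident: Optional[str] = None  # works only
    fatcat_changelog: Optional[int] = None  # works only
    sim_item: Optional[str] = None  # SIM items only

    def __post_init__(self) -> None:
        # Assumed rules; the proposal does not spell out required fields.
        if self.type not in ("fatcat_work", "sim_issue"):
            raise ValueError(f"unknown update type: {self.type}")
        if self.updated.tzinfo is None:
            raise ValueError("updated must be timezone-aware")
        if self.type == "fatcat_work" and not self.work_ident:
            raise ValueError("fatcat_work requests need work_ident")
        if self.type == "sim_issue" and not self.sim_item:
            raise ValueError("sim_issue requests need sim_item")

    def to_json(self) -> str:
        doc = asdict(self)
        # Serialize the timestamp as an ISO 8601 string in UTC.
        doc["updated"] = self.updated.astimezone(timezone.utc).isoformat()
        # Omit optional fields that are not set.
        doc = {k: v for k, v in doc.items() if v is not None}
        return json.dumps(doc, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "ScholarUpdateRequest":
        doc = json.loads(raw)
        doc["updated"] = datetime.fromisoformat(doc["updated"])
        return cls(**doc)


def batched(messages: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group a message stream into lists of at most `size` items."""
    batch: List[str] = []
    for msg in messages:
        batch.append(msg)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly short, batch.
        yield batch
```

A fetch worker would decode each request on its own with `ScholarUpdateRequest.from_json`, while an index worker would wrap its consumer loop in something like `batched(stream, size)` before issuing one bulk update per batch. The batch size and the key format are left open, since the proposal does not state them.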