author Bryan Newbold <bnewbold@archive.org> 2020-10-16 18:15:41 -0700
committer Bryan Newbold <bnewbold@archive.org> 2020-10-16 18:15:41 -0700
commit d9a8c44fdddacd09a2a14139ae673ad386232f3b
tree 256d4dafdd935ed1f6e57cb0086cd8704b011f1b
parent 7497d1baf0c3a9c24f5b9ce05c9567e555e4e6c9
proposal: kafka update pipeline(s)
-rw-r--r-- proposals/kafka_update_pipeline.md 47
1 files changed, 47 insertions, 0 deletions
diff --git a/proposals/kafka_update_pipeline.md b/proposals/kafka_update_pipeline.md
new file mode 100644
index 0000000..86ee167
--- /dev/null
+++ b/proposals/kafka_update_pipeline.md
@@ -0,0 +1,47 @@
+
+Want to receive a continual stream of updates from both fatcat and SIM
+scanning; index the updated content; and push into elasticsearch.
+
+
+## Message Types
+
+Scholar Update Request JSON
+- `key`: str
+- `type`: str
+ - `fatcat_work`
+ - `sim_issue`
+- `updated`: datetime, UTC, of event resulting in this request
+- `work_ident`: str (works)
+- `fatcat_changelog`: int (works)
+- `sim_item`: str (items)
+
+"Heavy Intermediate" JSON (existing schema)
+- key
+- `fetched`: Optional[datetime], UTC, when this doc was collected
+
+Scholar Fulltext ES JSON (existing schema)
+
+
+## Kafka Topics
+
+fatcat-ENV.work-ident-updates
+ 6x, long retention, key compaction
+ key: doc ident
+scholar-ENV.sim-updates
+ 6x, long retention, key compaction
+ key: doc ident
+scholar-ENV.update-docs
+ 12x, short retention (2 months?)
+ key: doc ident
+
+## Workers
+
+scholar-fetch-docs-worker
+ consumes fatcat and/or sim update requests, individually
+ constructs heavy intermediate
+ publishes to update-docs topic
+
+scholar-index-docs-worker
+ consumes updated "heavy intermediate" documents, in batches
+ transforms to elasticsearch schema
+ updates elasticsearch
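
The message and worker shapes described in the proposal above could be sketched roughly as follows. This is a minimal illustration, not code from the repository: the `UpdateRequest` class, the `work_update` and `batched` helpers, and the `work_<ident>` key format are all hypothetical names chosen for the example, and no Kafka or elasticsearch client is involved. It only shows one way the "Scholar Update Request JSON" fields might be validated and serialized, and how the index worker's "in batches" consumption might group documents.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, List, Optional

# The two request types named in the proposal.
VALID_TYPES = ("fatcat_work", "sim_issue")


@dataclass
class UpdateRequest:
    """Hypothetical model of the 'Scholar Update Request JSON' fields."""

    key: str
    type: str
    updated: datetime
    work_ident: Optional[str] = None        # only set for works
    fatcat_changelog: Optional[int] = None  # only set for works
    sim_item: Optional[str] = None          # only set for SIM items

    def __post_init__(self) -> None:
        if self.type not in VALID_TYPES:
            raise ValueError(f"unknown update type: {self.type}")
        if self.updated.tzinfo is None:
            raise ValueError("updated must be timezone-aware (UTC)")

    def to_json(self) -> str:
        # Drop unset optional fields and serialize the timestamp as UTC ISO 8601.
        doc = {k: v for k, v in asdict(self).items() if v is not None}
        doc["updated"] = self.updated.astimezone(timezone.utc).isoformat()
        return json.dumps(doc, sort_keys=True)


def work_update(work_ident: str, changelog: int, updated: datetime) -> UpdateRequest:
    # ASSUMPTION: the "work_<ident>" key format is illustrative only; the
    # proposal says the key is the "doc ident" without giving a format.
    return UpdateRequest(
        key=f"work_{work_ident}",
        type="fatcat_work",
        updated=updated,
        work_ident=work_ident,
        fatcat_changelog=changelog,
    )


def batched(items: Iterable, size: int) -> Iterator[List]:
    """Group a stream of documents into lists of at most `size` items."""
    batch: List = []
    for item in items:
        batch.append(item)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch
```

A fetch worker in this sketch would call `work_update(...)` (with a real work ident in place of any placeholder) and publish `to_json()` under `key`, so that key compaction on the topic keeps only the latest request per document. An index worker would wrap its consumer loop in `batched(...)` before transforming and writing each group.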