author: Bryan Newbold <bnewbold@archive.org> 2020-01-14 17:27:25 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2020-01-14 17:27:25 -0800
commit: 41722424932de333e5b649ccecbcde9f671610b7
tree: bef9cff1da00521356208760e02ab97770fce890 /kafka
parent: 6a56b922ced9013c6d09027d771dc8c4fc80421e
add new bulk ingest request topic
Diffstat (limited to 'kafka')
-rw-r--r-- kafka/topics.md | 7
1 files changed, 6 insertions, 1 deletions
diff --git a/kafka/topics.md b/kafka/topics.md
index b727e26..6646f49 100644
--- a/kafka/topics.md
+++ b/kafka/topics.md
@@ -26,11 +26,15 @@ retention (on both a size and time basis).
=> key is sha1hex of PDF. enable time compaction (6 months?)
sandcrawler-ENV.ingest-file-requests
- => ingest requests from multiple sources
+ => ingest requests from multiple sources; mostly continuous or pseudo-interactive
=> schema is JSON; see ingest proposal for fields. small objects.
=> fewer partitions with batch mode, but still a bunch (24)
=> can't think of a good key, so none. enable time compaction (3-6 months?)
+ sandcrawler-ENV.ingest-file-requests-bulk
+ => ingest requests from bulk crawl sources; background processing
+ => same as ingest-file-requests, but fewer partitions (12)
+
sandcrawler-ENV.ingest-file-results
=> ingest requests from multiple sources
=> schema is JSON; see ingest proposal for fields. small objects.
@@ -112,6 +116,7 @@ exists`; this seems safe, and the settings won't be over-ridden.
./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 12 --topic sandcrawler-qa.grobid-output-pg --config compression.type=gzip --config cleanup.policy=compact
./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 24 --topic sandcrawler-qa.ingest-file-requests --config retention.ms=7889400000 --config cleanup.policy=delete
+ ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 12 --topic sandcrawler-qa.ingest-file-requests-bulk --config retention.ms=7889400000 --config cleanup.policy=delete
./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 6 --topic sandcrawler-qa.ingest-file-results
./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 1 --topic fatcat-qa.changelog
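
The two ingest request topics differ only in name and partition count (24 for the continuous topic, 12 for the bulk one). Both use `retention.ms=7889400000`, which works out to 91.3125 days, or roughly three months. As a minimal sketch, the pair of create commands could be generated from one template. The function name `ingest_request_topic_cmds` is hypothetical and not part of the repository, and the function only prints the commands rather than running them:

```shell
#!/bin/sh
# Hypothetical helper (not from the sandcrawler repo): print, but do not run,
# the kafka-topics.sh create commands for both ingest request topics.
# $1 is the environment name that replaces ENV in the topic name, e.g. "qa".
ingest_request_topic_cmds() {
    env="$1"
    # Each spec is "topic-suffix:partition-count", taken from the notes above.
    for spec in "ingest-file-requests:24" "ingest-file-requests-bulk:12"; do
        name="${spec%%:*}"        # text before the colon
        partitions="${spec##*:}"  # text after the colon
        # retention.ms=7889400000 is 91.3125 days (7889400000 / 86400000).
        echo "./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions ${partitions} --topic sandcrawler-${env}.${name} --config retention.ms=7889400000 --config cleanup.policy=delete"
    done
}

ingest_request_topic_cmds qa
```

Called with `qa`, this prints the same two commands that appear in the hunk above. Printing instead of executing keeps the sketch safe to run on a machine without a Kafka installation.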