Diffstat (limited to 'kafka/topics.md')
-rw-r--r--  kafka/topics.md | 7
1 file changed, 6 insertions, 1 deletion
diff --git a/kafka/topics.md b/kafka/topics.md
index b727e26..6646f49 100644
--- a/kafka/topics.md
+++ b/kafka/topics.md
@@ -26,11 +26,15 @@ retention (on both a size and time basis).
         => key is sha1hex of PDF. enable time compaction (6 months?)
 
     sandcrawler-ENV.ingest-file-requests
-        => ingest requests from multiple sources
+        => ingest requests from multiple sources; mostly continuous or pseudo-interactive
         => schema is JSON; see ingest proposal for fields. small objects.
         => fewer partitions with batch mode, but still a bunch (24)
         => can't think of a good key, so none. enable time compaction (3-6 months?)
 
+    sandcrawler-ENV.ingest-file-requests-bulk
+        => ingest requests from bulk crawl sources; background processing
+        => same as ingest-file-requests, but fewer partiions (12)
+
     sandcrawler-ENV.ingest-file-results
         => ingest requests from multiple sources
         => schema is JSON; see ingest proposal for fields. small objects.
@@ -112,6 +116,7 @@ exists`; this seems safe, and the settings won't be over-ridden.
     ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 12 --topic sandcrawler-qa.grobid-output-pg --config compression.type=gzip --config cleanup.policy=compact
 
     ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 24 --topic sandcrawler-qa.ingest-file-requests --config retention.ms=7889400000 --config cleanup.policy=delete
+    ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 12 --topic sandcrawler-qa.ingest-file-requests-bulk --config retention.ms=7889400000 --config cleanup.policy=delete
     ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 6 --topic sandcrawler-qa.ingest-file-results
 
     ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 1 --topic fatcat-qa.changelog
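
Note on `retention.ms=7889400000`: the topic notes mention a 3-6 month window, and a quick check suggests the value used here sits at the low end of that range (about a quarter of a 365.25-day year). The sketch below only does the arithmetic and builds a topic name; the function names `retention_days` and `bulk_topic` are hypothetical helpers, not part of the repository, and it assumes a shell with 64-bit integer arithmetic.

```shell
#!/bin/sh
# Hypothetical helpers for sanity-checking the settings in this change.

# Convert a Kafka retention.ms value (milliseconds) to whole days.
# 86400000 = 24 * 60 * 60 * 1000 ms per day; integer division truncates.
retention_days() {
    echo $(( $1 / 86400000 ))
}

# Build the bulk ingest topic name for an environment such as "qa",
# following the sandcrawler-ENV.* naming pattern from the topic notes.
bulk_topic() {
    echo "sandcrawler-$1.ingest-file-requests-bulk"
}

retention_days 7889400000   # 91 (91.3125 days before truncation)
bulk_topic qa               # sandcrawler-qa.ingest-file-requests-bulk

# Not executed here: on a broker using the same ZooKeeper-based CLI as the
# commands in this change, the applied settings could be inspected with
#   ./kafka-topics.sh --describe --zookeeper localhost:2181 --topic "$(bulk_topic qa)"
```

Since the bulk topic reuses the same `retention.ms` and `cleanup.policy=delete` as `ingest-file-requests`, the only intended difference between the two create commands is the partition count (12 versus 24).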