diff options
| author | Bryan Newbold <bnewbold@archive.org> | 2020-10-27 15:54:10 -0700 | 
|---|---|---|
| committer | Bryan Newbold <bnewbold@archive.org> | 2020-10-27 15:54:10 -0700 | 
| commit | 8e30d5ff73703a74c939b398e8c73b6f43c87fe0 (patch) | |
| tree | 3b2b921a983342844b255d1791741f896e9f8532 /kafka | |
| parent | 12a51fd28ca64338fca040ab7c470a70bf7a2a1b (diff) | |
| download | sandcrawler-8e30d5ff73703a74c939b398e8c73b6f43c87fe0.tar.gz sandcrawler-8e30d5ff73703a74c939b398e8c73b6f43c87fe0.zip  | |
kafka topics for fatcat -> scholar pipeline
Diffstat (limited to 'kafka')
| -rw-r--r-- | kafka/topics.md | 19 | 
1 files changed, 19 insertions, 0 deletions
diff --git a/kafka/topics.md b/kafka/topics.md index ebe7a61..7a34c83 100644 --- a/kafka/topics.md +++ b/kafka/topics.md @@ -110,6 +110,22 @@ retention (on both a size and time basis).      fatcat-ENV.file-updates          => key: fcid          => 4x partitions +    fatcat-ENV.work-ident-updates +        => work identifiers when updated and needs re-indexing (eg, in scholar) +        => 6x partitions +        => key: doc ident ("work_{ident}") +        => key compaction possible; long retention + +    scholar-ENV.sim-updates +        => 6x partitions +        => key: "sim_item_{}" +        => key compaction possible; long retention +    scholar-ENV.update-docs +        => 12x partitions +        => key: scholar doc identifer +        => gzip compression +        => key compaction possible +        => short time-based retention (2 months?)  ### Deprecated/Unused Topics @@ -157,6 +173,7 @@ exists`; this seems safe, and the settings won't be over-ridden.      ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 8 --topic fatcat-qa.work-updates      ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 4 --topic fatcat-qa.file-updates      ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 4 --topic fatcat-qa.container-updates +    ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 6 --topic fatcat-qa.work-ident-updates      ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 8 --topic fatcat-qa.api-crossref      ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 8 --topic fatcat-qa.api-datacite --config cleanup.policy=compact @@ -175,3 +192,5 @@ exists`; this seems safe, and the settings won't be over-ridden.      ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 12 --topic sandcrawler-qa.pdf-thumbnail-180px-jpg --config cleanup.policy=compact      ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 24 --topic sandcrawler-qa.unextracted +    ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 6 --topic scholar-qa.sim-updates +    ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 12 --topic scholar-qa.update-docs --config compression.type=gzip --config cleanup.policy=compact --config retention.ms=7889400000  | 
